Registration Dossier

Data platform availability banner - registered substances factsheets

Please be aware that this old REACH registration data factsheet is no longer maintained; it remains frozen as of 19th May 2023.

ECHA has released the new ECHA CHEM database, which now contains all REACH registration data. Further details on the transition of ECHA's published data to ECHA CHEM are available from ECHA.

Physical & Chemical properties

Partition coefficient

Administrative data

Link to relevant study record(s)

Endpoint:
partition coefficient
Type of information:
experimental study
Adequacy of study:
key study
Study period:
12 Jan 2023 - 30 Jan 2023
Reliability:
1 (reliable without restriction)
Rationale for reliability incl. deficiencies:
guideline study
Qualifier:
according to guideline
Guideline:
OECD Guideline 107 (Partition Coefficient (n-octanol / water), Shake Flask Method)
Version / remarks:
27 Jul 1995
GLP compliance:
yes (incl. QA statement)
Type of method:
flask method
Partition coefficient type:
other: n-Octanol and Buffer pH 6
Analytical method:
high-performance liquid chromatography
Key result
Type:
log Pow
Partition coefficient:
< -2.52
Temp.:
21 °C
pH:
6
Details on results:
The partition coefficient is defined as the ratio of the equilibrium concentrations of a dissolved substance in a two-phase system consisting of two largely immiscible solvents. In the case of n-octanol and water:
Pow = c n-octanol / c water
The partition coefficient is dimensionless and usually given in form of its logarithm to base ten.
A partition coefficient was calculated from the data of each run, giving six values in total. The six log Pow values determined lay within a range of ± 0.3 units.
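The calculation described above can be illustrated with a minimal sketch. The concentrations below are hypothetical and are not data from this study, and the ± 0.3 criterion is interpreted here, as an assumption, as a maximum deviation of each value from the mean of the six runs.

```python
import math

def log_pow(c_octanol, c_water):
    """Return log10 of the n-octanol/water concentration ratio (dimensionless)."""
    if c_octanol <= 0 or c_water <= 0:
        raise ValueError("both concentrations must be positive")
    return math.log10(c_octanol / c_water)

def all_within_tolerance(log_values, tolerance=0.3):
    """Assumed validity check: every value lies within +/- tolerance of the mean."""
    mean = sum(log_values) / len(log_values)
    return all(abs(v - mean) <= tolerance for v in log_values)

# Hypothetical equilibrium concentrations (same unit in both phases), six runs.
runs = [(0.50, 0.10), (0.52, 0.11), (0.48, 0.10),
        (0.51, 0.09), (0.49, 0.10), (0.50, 0.11)]

log_values = [log_pow(c_oct, c_wat) for c_oct, c_wat in runs]
print([round(v, 2) for v in log_values])
print("valid:", all_within_tolerance(log_values))
```

A result reported as "less than" a value, as in this study, typically arises when the concentration in one phase is below the analytical quantification limit; that case is not covered by the sketch.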
Conclusions:
According to this study, the logarithmic partition coefficient (n-octanol/buffer pH 6) of the test item was determined to be < -2.52 using the shake flask method at 21 °C ± 1 °C.
Executive summary:

The partition coefficient was calculated based on the results of the analytical determination following a shake flask test according to OECD Guideline 107. The selection of the buffer was recommended by the sponsor. The pH of the zwitterionic form (pH 6) was chosen for the main test, because the net charge of the molecule is considered to be 0 at this pH.
The test item content was determined via HPLC-UV. The log Pow was calculated for each run. The difference between the three test variants was less than the permitted ± 0.3 log units. Therefore, the study is considered to be valid.
According to this study, the logarithmic partition coefficient (n-octanol/buffer pH 6) of the test item was determined to be < -2.52 using the shake flask method at 21 °C ± 1 °C.

Endpoint:
partition coefficient
Type of information:
(Q)SAR
Adequacy of study:
supporting study
Reliability:
2 (reliable with restrictions)
Rationale for reliability incl. deficiencies:
results derived from a valid (Q)SAR model and falling into its applicability domain, with limited documentation / justification
Justification for type of information:
1. SOFTWARE
EPISuite 4.1

2. MODEL (incl. version number)
KOWWIN (LOGKOW) Version 1.68

3. SMILES OR OTHER IDENTIFIERS USED AS INPUT FOR THE MODEL
CC(N)C(=O)NC(Cc1ccc(O)cc1)C(=O)(O)

4. SCIENTIFIC VALIDITY OF THE (Q)SAR MODEL

- Defined endpoint: logP

- Unambiguous algorithm:
KOWWIN uses a "fragment constant" methodology to predict log P.  In a "fragment constant" method, a structure is divided into fragments (atom or larger functional groups) and coefficient values of each fragment or group are summed together to yield the log P estimate.   KOWWIN’s methodology is known as an Atom/Fragment Contribution (AFC) method.  Coefficients for individual fragments and groups were derived by multiple regression of 2447 reliably measured log P values.  KOWWIN’s "reductionist" fragment constant methodology (i.e. derivation via multiple regression) differs from the "constructionist" fragment constant methodology of Hansch and Leo (1979) that is available in the CLOGP Program (Daylight, 1995).  See the Meylan and Howard (1995) journal article for a more complete description of KOWWIN’s methodology.
To estimate log P, KOWWIN initially separates a molecule into distinct atom/fragments. In general, each non-hydrogen atom (e.g. carbon, nitrogen, oxygen, sulfur, etc.) in a structure is a "core" for a fragment; the exact fragment is determined by what is connected to the atom. Several functional groups are treated as core "atoms"; these include carbonyl (C=O), thiocarbonyl (C=S), nitro (-NO2), nitrate (-ONO2), cyano (-C≡N), and isothiocyanate (-N=C=S). Connections to each core "atom" are either general or specific; specific connections take precedence over general connections. For example, aromatic carbon, aromatic oxygen and aromatic sulfur atoms have nothing but general connections; i.e., the fragment is the same no matter what is connected to the atom. In contrast, there are 5 aromatic nitrogen fragments: (a) in a five-member ring, (b) in a six-member ring, (c) if the nitrogen is an oxide-type (e.g. pyridine oxide), (d) if the nitrogen has a fused ring location (e.g. indolizine), and (e) if the nitrogen has a +5 valence (e.g. N-methyl pyridinium iodide); since the oxide-type is most specific, it takes precedence over the other four. The aliphatic carbon atom is another example; it does not matter what is connected to -CH3, -CH2-, or -CH<, the fragment is the same; however, an aliphatic carbon with no hydrogens has two possible fragments: (a) if there are four single bonds with 3 or more carbon connections and (b) any other not meeting the first criterion.
It became apparent, for various types of structures, that log P estimates made from atom/fragment values alone could or needed to be improved by inclusion of  substructures larger or more complex than "atoms"; hence, correction factors were added to the AFC method.  The term "correction factor" is appropriate because their values are derived from the differences between the log P estimates from atoms alone and the measured log P values.  The correction factors have two main groupings: first, factors involving aromatic ring substituent positions and second,  miscellaneous factors.  In general, the correction factors are values for various steric interactions, hydrogen-bondings, and effects from polar functional substructures.  Individual correction factors were selected through a tedious process of correlating the differences (between log P estimates from atom/fragments alone and measured log P values) with common substructures.
Two separate regression analyses were performed.  The first regression related log P to atom/fragments of compounds that do not require correction factors (i.e., compounds estimated adequately by fragments alone).  The general regression equation has the following form:

 log P = Σ(f_i n_i) + b     (Equation 1)

where Σ(f_i n_i) is the summation of f_i (the coefficient for each atom/fragment) times n_i (the number of times the atom/fragment occurs in the structure) and b is the linear equation constant.  This initial regression used 1120 compounds of the 2447 compounds in the total training dataset.
The correction factors were then derived from a multiple linear regression that correlated differences between the experimental (expl) log P and the log P estimated by Equation 1 above with the correction factor descriptors.  This regression did not utilize an additional equation constant.  The equation for the second regression is:

 log P (expl) - log P (eq 1) = Σ(c_j n_j)     (Equation 2)

where Σ(c_j n_j) is the summation of c_j (the coefficient for each correction factor) times n_j (the number of times the correction factor occurs (or is applied) in the molecule).
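The two regressions combine into a single estimate: the fragment sum plus the equation constant, plus the correction-factor sum. The following is a minimal sketch of that summation only. All fragment names, coefficient values and the constant are hypothetical placeholders chosen for illustration; they are not the KOWWIN coefficients.

```python
def estimate_log_p(fragment_counts, fragment_coeffs,
                   correction_counts, correction_coeffs, constant):
    """Sum fragment contributions, the equation constant, and correction factors."""
    # Fragment term: sum of f_i * n_i, plus the constant b
    base = sum(fragment_coeffs[name] * n for name, n in fragment_counts.items())
    base += constant
    # Correction term: sum of c_j * n_j (no additional constant)
    corrections = sum(correction_coeffs[name] * n
                      for name, n in correction_counts.items())
    return base + corrections

# Hypothetical coefficients (placeholders, not KOWWIN values).
fragment_coeffs = {"CH3": 0.5, "CH2": 0.5, "OH": -1.4}
correction_coeffs = {"example_correction": 0.1}
constant_b = 0.2

# Hypothetical structure: one CH3, one CH2, one OH, one correction applied.
fragment_counts = {"CH3": 1, "CH2": 1, "OH": 1}
correction_counts = {"example_correction": 1}

result = estimate_log_p(fragment_counts, fragment_coeffs,
                        correction_counts, correction_coeffs, constant_b)
print(round(result, 2))
```

Keeping the fragment term and the correction term separate mirrors the two-stage derivation: the fragment coefficients were fitted first, and the correction coefficients were fitted to the remaining differences.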

- Defined domain of applicability:
Appendix D lists (for each fragment) the maximum number of instances of that fragment in any of the 2447 training set compounds and  10946 validation set compounds (the minimum number of instances is of course zero, since not all compounds had every fragment).  The minimum and maximum values for molecular weight are the following:
Training Set Molecular Weights:
Minimum MW:  18.02
Maximum MW:  719.92
Average MW:  199.98
 
Validation Molecular Weights:
Minimum MW:  27.03
Maximum MW:  991.15
Average MW:  258.98
Currently there is no universally accepted definition of model domain.  However, users may wish to consider the possibility that log P estimates are less accurate for compounds outside the MW range of the training set compounds, and/or that have more instances of a given fragment than the maximum for all training set compounds.  It is also possible that a compound may have a functional group(s) or other structural features not represented in the training set, and for which no fragment coefficient was developed.  These points should be taken into consideration when interpreting model results.
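The considerations above can be expressed as a simple screening check. This is only a sketch: the molecular weight range is the training set range quoted above, whereas the fragment names and the per-fragment maxima are hypothetical placeholders (the Appendix D values are not reproduced here).

```python
# Training set molecular weight range quoted in the text above.
TRAINING_MW_RANGE = (18.02, 719.92)

def domain_flags(mol_weight, fragment_counts, max_training_counts):
    """Return a list of reasons why an estimate may be less accurate."""
    flags = []
    low, high = TRAINING_MW_RANGE
    if not (low <= mol_weight <= high):
        flags.append("molecular weight outside training set range")
    for name, count in fragment_counts.items():
        if name not in max_training_counts:
            flags.append(f"fragment not represented in training set: {name}")
        elif count > max_training_counts[name]:
            flags.append(f"more instances of {name} than in any training compound")
    return flags

# Hypothetical maxima per fragment (placeholders for the Appendix D values).
max_counts = {"aliphatic_NH2": 4, "aromatic_OH": 3}

print(domain_flags(252.27, {"aliphatic_NH2": 1, "aromatic_OH": 1}, max_counts))
print(domain_flags(800.0, {"aromatic_OH": 5, "unknown_group": 1}, max_counts))
```

An empty list means none of the three conditions was triggered; it does not by itself guarantee an accurate estimate.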

- Appropriate measures of goodness-of-fit and robustness and predictivity:
To be effective an estimation method must be capable of making accurate predictions for chemicals not included in the training set.  Currently, KOWWIN has been tested on an external validation dataset of 10,946 compounds (compounds not included in the training set). The validation set includes a diverse selection of chemical structures that rigorously test the predictive accuracy of any model.  It contains many chemicals that are similar in structure to chemicals in the training set, but also many chemicals that are different from and structurally more complex than chemicals in the training set.  The average molecular weight of compounds in the validation set is 258.98 versus 199.98 for the training set. The external validation set of 10946 compounds contains 372 compounds that exceed the domain of instances of a given fragment or correction factor maximum for all training set compounds.  
The estimation accuracy for these compounds is:
Exceed Fragment Instance Domain - Accuracy Statistics:
number in dataset      = 372
correlation coef (r2)  = 0.939
standard deviation     = 0.731
absolute deviation     = 0.564
avg Molecular Weight   = 460.0

Exceed Molecular Weight Domain - Accuracy Statistics:
number in dataset      = 103
correlation coef (r2)  = 0.879
standard deviation     = 0.815
absolute deviation     = 0.619
avg Molecular Weight   = 802.16

Exceed BOTH Fragment & MW Domain - Accuracy Statistics:
number in dataset      = 75
correlation coef (r2)  = 0.879
standard deviation     = 0.905
absolute deviation     = 0.706
avg Molecular Weight   = 812.70
Appendix F lists the 75 compounds that exceed both the fragment instance domain and molecular limit domain.


5. APPLICABILITY DOMAIN
The substance has a molecular weight within the range of both the training set and the validation set. Furthermore, since the model uses the fragment constant methodology, common groups within the substance are used to estimate the partition coefficient. L-Alanyl-L-tyrosine is a dipeptide with the common core structure of an amino acid, namely the amino and the carboxy group, both of which can be found in the training and the validation set.
Further, for a similar chemical, the dipeptide L-Alanyl-L-glutamine, experimental data for the log Kow are available (log Kow -4.6). The estimation using the model results in a log Kow of -2.76 for L-Alanyl-L-glutamine, supporting the applicability and the use of the model for such substances.


6. ADEQUACY OF THE RESULT
L-Alanyl-L-tyrosine is a dipeptide with the common core structure of an amino acid, namely the amino and the carboxy group, both of which can be found in the training and the validation set. Therefore, the results obtained by the present estimation provide reliable values for classification and risk assessment.
Qualifier:
no guideline followed
Principles of method if other than guideline:
Method: other (calculated): KOWWIN (LOGKOW(c)) Program, Version 1.68, Syracuse Research Corporation, Merrill Lane, Syracuse, New York, 13210, U.S.A, 2000
GLP compliance:
no
Type of method:
calculation method (fragments)
Partition coefficient type:
octanol-water
Key result
Type:
log Pow
Partition coefficient:
-0.03
Temp.:
25 °C
Remarks on result:
other: pH was not reported
Conclusions:
Using the EPI Suite program interface with the KOWWIN (LOGKOW(c)) Program, Version 1.68, the log Pow was calculated to be -0.03. L-Alanyl-L-tyrosine is therefore considered to be hydrophilic.
Endpoint:
partition coefficient
Type of information:
(Q)SAR
Adequacy of study:
supporting study
Reliability:
2 (reliable with restrictions)
Rationale for reliability incl. deficiencies:
results derived from a valid (Q)SAR model and falling into its applicability domain, with adequate and reliable documentation / justification
Justification for type of information:
1. SOFTWARE
ACD/Labs 2019.2.0 (Build 3285. 16 Jan 2020)

2. MODEL (incl. version number)
ACD/Labs 2019.2.0 (Build 3285. 16 Jan 2020)

3. SMILES OR OTHER IDENTIFIERS USED AS INPUT FOR THE MODEL
Smiles: CC(N)C(=O)NC(Cc1ccc(O)cc1)C(O)=O

4. SCIENTIFIC VALIDITY OF THE (Q)SAR MODEL
- Defined endpoint:
log Kow (log P) – The logarithm of a ratio of concentrations of un-ionized compound between its solutions in n-octanol and water: LogKo/w

The dataset used to develop the reported model has been compiled from a great number of different sources covering a wide variety of experimental protocols used to determine the log Ko/w values reported within them. This includes the classical potentiometric log Ko/w determination methods involving phase titrations, as well as more contemporary chromatographic methods such as HPLC on standard and modified (immobilized artificial membrane (IAM) and liposome chromatography) resins, capillary electrophoresis and centrifugal partition chromatography. Since log Ko/w takes into account only the partition of neutral species, when the method involves only a single data point measurement (i.e. the log Ko/w is not determined by extrapolation from a pH dependence curve), the water phase is usually buffered to a pH at which the predominant state of the analyzed compound is neutral. For a comprehensive overview of the experimental log Ko/w measurement techniques please see [1].

log Ko/w is a relatively easily measured property. As a result the experimental data quality, which is usually inversely proportional to the complexity of the experiment, is reasonably good. Independent external studies show that the error between the logKo/w measurements performed by different laboratories using the same protocol (reproducibility) can be expected to be within 0.5 logarithmic units [2].

- Unambiguous algorithm:
Linear fragmental QSAR model
Summation of additive increments of the following types:
(i) carbon atoms that are not doubly or triply connected to any heteroatom (so called Isolating Carbons, or ICs);
(ii) functional groups (FGs) obtained after removing all ICs from a given molecule;
(iii) inter-fragmental interactions (FIs) between any pair of FGs separated by up to four isolating atoms (larger atom chains are considered if direct conjugation between the interacting groups can occur, e.g. in naphthalene and larger aromatic polycycles).

log Ko/w = SUM[1..n](ai*f(IC)i) + SUM[1..n](bi*f(FG)i) + SUM[1..n](ci*f(FI)i) +intercept

where f(IC)i, f(FG)i, and f(FI)i are the occurrence counts of the i-th isolating carbon, functional group, and interaction, respectively, and ai, bi, and ci are the corresponding statistical coefficients.

Increments of ICs and FGs depend on the type of constituent atoms (including branching and cyclization), whereas increments of FIs depend on the interacting groups and the separating atoms (type and number) in between. If new fragments are involved (that were not present in the training set), their increments are estimated through summation of the "polar atom increments" (that are different from "conventional" IC and FG increments due to internal conjugation or alpha-effect). Missing interaction increments (FIs) involving new functional groups come from either bi-directional Hammett-type equations (for aliphatic interactions) or "generalized atom chains" (for aromatic interactions), the latter assuming that the effect depends on the first atom of a given "missing group" (that is directly attached to the Isolating Carbon through which the inter-fragmental interaction takes place). New FG and FI increments can also be supplied through the "algorithm training" feature, whereby the software automatically identifies all missing fragments and interactions (and estimates their increments) based on the user's own data. In addition, special algorithms were applied for calculating +/- uncertainty errors that depend on the increments (and compound classes from the training set) used in the calculation (see description of Applicability Domain).

Descriptors in the model:
Fragmental descriptors, dimensionless (occurrence counts): a combination of atomic, fragmental and inter-fragmental descriptors (see Explicit Algorithm).

Descriptor selection:
All increments were derived using the "constructionist" approach, which analyses the available compounds step by step in order of increasing structural complexity. It starts with simple hydrocarbons (yielding parameters of various Isolating Carbons), then mono-functional compounds (yielding parameters of functional groups), then bi-functional compounds (with clear interactions between two functional groups, yielding parameters of pair-wise inter-fragmental interactions), and so on. After each step the obtained parameters are generalized (to cover the maximum extent of structural diversity) and fixed for the next step of analysis (to minimize ambiguity of the physicochemical meaning of new parameters). The exact procedure is described in [3], whereas the resulting efficiency is evaluated in [4] and [5].

Algorithm and descriptor generation:
Algorithm involves (i) splitting structure into Isolating Carbons, functional groups and pairs of interacting groups, (ii) estimating all increments (from pre-defined tables and/or secondary algorithms, as described above), (iii) summing all increments to obtain "global" log Ko/w value, and (iv) estimating the uncertainty error that depends on the least certain increments used in above calculations. All descriptors are generated using the "IC-method" as described in [3,4,6].

Chemicals/Descriptors ratio:
Number of compounds in the training set: 3,600. Number of parameters in the most accurate algorithm (+/-0.3 or better): 2,500; in the less accurate algorithm (+/-0.5 or worse, when inter-fragmental interactions are estimated by the secondary algorithms): 500. The large number of parameters is retained for the sake of physico-chemical clarity of inter-fragmental interactions (which assigns all correction factors to the classical types of inductive, resonance and H-bonding interactions). This number could easily be reduced without a substantial loss of prediction accuracy (as discussed in [4-6]), yet this would diminish the physicochemical clarity of each parameter and make the algorithm no different from many other "purely statistical calculations".

- Defined domain of applicability:
ACD/Log P is applicable to all types of compounds with molecular weight up to 1,000 Daltons, log Ko/w between -2 and +12, and not bearing any heavy metals that may form coordinating bonds.

Method used to assess the applicability domain:
Depends on the types and numbers of functional groups and interactions between groups, each of which is assigned with particular uncertainty value. Very roughly, the following levels of parameter uncertainty are used:
(i) +/- 0.3 ("very accurate"), when all parameters came from a stepwise analysis of a large compound class with consistent data, e.g. any types of hydrocarbons and most types of mono-functional compounds with common polar groups, such as -OH or -CONH2;
(ii) +/- 0.5 ("moderately accurate"), when some parameters came from a smaller set of compounds and/or data of weaker consistency, e.g. involving less-common groups such as -NHC(=O)NHC(=S)-, or interacting pairs of groups;
(iii) +/- 1.0 ("poorly accurate"), when some parameters were estimated by "secondary algorithms" (i.e., query structure involved new fragments that were not present in the training set).

On a more delicate level, various combinations of different parameter uncertainty are considered (as described in [3]). All error calculations are based on empirical comparisons of class-specific predictions to experimental data that was not used in algorithm development.

- Appropriate measures of goodness-of-fit and robustness and predictivity:
No data available

- Mechanistic interpretation:
Additivity-constitutivity considerations based on bi-directional Hammett-type equations [3, 4].

[1] Avdeef, A., Absorption and Drug Development: Solubility, Permeability, and Charge State, John Wiley & Sons, Inc., Hoboken, NJ, 2003.
[2] Kishi, H. and Hashimoto, Y., Evaluation of the procedures for the measurement of water solubility and n-octanol/water partition coefficient of chemicals: results of a ring test in Japan, Chemosphere, 1989, 18, 1749-1759.
[3] Petrauskas, A., Kolovanov, E., ACD/Log P Method Description. Persp. In Drug Design, 2000, 19, 1-19.
[4] Japertas, P., Didziapetris, R., Petrauskas, A., Fragmental Methods in the Design of New Compounds. Applications of Advanced Algorithm Builder. Quant. Struct.-Act. Relat., 2002, 21, 23-37.
[5] Mannhold, R., Petrauskas, A., Substructure versus Whole Molecule Approaches for calculating Log P. QSAR Combi. Sci., 2003, 22, 466-475.
[6] Japertas, P., Didziapetris, R., Petrauskas, A., Fragmental Methods in the Analysis of Biological Activities of Diverse Compound Sets. Mini. Rev. Med. Chem., 2003, 3, 797-808.

5. APPLICABILITY DOMAIN
The substance has a molecular weight < 1000 and the predicted log Kow is between -2 and +12; it therefore fits in the applicability domain. Further, the prediction is classified as moderately accurate (+/- 0.51).

6. ADEQUACY OF THE RESULT
The substance fits in the applicability domain of the model. The prediction is valid and can be used for classification and risk assessment.

Principles of method if other than guideline:
- Justification of QSAR prediction: see field 'Justification for type of information'
GLP compliance:
no
Type of method:
calculation method (fragments)
Partition coefficient type:
octanol-water
Type:
log Pow
Partition coefficient:
-0.05
Remarks on result:
other: QSAR
Remarks:
Calculated LogP: -0.05 +/- 0.51
Conclusions:
In this study report, the partition coefficient of L-Alanyl-L-tyrosine (CAS 3061-88-9) was estimated using the classic logP module of ACD / Percepta 19.2.0. The logP of L-Alanyl-L-tyrosine is considered to be -0.05.
Endpoint:
partition coefficient
Type of information:
(Q)SAR
Adequacy of study:
supporting study
Reliability:
2 (reliable with restrictions)
Rationale for reliability incl. deficiencies:
results derived from a valid (Q)SAR model and falling into its applicability domain, with adequate and reliable documentation / justification
Justification for type of information:
1. SOFTWARE
ACD/Labs 2019.2.0 (Build 3285. 16 Jan 2020)

2. MODEL (incl. version number)
ACD/Labs 2019.2.0 (Build 3285. 16 Jan 2020)

3. SMILES OR OTHER IDENTIFIERS USED AS INPUT FOR THE MODEL
Smiles: CC(N)C(=O)NC(Cc1ccc(O)cc1)C(O)=O

4. SCIENTIFIC VALIDITY OF THE (Q)SAR MODEL
- Defined endpoint:
log Kow (log P) – The logarithm of a ratio of concentrations of un-ionized compound between its solutions in n-octanol and water: LogKo/w
The dataset used to develop the reported model has been compiled from a great number of different sources covering a wide variety of experimental protocols used to determine the log Ko/w values reported within them. This includes the classical potentiometric log Ko/w determination methods involving phase titrations, as well as more contemporary chromatographic methods such as HPLC on standard and modified (immobilized artificial membrane (IAM) and liposome chromatography) resins, capillary electrophoresis and centrifugal partition chromatography. Since log Ko/w takes into account only the partition of neutral species, when the method involves only a single data point measurement (i.e. the log Ko/w is not determined by extrapolation from a pH dependence curve), the water phase is usually buffered to a pH at which the predominant state of the analyzed compound is neutral. For a comprehensive overview of the experimental log Ko/w measurement techniques please see [1].
log Ko/w is a relatively easily measured property. As a result the experimental data quality, which is usually inversely proportional to the complexity of the experiment, is reasonably good. Independent external studies show that the error between the logKo/w measurements performed by different laboratories using the same protocol (reproducibility) can be expected to be within 0.5 logarithmic units [2].

Experimental data from various sources have been used. The characteristics of the entire dataset compiled for the task of this model development is:
No. of compounds = 16277
Min. Value = -5.08
Max. Value = 11.29
Std. Dev. = 1.92
Skewness = 0.22

- Unambiguous algorithm:
Global linear baseline QSAR + local similarity-based corrections.
The global QSAR was developed using PLS in combination with a bootstrapping technique. This method implies random compound sampling from the initial training set, i.e. generation of new “training sub-sets”.

Each of the sampled sub-sets is of the same size as the initial training set, however, random manner of their population results in some compounds being selected more than once, others being omitted. This procedure is performed 100 times and an independent PLS model is derived for every sub-set.
Each of those PLS models is based on 2D fragmental descriptors:

log Ko/w = SUM[i=1..n](ai*fi)+ c

where fi is the number of occurrences of the i-th fragment in a molecule, ai is its statistical coefficient, and c is the intercept.

As a result, each global QSAR model actually represents an ensemble of 100 PLS models, providing each compound with a vector of 100 log Ko/w predictions, each based on a slightly different sub-set of the initial training set. Two compounds with similar trends in the variation patterns of the 100-value vectors predicted by a global QSAR model are defined as similar in terms of the analyzed property, i.e. the differences in the compound sets used to parameterize each of the 100 PLS models constituting a baseline model affect estimations for the two compounds in a similar way. The correlation coefficient of the two vectors is called the Individual Similarity Index between two compounds (SIi). An analogous definition of the “property-specific” or dynamic similarity was first used by Tetko and his co-workers [3-7], and this method has recently been used in the analysis of acute toxicity data [8].

With the available robust similarity measure, it becomes possible to analyse the performance of the baseline QSAR model in the local chemical environment of a query molecule represented by the most similar compounds in the training set. In case any systematic errors are encountered for sufficiently similar compounds, a local correction (Δ) is calculated.
Later on it is possible to train the model quickly and efficiently using new experimental data by just adding it to this second similarity correction calculation procedure, without the time costly baseline model re-training.
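The bootstrap sampling and the Individual Similarity Index described above can be sketched as follows. This is an illustration only: the PLS fitting step is omitted, the prediction vectors are hypothetical and shortened to five entries instead of 100, and the use of the Pearson coefficient is an assumption, since the text says only "correlation coefficient".

```python
import math
import random

def bootstrap_subset(training_set, rng):
    """Sample with replacement; the sub-set has the same size as the original."""
    return rng.choices(training_set, k=len(training_set))

def pearson(x, y):
    """Pearson correlation coefficient between two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the example is repeatable
training_ids = list(range(10))
subset = bootstrap_subset(training_ids, rng)
print(len(subset), "sampled;", len(set(subset)), "distinct compounds")

# Hypothetical prediction vectors (one entry per ensemble member).
pred_a = [1.10, 1.25, 0.95, 1.30, 1.05]
pred_b = [2.05, 2.22, 1.90, 2.28, 2.01]   # varies in step with pred_a
pred_c = [0.50, 0.41, 0.62, 0.39, 0.55]   # varies against pred_a

print("SIi(a, b) =", round(pearson(pred_a, pred_b), 3))
print("SIi(a, c) =", round(pearson(pred_a, pred_c), 3))
```

Two compounds can have very different predicted values and still score as similar, because the index compares how the predictions vary across the ensemble rather than their magnitudes.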
Descriptors in the model:
Fragmental descriptors, dimensionless (occurrence counts): a fixed set of fragmental descriptors, based on the expanded list of Platt's type fragments (see [9]). A fixed and relatively small set of fragments was used due to the specifics of the employed modeling methodology. In order for the correlation between two compound vectors of log Ko/w predictions coming from a baseline QSAR model to be representative of compound similarity in terms of the analyzed property, these vectors have to be parameterized using exactly the same set of fragmental descriptors. This prevents the use of any sort of automated fragmentation routine (atom based, isolating carbon based, chain based, etc.) that results in a dynamic set of fragments depending on the training set structures. Such routines leave the possibility that, for a query structure from outside the training set, the same rules will yield new fragments not encountered in the training set molecules, which is not compatible with the main condition just mentioned. On the other hand, it is equally important for the model to be able to identify any new structural features of a query molecule that were not present in the training set compounds. The fixed fragment set therefore cannot be constructed based on an analysis of the training set, or of any other finite set of molecules, because any new structural features not present in that database would then be ignored. As a result, the fragmental descriptor set is based on general knowledge and considerations regarding all possible chemical structures rather than a finite dataset, and includes all the fragments, even those that are not detected in the training set molecules at all.

Descriptor selection:
The last fact mentioned in Section 4.3 also excludes the possibility to employ any of the usual descriptor selection techniques relying on the generation of a large initial pool of various descriptors and its subsequent reduction during the statistical analysis (exclusion of statistically insignificant, intercorrelated variables, etc.). Such an analysis by definition would have to be based on a certain dataset, and would not allow having “blank” fragments in the final variable set.

Algorithm and descriptor generation:
The generation of the descriptor matrix following the outlined approach consisted of counting the occurrences of each of the pre-defined fragments in the training set molecules. This procedure, as well as all the subsequent statistical analysis, was performed using Algorithm Builder 1.8 software.

Software name and version for descriptor generation:
Algorithm Builder 1.8
ACD/Labs, Inc. 110 Yonge Street, 14th floor, Toronto, Ontario, Canada M5C 1T4.
http://www.acdlabs.com

Chemicals/Descriptors ratio:
30.2 (11387 chemicals in the training set, 377 descriptors)

- Defined domain of applicability:
Applicability domain of the model is defined based on the training set compounds. This procedure takes into account the following two aspects:
* Similarity of the tested compound to the training set. No reliable predictions can be made if we have no similar compounds in the training set;
* Consistency of the experimental values with regard to the baseline model for similar compounds. Even if similar compounds are present in the dataset, the quality of prediction could be lower if their data cannot be reproduced by the baseline model. Whatever the reason for this inconsistency (experimental variability or a sudden change in mechanism of action because of slight structural changes), it indicates possible problems when trying to give accurate predictions.

Method used to assess the applicability domain:
The two aspects mentioned above receive their quantitative assessment in terms of the Similarity Index (SI) and the Data-Model Consistency Index (DMCI). The SI, evaluating how distant the query structure is from the whole training set, is calculated by weighted averaging of the individual Similarity Indices (SIi) for the test molecule and each of the 5 most similar compounds from the training set. DMCI is calculated by comparing the differences between experimental and global QSAR predicted values for the 5 most similar compounds with the suggested similarity correction value (Δ) for the test compound, which is calculated by averaging these differences. The more the individual differences are scattered around the calculated average (Δ), the more inconsistent the data for the similar compounds are with regard to the global QSAR model.
The final prediction Reliability Index is calculated as a product of the aforementioned two indices:
RI = SI * DMCI
Both SI and DMCI are scaled to vary from 0 to 1, so the resulting RI also varies in this range. Lower values suggest that a compound lies further from the model applicability domain and that the prediction is less reliable (a low SI or a low DMCI, alone or in combination, can be the reason). Conversely, high RI values indicate increasing confidence in the quality of the prediction (both SI and DMCI have to be high to yield such a result).
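
The following sketch shows how the three quantities could fit together. The weighting scheme for SI and the scaling of DMCI are not given in this record, so both are assumptions made for illustration only, and the neighbour values are hypothetical.

```python
from statistics import mean, pstdev

def similarity_index(similarities):
    """Weighted average of the individual similarities of the nearest neighbours.

    ASSUMPTION: each neighbour is weighted by its own similarity; the actual
    weighting scheme is not stated in the source.
    """
    total = sum(similarities)
    if total == 0:
        return 0.0
    return sum(s * s for s in similarities) / total

def correction_and_dmci(experimental, baseline):
    """Return the similarity correction (delta) and a consistency index in (0, 1].

    ASSUMPTION: DMCI = 1 / (1 + spread), where spread is the standard deviation
    of the individual differences around delta; the real scaling is not given.
    """
    diffs = [e - b for e, b in zip(experimental, baseline)]
    delta = mean(diffs)
    spread = pstdev(diffs)
    return delta, 1.0 / (1.0 + spread)

# HYPOTHETICAL values for the 5 most similar training set compounds.
neighbour_similarity = [0.9, 0.8, 0.8, 0.7, 0.6]
neighbour_experimental = [-2.5, -2.1, -1.9, -2.8, -2.2]
neighbour_baseline = [-0.1, 0.2, 0.4, -0.3, 0.1]

si = similarity_index(neighbour_similarity)
delta, dmci = correction_and_dmci(neighbour_experimental, neighbour_baseline)
ri = si * dmci  # RI = SI * DMCI, as defined above
```

With these assumptions, perfectly consistent neighbours (zero scatter around Δ) give DMCI = 1, so RI reduces to SI; any scatter lowers RI below SI.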

Limits of applicability:
Predictions with a Reliability Index < 0.3 are considered to be outside the applicability domain of the model.

- Appropriate measures of goodness-of-fit and robustness and predictivity:
The statistics of the training set data:
No. of compounds = 11387
Min. Value = -5.08
Max. Value = 11.29
Std. Dev. = 1.94
Skewness = 0.25

Statistics provided for the fraction of the training set that falls within the applicability domain of the model (RI > 0.3 - see Section 5.4)
NRI>0.3 = 11371 (i.e. 99.9% of the training set compounds)
R2 = 0.944
Std. Dev. = 0.457
RMSE = 0.457
F = 402696.2 (Fisher's F-statistics)

The statistics of the validation set data:
No. of compounds = 4890
Min. Value = -4.64
Max. Value = 10.89
Std. Dev. = 1.90
Skewness = 0.16

Random splitting of the initial dataset into the training and validation sets using the ratio 70%:30%.
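
A random 70%:30% split can be sketched as below. The function name, the seed, and the use of integer identifiers are assumptions made for illustration; the total of 16277 is simply the sum of the reported training (11387) and validation (4890) set sizes, assuming the two sets together make up the initial dataset.

```python
import random

def split_dataset(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of the items and cut it into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Compound identifiers stand in for the actual structures (hypothetical).
train, validation = split_dataset(range(16277))
```

An exact 70% cut of 16277 compounds gives 11394 training compounds, slightly more than the 11387 reported, so the split used for the model was evidently only approximately 70%:30%.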

Statistics provided for the fraction of the validation set that falls within the applicability domain of the model (RI > 0.3 - see Section 5.4)
NRI>0.3 = 4872 (i.e. 99.6% of all the validation set compounds)
R2 = 0.940
Std. Dev. = 0.464
RMSE = 0.464
F = 165247.5 (Fisher's F-statistics)

Analysis of the subsets of the higher quality results:
NRI>0.5 = 4772 (i.e. 97.6% of all the validation set compounds)
R2 = 0.945
Std. Dev. = 0.444
RMSE = 0.444
F = 177716.6 (Fisher's F-statistics)
NRI>0.75 = 3345 (i.e. 68.4% of all the validation set compounds)
R2 = 0.964
Std. Dev. = 0.360
RMSE = 0.360
F = 197041.9 (Fisher's F-statistics)
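
The subset analysis above amounts to filtering compounds by their Reliability Index and recomputing the statistics on what remains. A minimal sketch follows; the records are hypothetical, and R2 is assumed here to be the usual coefficient of determination, which the source does not define explicitly.

```python
from math import sqrt

def r2_and_rmse(observed, predicted):
    """Coefficient of determination and root-mean-square error."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot, sqrt(ss_res / n)

def stats_above(records, threshold):
    """records: (observed, predicted, RI) tuples; keep those with RI > threshold."""
    kept = [(o, p) for o, p, ri in records if ri > threshold]
    observed, predicted = zip(*kept)
    r2, rmse = r2_and_rmse(observed, predicted)
    return len(kept), r2, rmse

# HYPOTHETICAL validation records: (experimental logP, predicted logP, RI).
records = [(1.0, 1.0, 0.9), (2.0, 2.0, 0.8), (3.0, 3.0, 0.6), (0.0, 2.0, 0.2)]

n_kept, r2, rmse = stats_above(records, 0.3)
```

In this toy data the only poorly predicted compound also has the lowest RI, so raising the threshold removes it and improves both statistics, which mirrors the trend reported for the RI > 0.5 and RI > 0.75 subsets.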

- Mechanistic interpretation:
Mechanistic basis of the model:
The only mechanistic consideration utilized in model building is the use of a linear regression method (PLS) and fragmental descriptors. In other words, it is assumed that the final predicted value is a linear combination of the contributions of all structural moieties making up the test molecule. Although very basic, this consideration is one of the most fundamental: even the name of (Q)SAR methods implies that the main determinant of all the properties of a compound is its structure. Fragments are arguably the most direct, first-hand descriptors of a chemical structure.

A priori or a posteriori mechanistic interpretation:
A posteriori model interpretation results are consistent with generally understood mechanistic factors, scientific interpretations and well-documented experimental facts. For example, the ten fragmental descriptors with the most negative coefficients are the following:
Any positive permanent charge = -2.436
Quaternary ammonium = -1.612
Permanent charge on aromatic N, O, S, Se = -1.317
Sulfonic acid = -1.125
alpha-Amino acid = -0.965
N-oxide = -0.674
tertiary amine (>N-) = -0.673
=S< = -0.670
Any phosphorus atom = -0.573
Lactone = -0.404
Some of these fragments are well known for increasing the hydrophilicity of a compound. One more classical example of such a water-phase-favorable group, the hydroxy fragment, follows this top ten almost immediately with a statistical coefficient of -0.400.
Among the groups with the largest positive coefficients, the large majority can clearly be expected to increase the hydrophobic properties of a compound, e.g.:
Bicyclo [3.1.1] scaffold = 1.103
Spiro [5.2] scaffold = 1.066
Any Si atom = 0.714
Spiro [6.6] = 0.678
Spiro [6.5] = 0.644
Fused 6:5:5 scaffold = 0.614
Steric hindrance in the form of two bulky branched aliphatic substituents in both ortho- positions of a phenolic group = 0.460
n-Pentyl chain = 0.452
n-Heptyl chain = 0.442
Aromatic sulphur = 0.419
Note: the average of all 377 statistical coefficients is 0.018
All the fragments encoding strong intramolecular H-bonding in the aromatic system (e.g., ortho-keto, ortho-thioketo, ortho-nitro, or ortho-halogenated phenols and anilines; 6 descriptors in total) have positive coefficients, which is in agreement with the known fact that intramolecular H-bonding reduces hydrophilicity.
The coefficients of these 6 fragments range from +0.005 to +0.455, with an average of +0.15.
Further similar examples can be established as well.
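
The linear fragment-contribution assumption can be made concrete with a short sketch. The three coefficients are taken from the lists above; the intercept and the example fragment counts are hypothetical, since the record does not report the intercept or the full set of 377 coefficients, so the result is not a real model prediction.

```python
# Coefficients quoted in the lists above. The intercept is a HYPOTHETICAL
# placeholder because the source does not report it.
COEFFICIENTS = {
    "alpha-Amino acid": -0.965,
    "tertiary amine (>N-)": -0.673,
    "n-Pentyl chain": 0.452,
}
INTERCEPT = 0.0

def baseline_log_p(fragment_counts, coefficients=COEFFICIENTS, intercept=INTERCEPT):
    """Sum of coefficient * occurrence count over all fragments, plus intercept."""
    return intercept + sum(coefficients[f] * n for f, n in fragment_counts.items())

# HYPOTHETICAL molecule with one tertiary amine and two n-pentyl chains.
value = baseline_log_p({"tertiary amine (>N-)": 1, "n-Pentyl chain": 2})
```

Here the hydrophilic amine contribution (-0.673) is outweighed by the two hydrophobic chains (2 x 0.452), giving a net contribution of 0.231.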

[1] Avdeef, A., Absorption and Drug Development: Solubility, Permeability, and Charge State, John Wiley & Sons, Inc., Hoboken, NJ, 2003.
[2] Kishi, H. and Hashimoto, Y., Evaluation of the procedures for the measurement of water solubility and n-octanol/water partition coefficient of chemicals results of a ring test in Japan, Chemosphere, 1989, 18, 1749-1759.
[3]I.V. Tetko, Neural network studies. 4. Introduction to associative neural networks, J. Chem. Inf. Comput. Sci. 2002, 42, 717-728.
[4]I.V. Tetko and P. Bruneau, Application of ALOGPS to predict 1-octanol/water distribution coefficients, logP, and logD, of AstraZeneca inhouse database, J. Pharm. Sci. 2004, 93, 3103-3110.
[5]I.V. Tetko and V.Y. Tanchuk, Application of associative neural networks for prediction of lipophilicity in ALOGPS 2.1 program, J. Chem. Inf. Comput. Sci. 2002, 42, 1136-1145.
[6]H. Zhu, A. Tropsha, D. Fourches, A. Varnek, E. Papa, P. Gramatica, T. Oberg, P. Dao, A. Cherkasov, and I.V. Tetko, Combinatorial QSAR modeling of chemical toxicants tested against Tetrahymena pyriformis, J. Chem. Inf. Model. 2008, 48, 766-784.
[7]I.V. Tetko, I. Sushko, A.K. Pandey, H. Zhu, A. Tropsha, E. Papa, T. Oberg, R. Todeschini, D. Fourches, and A. Varnek, Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection, J. Chem. Inf. Model. 2008, 48, 1733-1746.
[8]Sazonovas, A., Japertas, P., and Didziapetris, R., Estimation of reliability of predictions and model applicability domain evaluation in the analysis of acute toxicity (LD50), SAR QSAR Environ. Res. 2010, 21, 127-148.
[9]J.A. Platts, D. Butina, M.H. Abraham, and A. Hersey, Estimation of molecular linear free energy relation descriptors using a group contribution approach, J. Chem. Inf. Comput. Sci. 1999, 39, 835-845.

5. APPLICABILITY DOMAIN
The Reliability Index for the prediction is high (RI = 0.78), indicating that the substance is within the applicability domain.

6. ADEQUACY OF THE RESULT
The substance fits in the applicability domain of the model. The prediction is valid and can be used for classification and risk assessment.
Qualifier:
no guideline followed
Principles of method if other than guideline:
- Software tool(s) used including version: ACD/Labs 2019.2.0 (Build 3285. 16 Jan 2020)
- Model(s) used: ACD/LogP GALAS 
- Model description: see field 'Justification for type of information'
- Justification of QSAR prediction: see field 'Justification for type of information'
GLP compliance:
no
Type of method:
other: QSAR estimation
Partition coefficient type:
octanol-water
Key result
Type:
log Pow
Partition coefficient:
-0.31
Temp.:
25 °C
Remarks on result:
other: QSAR estimation; pH not reported; Reliability index = 0.78 (moderate)
Details on results:
Experimental results for the most similar structure reported by the program:
Val-Tyr (CAS 3061-91-4); LogP (used in model): -0.02; Similarity: 0.87; experimental LogP: -2.52 (pH 7)
Miki Akamatsu, Yohji Yoshida, Hideaki Nakamura, Masaaki Asao, Hajime Iwamura, Toshio Fujita. Hydrophobicity of Di‐ and Tripeptides Having Unionizable Side Chains and Correlation with Substituent and Structural Parameters. Quant. Struct.-Act. Relat. 8 (1989) 195-203.
Conclusions:
In this study, the partition coefficient of L-Alanyl-L-tyrosine (CAS 3061-88-9) was estimated using the GALAS logP model of the program ACD/Percepta 19.2.0. The logP of L-Alanyl-L-tyrosine is considered to be -0.31.

Description of key information

The logarithmic partition coefficient (n-octanol/buffer pH 6) of the test item was determined to be < -2.52 using the shake flask method at 21 °C ± 1 °C.

Key value for chemical safety assessment

Additional information

Three QSAR estimations were performed using three different models. L-Alanyl-L-tyrosine and its structural fragments are within the applicability domain of each of the models used.


An OECD 107 study of the partition coefficient of L-Alanyl-L-tyrosine dihydrate was performed. The data from this study are used as the key value for the chemical safety assessment.