Decision tree

Series 2.1.1 - data collection and preprocessing

Machine learning projects
Tree models
Data preprocessing
Pandas
ChEMBL database
Python
Author

Jennifer HY Lin

Published

September 19, 2023

Series overview
  • Post 1 (this post) - data collection from ChEMBL database using web resource client in Python, with initial data preprocessing

  • Post 2 - more data preprocessing and transformation to reach the final dataset prior to model building

  • Post 3 - estimating experimental errors and building decision tree model using scikit-learn


Introduction

I’ve now reached a stage where I’d like to do more machine learning (ML) work, after reading a few peer-reviewed papers about ML and drug discovery. It seemed that traditional ML methods were still indispensable performance-wise, and when used in combination with deep learning neural networks, they tended to increase prediction accuracy further. I haven’t yet ventured into the practicality and usefulness of large language models in drug discovery, although I’m aware that work in this area has started. Comments from experienced seniors did mention that these models are still very novel and therefore may not be as useful yet, but given how quickly things evolve in the so-called “AI” field, this may change very soon. From what I can imagine, molecular representations in texts or strings are also not quite the same as natural human language texts, since there are many chemistry-specific features to consider, e.g. chirality and aromaticity. Because of this, I’m sticking with learning to walk first by covering conventional ML methods more thoroughly, before trying to run in the deep learning zone.

This leads to this series of posts (3 in total) about decision trees. Previously, I’ve only lightly touched on a commonly used classifier algorithm, logistic regression, as the first series in the ML realm. Reflecting back, I think I could’ve done a more thorough job during the data preparation stage, so I’m attempting that this time. The data preparation used here was carried out with strong reference to the materials and methods section of this paper (Tilborg, Alenicheva, and Grisoni 2022), which was one of the papers I’ve read. There are probably other useful methods out there, but this paper made sense to me, so I’ve adopted a few of their ways of doing things during data preprocessing.


Data retrieval

This time I decided to try something new, which was to use the ChEMBL web resource client to collect data (i.e. not by direct file downloads from the ChEMBL website; another useful way could be through SQL queries, which is also on my list to try later). I found a great online resource about fetching data this way: the TeachOpenCADD talktorial on compound data acquisition. The data retrieval workflow used below was mainly adapted from this talktorial, with a few changes to suit the selected dataset and ML model.

The web resource client is supported by the ChEMBL group and is based on a Django QuerySet interface. Their GitHub repository explains a bit more about it; in particular, the Jupyter notebook linked in the repository helps a lot with how to write code to search for specific data.

To do this, a few libraries needed to be loaded first.

# Import libraries
# Fetch data through ChEMBL web resource client
from chembl_webresource_client.new_client import new_client

# Dataframe library
import pandas as pd

# Progress bar
from tqdm import tqdm

To see what types of data were provided by the ChEMBL web resource client, run the following code and refer to the ChEMBL documentation to find out what data were embedded inside the different data categories. Sometimes it might not be that straightforward and some digging would be required (I actually went back to this step to find the “data_validity_comment” field when I was trying to do some compound sanitisation).

Note

The link provided above also talked about other useful techniques for data checks in the ChEMBL database - a very important step to do during data preprocessing, which was also something I was trying to cover and achieve as much as possible in this post.

available_resources = [resource for resource in dir(new_client) if not resource.startswith('_')]
print(available_resources)
['activity', 'activity_supplementary_data_by_activity', 'assay', 'assay_class', 'atc_class', 'binding_site', 'biotherapeutic', 'cell_line', 'chembl_id_lookup', 'compound_record', 'compound_structural_alert', 'description', 'document', 'document_similarity', 'drug', 'drug_indication', 'drug_warning', 'go_slim', 'image', 'mechanism', 'metabolism', 'molecule', 'molecule_form', 'official', 'organism', 'protein_classification', 'similarity', 'source', 'substructure', 'target', 'target_component', 'target_relation', 'tissue', 'xref_source']

Resource objects were created to enable API access as suggested by the talktorial.

# for targets (proteins)
targets_api = new_client.target

# for bioactivities
bioact_api = new_client.activity

# for assays
assay_api = new_client.assay

# for compounds
cpd_api = new_client.molecule

Checked object type for one of these API objects (e.g. bioactivity API object).

type(bioact_api)
chembl_webresource_client.query_set.QuerySet


Fetching target data

A protein target, acetylcholinesterase, was chosen at random, and UniProt was used to look up its UniProt ID.

# Specify Uniprot ID for acetylcholinesterase
uniprot_id = "P22303"

# Get info from ChEMBL about this protein target, 
# with selected features only
targets = targets_api.get(target_components__accession = uniprot_id).only(
    "target_chembl_id",
    "organism", 
    "pref_name", 
    "target_type"
)

The query results were stored in a “targets” object, which was a QuerySet with lazy evaluation, meaning the data would only be fetched when there was a request for it. Therefore, to see the results, the “targets” object was read into a Pandas DataFrame.
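To make the idea of lazy evaluation a bit more concrete, below is a minimal sketch using a small made-up class. `LazyQuery` and `fake_fetch` are hypothetical names for illustration only and are not part of the ChEMBL web resource client; the point is simply that nothing is fetched until the object is iterated over or measured.

```python
class LazyQuery:
    """Hypothetical stand-in for a lazily evaluated query (not the ChEMBL client)."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable that performs the (expensive) request
        self._cache = None    # results are stored after the first evaluation
        self.calls = 0        # counts how often the data source was hit

    def _evaluate(self):
        # Only run the fetch the first time results are actually needed
        if self._cache is None:
            self.calls += 1
            self._cache = list(self._fetch())
        return self._cache

    def __iter__(self):
        return iter(self._evaluate())

    def __len__(self):
        return len(self._evaluate())


def fake_fetch():
    # Pretend this is a network request returning target records
    return [{"target_chembl_id": "CHEMBL220", "target_type": "SINGLE PROTEIN"}]


query = LazyQuery(fake_fetch)
calls_before = query.calls   # building the query has not fetched anything yet
records = list(query)        # iterating triggers the fetch
calls_after = query.calls    # the cached results are reused from here on
```

In this toy version, building the query costs nothing, and the data source is only hit once when the results are first read.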

# Read "targets" with Pandas
targets = pd.DataFrame.from_records(targets)
targets
       organism                     pref_name target_chembl_id        target_type
0  Homo sapiens          Acetylcholinesterase        CHEMBL220     SINGLE PROTEIN
1  Homo sapiens          Acetylcholinesterase        CHEMBL220     SINGLE PROTEIN
2  Homo sapiens  Cholinesterases; ACHE & BCHE    CHEMBL2095233  SELECTIVITY GROUP

Selected the first protein target from this dataframe.

# Save the first protein in the dataframe
select_target = targets.iloc[0]
select_target
organism                    Homo sapiens
pref_name           Acetylcholinesterase
target_chembl_id               CHEMBL220
target_type               SINGLE PROTEIN
Name: 0, dtype: object

Then saved the selected ChEMBL ID for the first protein (to be used later).

chembl_id = select_target.target_chembl_id
# Check it's saved
print(chembl_id)
CHEMBL220


Fetching bioactivity data

Obtaining bioactivity data for the selected target.

bioact = bioact_api.filter(
    # Use the previously saved target ChEMBL ID
    target_chembl_id = chembl_id, 
    # Selecting for Ki
    standard_type = "Ki",
    # Requesting exact measurements
    relation = "=",
    # Binding data as "B"
    assay_type = "B",
).only(
    "activity_id",
    "data_validity_comment",
    "assay_chembl_id",
    "assay_description",
    "assay_type",
    "molecule_chembl_id",
    "standard_units",
    "standard_type",
    "relation",
    "standard_value",
    "target_chembl_id",
    "target_organism"
)

# Check the length and type of bioactivities object
print(len(bioact), type(bioact))
706 <class 'chembl_webresource_client.query_set.QuerySet'>

To have a quick look at the data held inside each entry of the bioactivity dataset, the first entry was printed as an example.

print(len(bioact[0]), type(bioact[0]))
bioact[0]
15 <class 'dict'>
{'activity_id': 111024,
 'assay_chembl_id': 'CHEMBL641011',
 'assay_description': 'Inhibition constant determined against Acetylcholinesterase (AChE) receptor.',
 'assay_type': 'B',
 'data_validity_comment': 'Potential transcription error',
 'molecule_chembl_id': 'CHEMBL11805',
 'relation': '=',
 'standard_type': 'Ki',
 'standard_units': 'nM',
 'standard_value': '0.104',
 'target_chembl_id': 'CHEMBL220',
 'target_organism': 'Homo sapiens',
 'type': 'Ki',
 'units': 'nM',
 'value': '0.104'}

The next step might take a few minutes - downloading the QuerySet as a Pandas DataFrame.

bioact_df = pd.DataFrame.from_dict(bioact)

bioact_df.head(3)
activity_id assay_chembl_id assay_description assay_type data_validity_comment molecule_chembl_id relation standard_type standard_units standard_value target_chembl_id target_organism type units value
0 111024 CHEMBL641011 Inhibition constant determined against Acetylc... B Potential transcription error CHEMBL11805 = Ki nM 0.104 CHEMBL220 Homo sapiens Ki nM 0.104
1 118575 CHEMBL641012 Inhibitory activity against human AChE B None CHEMBL208599 = Ki nM 0.026 CHEMBL220 Homo sapiens Ki nM 0.026
2 125075 CHEMBL641011 Inhibition constant determined against Acetylc... B None CHEMBL60745 = Ki nM 1.63 CHEMBL220 Homo sapiens Ki nM 1.63

Checked total rows and columns in the bioactivities dataframe.

bioact_df.shape
(706, 15)


Preprocess bioactivity data

When I reached the second half of data preprocessing, an alarm bell went off regarding using half maximal inhibitory concentration (IC50) values in ChEMBL. I remembered reading recent blog posts by Greg Landrum about using IC50 and inhibition constant (Ki) values from ChEMBL. A useful open-access paper (Kalliokoski et al. 2013) from 2013 also looked into this issue about using mixed IC50 data in ChEMBL, and provided a thorough overview about how to deal with situations like this. There was also another paper (Kramer et al. 2012) on mixed Ki data from the same author group in 2012 that touched on similar issues.

To summarise both the paper about IC50 and blog posts mentioned above:

  • it would be the best to check the details of assays used to test the compounds to ensure they were aligned and not extremely heterogeneous, since IC50 values were very assay-specific, and knowing that these values were extracted from different papers from different labs all over the world, mixing them without knowing was definitely not a good idea

  • the slightly better news was that it was more likely okay to combine Ki values for the same protein target from ChEMBL as they were found to be adding less noise to the data (however ideally similar data caution should also apply)

  • it was also possible to mix Ki values with IC50 values, but the data would need to be corrected by using a conversion factor of 2.0 to convert Ki values to IC50 values (note: I also wondered if this needed to be looked at again since this paper was published 10 years ago…)
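As a rough illustration of the last point, the conversion could look like the sketch below. The function name `ki_to_ic50` is hypothetical, and the direction used here (IC50 ≈ 2 × Ki) is my assumption; it matches the Cheng-Prusoff relationship for a competitive inhibitor when the substrate concentration equals Km. It is only a sketch and was not used in this post, since only Ki values were kept.

```python
def ki_to_ic50(ki_nm: float, factor: float = 2.0) -> float:
    """Approximate an IC50 (nM) from a Ki (nM) using a fixed conversion factor.

    The default factor of 2.0 is the value mentioned in the text; treating it
    as IC50 = factor * Ki is an assumption made for this sketch.
    """
    if ki_nm <= 0:
        raise ValueError("Ki must be a positive concentration")
    return ki_nm * factor


# Example with the Ki of the first bioactivity entry shown earlier (0.104 nM)
example_ic50 = ki_to_ic50(0.104)
```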

Because of this, I decided to stick with Ki values only for now before adding more complexities as I wasn’t entirely confident about mixing IC50 values with Ki values yet. Firstly, I checked for all types of units being used in bioact_df. There were numerous different units and formats, which meant they would need to be converted to nanomolar (nM) units first.

bioact_df["units"].unique()
array(['nM', 'M', 'uM', None, 'pM', "10'-9M", "10'-3M", "10'-6M",
       "10'-10M", '/min/M', "10'5/M/min", "10'2/M/min", "10'3/M/min",
       "10'8/M/min", "10'7/M/min", 'microM/L', 'umol/L', 'mM',
       "10'4/M/min", "10'6/M/min", 'mM/min', '10^8M'], dtype=object)

Checking again that I’ve fetched Ki values only.

bioact_df["standard_type"].unique()
array(['Ki'], dtype=object)
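The mixed concentration units seen in the “units” column above could, in principle, be normalised to nM with a small lookup table. The sketch below is only an illustration with hypothetical names (`TO_NM`, `to_nanomolar`); it covers the plain molar units and deliberately returns `None` for anything it doesn’t recognise (e.g. rate-constant units such as “/min/M”, or the “10'-9M” style strings), rather than guessing.

```python
# Multipliers to convert a value in the given unit into nanomolar (nM)
TO_NM = {
    "M": 1e9,
    "mM": 1e6,
    "uM": 1e3,
    "umol/L": 1e3,  # micromoles per litre is the same as uM
    "nM": 1.0,
    "pM": 1e-3,
}


def to_nanomolar(value, unit):
    """Return the value in nM, or None if the value or unit is unusable."""
    factor = TO_NM.get(unit)
    if value is None or factor is None:
        return None
    # ChEMBL values arrive as strings, so convert to float first
    return float(value) * factor


converted = to_nanomolar("1.5", "uM")    # a string value in micromolar
unsupported = to_nanomolar(2, "/min/M")  # not a concentration unit
```

In this post the “standard_units” column was used instead, so no manual conversion was needed.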

It looked like there were duplicate columns for units and values, so the “units” and “value” columns were removed and the “standard_units” and “standard_value” columns were kept instead. The “type” column was also dropped as there was already a “standard_type” column.

Note

Differences between “type” and “standard_type” columns were mentioned by this ChEMBL blog post.

bioact_df.drop(["units", "value", "type"], axis = 1, inplace = True)
# Re-check df
bioact_df.head(3)
activity_id assay_chembl_id assay_description assay_type data_validity_comment molecule_chembl_id relation standard_type standard_units standard_value target_chembl_id target_organism
0 111024 CHEMBL641011 Inhibition constant determined against Acetylc... B Potential transcription error CHEMBL11805 = Ki nM 0.104 CHEMBL220 Homo sapiens
1 118575 CHEMBL641012 Inhibitory activity against human AChE B None CHEMBL208599 = Ki nM 0.026 CHEMBL220 Homo sapiens
2 125075 CHEMBL641011 Inhibition constant determined against Acetylc... B None CHEMBL60745 = Ki nM 1.63 CHEMBL220 Homo sapiens
bioact_df.dtypes
activity_id               int64
assay_chembl_id          object
assay_description        object
assay_type               object
data_validity_comment    object
molecule_chembl_id       object
relation                 object
standard_type            object
standard_units           object
standard_value           object
target_chembl_id         object
target_organism          object
dtype: object

The column of “standard_value” was converted from “object” to “float64” so we could use the Ki values for calculations later.

bioact_df = bioact_df.astype({"standard_value": "float64"})
# Check column data types again
bioact_df.dtypes
activity_id                int64
assay_chembl_id           object
assay_description         object
assay_type                object
data_validity_comment     object
molecule_chembl_id        object
relation                  object
standard_type             object
standard_units            object
standard_value           float64
target_chembl_id          object
target_organism           object
dtype: object
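A slightly more forgiving alternative to `astype` is `pd.to_numeric` with `errors="coerce"`, which turns anything that can’t be parsed into `NaN` instead of raising an error. The small series below is made-up toy data, not taken from ChEMBL.

```python
import pandas as pd

# Toy values as strings, including a missing entry and an unparsable one
raw = pd.Series(["0.104", "0.026", None, "not reported"])

# Unparsable entries become NaN, which can then be handled by dropna()
as_float = pd.to_numeric(raw, errors="coerce")
```

This wasn’t needed here as the conversion with `astype` worked without errors, but it might be handy for messier columns.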

The next step was to take care of any missing entries by removing them. I excluded the “data_validity_comment” column here, as it was needed later to check for any unusual activity data, e.g. excessively low or high Ki values. Most entries in this column were probably empty or “None”, which meant no particular alarm bells had been raised for those bioactivity records.

bioact_df.dropna(subset = ["activity_id", "assay_chembl_id", "assay_description", "assay_type", "molecule_chembl_id", "relation",  "standard_type", "standard_units", "standard_value", "target_chembl_id", "target_organism"], axis = 0, how = "any", inplace = True)
# Check number of rows and columns again (in this case, there appeared to be no change for rows)
bioact_df.shape
(706, 12)
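Instead of typing out every column name for `subset`, the list could also be built by excluding the one column that is allowed to be empty. This is a minimal sketch on a made-up toy dataframe (the compound IDs and the comment text are placeholders).

```python
import pandas as pd

toy = pd.DataFrame({
    "molecule_chembl_id": ["CPD_A", "CPD_B", None],
    "standard_value": [1.0, None, 3.0],
    "data_validity_comment": [None, None, "placeholder comment"],
})

# All columns except the one where missing values are acceptable
required = list(toy.columns.difference(["data_validity_comment"]))

# Drop rows with a missing value in any of the required columns
cleaned = toy.dropna(subset=required, how="any")
```

Only the first toy row survives, even though its “data_validity_comment” is empty.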

Since all the unique units inside the “units” column were checked previously, I did the same for the “standard_units” column to see which ones were recorded in it.

bioact_df["standard_units"].unique()
array(['nM', '/min/M', "10'5/M/min", "10'2/M/min", "10'3/M/min",
       "10'8/M/min", "10'7/M/min", "10'4/M/min", "10'6/M/min", 'mM/min',
       '10^8M'], dtype=object)

There was a mixture of different units.

# Check for number of non-nM units
bioact_df[bioact_df["standard_units"] != "nM"].shape[0]
61

There appeared to be 61 non-nM values inside the fetched bioactivity data.

bioact_df = bioact_df[bioact_df["standard_units"] == "nM"]

With the results narrowed to “nM” only, I checked the dataframe again to see what units were left.

# Check there were only nM
bioact_df["standard_units"].unique()
array(['nM'], dtype=object)

So the filtering worked, and the number of rows was reduced from 706 to 645 (the 12 columns were unchanged).

# Check df rows & columns
bioact_df.shape
(645, 12)

The next part was to remove duplicates in the dataframe, i.e. where the same compound had been tested more than once, by keeping only the first record for each compound.

bioact_df.drop_duplicates("molecule_chembl_id", keep = "first", inplace = True)
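Keeping the first record is simple, but which measurement ends up being kept depends on row order. One possible alternative (not what was done in this post) is to aggregate repeated measurements per compound, e.g. by taking the median. The toy data below is made up to show the difference.

```python
import pandas as pd

toy = pd.DataFrame({
    "molecule_chembl_id": ["CPD_A", "CPD_A", "CPD_A", "CPD_B"],
    "standard_value": [1.0, 3.0, 100.0, 5.0],
})

# Approach used in this post: keep the first record per compound
first_kept = toy.drop_duplicates("molecule_chembl_id", keep="first")

# Alternative sketch: one median value per compound
median_ki = (
    toy.groupby("molecule_chembl_id", as_index=False)["standard_value"].median()
)
```

For CPD_A, the first approach keeps 1.0 while the median gives 3.0, so the choice could matter when repeated measurements disagree.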

Renamed the “standard_value” and “standard_units” columns to “Ki” and “units” respectively.

bioact_df.rename(
    columns = {
        "standard_value": "Ki",
        "standard_units": "units"
    }, inplace = True
)

# Check df to ensure name change
bioact_df.head(3)
activity_id assay_chembl_id assay_description assay_type data_validity_comment molecule_chembl_id relation standard_type units Ki target_chembl_id target_organism
0 111024 CHEMBL641011 Inhibition constant determined against Acetylc... B Potential transcription error CHEMBL11805 = Ki nM 0.104 CHEMBL220 Homo sapiens
1 118575 CHEMBL641012 Inhibitory activity against human AChE B None CHEMBL208599 = Ki nM 0.026 CHEMBL220 Homo sapiens
2 125075 CHEMBL641011 Inhibition constant determined against Acetylc... B None CHEMBL60745 = Ki nM 1.630 CHEMBL220 Homo sapiens

Lastly, the index of the dataframe was reset.

bioact_df.reset_index(drop = True, inplace = True)
bioact_df.head(3)
activity_id assay_chembl_id assay_description assay_type data_validity_comment molecule_chembl_id relation standard_type units Ki target_chembl_id target_organism
0 111024 CHEMBL641011 Inhibition constant determined against Acetylc... B Potential transcription error CHEMBL11805 = Ki nM 0.104 CHEMBL220 Homo sapiens
1 118575 CHEMBL641012 Inhibitory activity against human AChE B None CHEMBL208599 = Ki nM 0.026 CHEMBL220 Homo sapiens
2 125075 CHEMBL641011 Inhibition constant determined against Acetylc... B None CHEMBL60745 = Ki nM 1.630 CHEMBL220 Homo sapiens

One final check on the number of columns and rows after preprocessing the bioactivity dataframe.

bioact_df.shape
(540, 12)

There were a total of 12 columns with 540 rows of data left in the bioactivity dataframe.


Fetching assay data

The assay data was added after I went through the rest of the data preprocessing and also after remembering to check on the confidence scores for assays used in the final data collected (to somewhat assess assay-to-target relationships). This link from ChEMBL explained what the confidence score meant.

assays = assay_api.filter(
    # Use the previously saved target ChEMBL ID
    target_chembl_id = chembl_id, 
    # Binding assays only as before
    assay_type = "B"
).only(
    "assay_chembl_id",
    "confidence_score"
)

Placing the fetched assay data into a Pandas DataFrame.

assays_df = pd.DataFrame.from_dict(assays)

print(assays_df.shape)
assays_df.head(3)
(2044, 2)
  assay_chembl_id  confidence_score
0    CHEMBL634034                 8
1    CHEMBL642512                 8
2    CHEMBL642513                 8

assays_df.describe()
       confidence_score
count       2044.000000
mean           8.778865
std            0.415113
min            8.000000
25%            9.000000
50%            9.000000
75%            9.000000
max            9.000000

It looked like the lowest confidence score for this particular protein target in binding assays was at 8, with others sitting at 9 (the highest). There were 452 assays with confidence score of 8.

# Some had score of 8 - find out which ones
assays_df[assays_df["confidence_score"] == 8]
assay_chembl_id confidence_score
0 CHEMBL634034 8
1 CHEMBL642512 8
2 CHEMBL642513 8
3 CHEMBL642514 8
4 CHEMBL642515 8
... ... ...
1141 CHEMBL3887379 8
1142 CHEMBL3887855 8
1143 CHEMBL3887947 8
1144 CHEMBL3888161 8
1874 CHEMBL5058677 8

452 rows × 2 columns
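A quicker way to get the number of assays per confidence score, rather than filtering for each score separately, might be `value_counts`. The sketch below uses a tiny made-up dataframe with placeholder assay IDs.

```python
import pandas as pd

toy_assays = pd.DataFrame({
    "assay_chembl_id": ["A1", "A2", "A3", "A4"],
    "confidence_score": [8, 9, 9, 9],
})

# Count assays per score, ordered by the score itself
score_counts = toy_assays["confidence_score"].value_counts().sort_index()
```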


Combining bioactivity & assay data

The key was to combine the bioactivity and assay data along the “assay_chembl_id” column.

bioact_assay_df = pd.merge(
    bioact_df[["assay_chembl_id", "molecule_chembl_id", "Ki", "units", "data_validity_comment"]],
    assays_df,
    on = "assay_chembl_id",
)
print(bioact_assay_df.shape)
bioact_assay_df.head(3)
(540, 6)
assay_chembl_id molecule_chembl_id Ki units data_validity_comment confidence_score
0 CHEMBL641011 CHEMBL11805 0.104 nM Potential transcription error 8
1 CHEMBL641011 CHEMBL60745 1.630 nM None 8
2 CHEMBL641012 CHEMBL208599 0.026 nM None 8
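By default `pd.merge` does an inner join, so any bioactivity row whose assay ID is missing from the assay dataframe would be dropped silently. Here the row count stayed at 540, so nothing was lost, but a more defensive version could look like the sketch below. It uses made-up toy data; `how`, `validate` and `indicator` are existing `pd.merge` parameters.

```python
import pandas as pd

toy_bioact = pd.DataFrame({
    "assay_chembl_id": ["A1", "A1", "A2", "A3"],
    "molecule_chembl_id": ["CPD_A", "CPD_B", "CPD_C", "CPD_D"],
})
toy_assays = pd.DataFrame({
    "assay_chembl_id": ["A1", "A2"],
    "confidence_score": [8, 9],
})

merged = pd.merge(
    toy_bioact,
    toy_assays,
    on="assay_chembl_id",
    how="left",              # keep every bioactivity row
    validate="many_to_one",  # raise if an assay ID repeats in toy_assays
    indicator=True,          # add a "_merge" column describing each row
)

# Rows without a matching assay record
unmatched = merged[merged["_merge"] == "left_only"]
```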

I actually came back to this step to relax the confidence score limit to include the 8s as well as the 9s (previously I had tried using only assays with a score of 9), so that donepezil and galantamine could be included in the dataset as well (the purpose of this will be clearer in post 3 when building the model).


Fetching compound data

Having identified the protein target and obtained the bioactivity and assay data, the next step was to fetch the compound data. This could be done by using the compound ChEMBL IDs available in the bioactivity dataset.

cpds = cpd_api.filter(
    molecule_chembl_id__in = list(bioact_df["molecule_chembl_id"])
).only(
    "molecule_chembl_id",
    "molecule_structures",
    "max_phase"
)

Here, the same step was applied, where the compound QuerySet object was converted into a Pandas DataFrame. However, the compound data might take longer to retrieve than the bioactivity data. One way to monitor progress was to use the tqdm package.

compds = list(tqdm(cpds))
  0%|          | 0/540 [00:00<?, ?it/s]
 93%|█████████▎| 501/540 [00:00<00:00, 4845.36it/s]
100%|██████████| 540/540 [00:00<00:00, 5020.27it/s]

Converting retrieved compound QuerySet into a Pandas DataFrame.

cpds_df = pd.DataFrame.from_records(compds)
print(cpds_df.shape)
cpds_df.head(3)
(540, 3)
max_phase molecule_chembl_id molecule_structures
0 None CHEMBL28 {'canonical_smiles': 'O=c1cc(-c2ccc(O)cc2)oc2c...
1 3.0 CHEMBL50 {'canonical_smiles': 'O=c1c(O)c(-c2ccc(O)c(O)c...
2 None CHEMBL8320 {'canonical_smiles': 'O=C1C=CC(=O)C=C1', 'molf...


Preprocess compound data

Removing any missing entries in the compound data (excluding the “max_phase” column as it was needed during the model training/testing part in post 3 - note: “None” entries meant they were preclinical molecules so not assigned with a max phase yet).

cpds_df.dropna(subset = ["molecule_chembl_id", "molecule_structures"], axis = 0, how = "any", inplace = True)

# Check columns & rows in df
cpds_df.shape
(540, 3)

Removing any duplicates in the compound data.

cpds_df.drop_duplicates("molecule_chembl_id", keep = "first", inplace = True)

# Check columns & rows again
cpds_df.shape
(540, 3)

Ideally, only the compounds with canonical SMILES would be kept. Checking for the types of molecular representations used in the “molecule_structures” column of the compound dataset.

# Randomly choosing the 2nd entry as example
cpds_df.iloc[1].molecule_structures.keys()
dict_keys(['canonical_smiles', 'molfile', 'standard_inchi', 'standard_inchi_key'])

There were 4 types: “canonical_smiles”, “molfile”, “standard_inchi” and “standard_inchi_key”.

# Create an empty list to store the canonical smiles
can_smiles = []

# Create a for loop to loop over each row of data, 
# searching for only canonical_smiles to append to the created list
for i, cpd in cpds_df.iterrows():
    try:
        can_smiles.append(cpd["molecule_structures"]["canonical_smiles"])
    except KeyError:
        can_smiles.append(None)

# Create a new df column with name as "smiles", 
# which will store all the canonical smiles collected from the list above
cpds_df["smiles"] = can_smiles
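The loop above relies on `KeyError`, which works here because rows with missing “molecule_structures” were already dropped. If an entry were `None` instead of a dictionary, indexing it would raise a `TypeError` rather than a `KeyError`. A small helper such as the one sketched below would cover both cases; the name `get_canonical_smiles` is hypothetical and the records are toy examples.

```python
def get_canonical_smiles(structures):
    """Return the canonical SMILES from a structures dict, or None if unavailable."""
    if not isinstance(structures, dict):
        return None
    return structures.get("canonical_smiles")


# Toy records: a normal entry, one without SMILES, and a missing entry
toy_structures = [
    {"canonical_smiles": "O=C1C=CC(=O)C=C1"},
    {"molfile": ""},
    None,
]

toy_smiles = [get_canonical_smiles(entry) for entry in toy_structures]
```

With a dataframe, the same helper could be passed to `apply` on the “molecule_structures” column.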

Check the compound dataframe quickly to see if a new column for SMILES has been created.

cpds_df.head(3)
max_phase molecule_chembl_id molecule_structures smiles
0 None CHEMBL28 {'canonical_smiles': 'O=c1cc(-c2ccc(O)cc2)oc2c... O=c1cc(-c2ccc(O)cc2)oc2cc(O)cc(O)c12
1 3.0 CHEMBL50 {'canonical_smiles': 'O=c1c(O)c(-c2ccc(O)c(O)c... O=c1c(O)c(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12
2 None CHEMBL8320 {'canonical_smiles': 'O=C1C=CC(=O)C=C1', 'molf... O=C1C=CC(=O)C=C1

Once confirmed, the old “molecule_structures” column was then removed.

cpds_df.drop("molecule_structures", axis = 1, inplace = True)

Finally, adding another step to ensure all missing entries or entries without canonical SMILES strings were removed from the compound dataset.

cpds_df.dropna(subset = ["smiles"], axis = 0, how = "any", inplace = True)

print(cpds_df.shape)
(540, 3)

Final look at the compound dataset, which should only include max phase, compound ChEMBL IDs and SMILES columns.

cpds_df.head(3)
max_phase molecule_chembl_id smiles
0 None CHEMBL28 O=c1cc(-c2ccc(O)cc2)oc2cc(O)cc(O)c12
1 3.0 CHEMBL50 O=c1c(O)c(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12
2 None CHEMBL8320 O=C1C=CC(=O)C=C1


Combining bioactivity and compound data

To combine both datasets, the key was to look for a common column (similar to a SQL “join” query) between the two datasets.

Listing all the column names for both datasets.

bioact_assay_df.columns
Index(['assay_chembl_id', 'molecule_chembl_id', 'Ki', 'units',
       'data_validity_comment', 'confidence_score'],
      dtype='object')
cpds_df.columns
Index(['max_phase', 'molecule_chembl_id', 'smiles'], dtype='object')

Clearly, the column that existed in both dataframes was the “molecule_chembl_id” column.

The next step was to combine or merge both datasets.

# Create a final dataframe that will contain both bioactivity and compound data
dtree_df = pd.merge(
    bioact_assay_df[["molecule_chembl_id","Ki", "units", "data_validity_comment"]],
    cpds_df,
    on = "molecule_chembl_id",
)

dtree_df.head(3)
molecule_chembl_id Ki units data_validity_comment max_phase smiles
0 CHEMBL11805 0.104 nM Potential transcription error None COc1ccccc1CN(C)CCCCCC(=O)N(C)CCCCCCCCN(C)C(=O)...
1 CHEMBL60745 1.630 nM None None CC[N+](C)(C)c1cccc(O)c1.[Br-]
2 CHEMBL208599 0.026 nM None None CCC1=CC2Cc3nc4cc(Cl)ccc4c(N)c3[C@@H](C1)C2

Shape of the final dataframe was checked.

print(dtree_df.shape)
(540, 6)

Saving a copy of the merged dataframe for now to avoid re-running the previous code repeatedly, and also to be ready for the second half of the data preprocessing work, which will be in post 2.

dtree_df.to_csv("ache_chembl.csv")
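One thing to keep in mind when reading the file back: `to_csv` writes the dataframe index as an unnamed first column by default, so it will show up as “Unnamed: 0” unless `index_col=0` is used (or `index=False` is set when saving). The sketch below shows this on a made-up toy dataframe, using an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "molecule_chembl_id": ["CPD_A", "CPD_B"],
    "Ki": [0.104, 1.63],
})

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading back without index_col keeps the old index as an extra column
buffer.seek(0)
with_extra_col = pd.read_csv(buffer)

# Reading back with index_col=0 restores the original columns
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
```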

References

Kalliokoski, Tuomo, Christian Kramer, Anna Vulpetti, and Peter Gedeck. 2013. “Comparability of Mixed IC50 Data – A Statistical Analysis.” Edited by Andrea Cavalli. PLoS ONE 8 (4): e61007. https://doi.org/10.1371/journal.pone.0061007.
Kramer, Christian, Tuomo Kalliokoski, Peter Gedeck, and Anna Vulpetti. 2012. “The Experimental Uncertainty of Heterogeneous Public Ki Data.” Journal of Medicinal Chemistry 55 (11): 5165–73. https://doi.org/10.1021/jm300131x.
Tilborg, Derek van, Alisa Alenicheva, and Francesca Grisoni. 2022. “Exposing the Limitations of Molecular Machine Learning with Activity Cliffs.” Journal of Chemical Information and Modeling 62 (23): 5938–51. https://doi.org/10.1021/acs.jcim.2c01073.