Series overview
- Post 1 (this post) - data collection from the ChEMBL database using the web resource client in Python, with initial data preprocessing
- Post 2 - more data preprocessing and transformation to reach the final dataset prior to model building
- Post 3 - estimating experimental errors and building a decision tree model using scikit-learn
Introduction
I've now come to a stage where I can do some more machine learning (ML) work, after reading a few peer-reviewed papers about ML and drug discovery. It seems that traditional ML methods are still indispensable performance-wise, and when used in combination with deep learning neural networks, they tend to increase prediction accuracy further. I also haven't ventured into the practicality and usefulness of large language models in drug discovery yet, although I'm aware that work in this area has started. Comments from experienced seniors mention that these models are still very much novel and may therefore not be that useful yet, though given the speed at which the so-called "AI" field evolves, this may well change very soon. Also, from what I can imagine, molecular representations as texts or strings are not quite the same as natural human language, since there are many other chemistry-specific features to consider, e.g. chirality, aromaticity and so on. Because of this, I'm sticking with learning to walk first by trying to cover conventional ML methods more thoroughly, before trying to run in the deep learning zone.
This leads to this series of posts (3 in total) about decision trees. Previously, I had only lightly touched on a commonly used classifier algorithm, logistic regression, as the first series in the ML realm. Reflecting back, I think I could have done a more thorough job at the data preparation stage, so that is attempted this time. The data preparation used here was carried out with strong reference to the materials and methods section of this paper (Tilborg, Alenicheva, and Grisoni 2022), which was one of the papers I had read. There are probably other useful methods out there, but this paper made sense to me, so I adopted a few of its ways of doing things during data preprocessing.
Data retrieval
This time I decided to try something new, which was to use the ChEMBL web resource client to collect data (i.e. not by direct file downloads from the ChEMBL website, although another useful way could be through SQL queries, which is also on my list to try later). I found a great online resource about fetching data this way in the TeachOpenCADD talktorial on compound data acquisition. The data retrieval workflow used below was mainly adapted from that talktorial, with a few changes to suit the selected dataset and ML model.
The web resource client is supported by the ChEMBL group and is based on a Django QuerySet interface. Their GitHub repository explains a bit more about it; in particular, the Jupyter notebook linked in the repository helps a lot with how to write code to search for specific data.
To do this, a few libraries needed to be loaded first.
# Import libraries
# Fetch data through ChEMBL web resource client
from chembl_webresource_client.new_client import new_client
# Dataframe library
import pandas as pd
# Progress bar
from tqdm import tqdm
To see what types of data are provided by the ChEMBL web resource client, run the following code and refer to the ChEMBL documentation to find out what data are embedded inside the different data categories. Sometimes it might not be that straightforward and some digging is required, as sketched after the resource list below (I actually went back to this step to find "data_validity_comment" when I was trying to do some compound sanitisations).
available_resources = [resource for resource in dir(new_client) if not resource.startswith('_')]
print(available_resources)
['activity', 'activity_supplementary_data_by_activity', 'assay', 'assay_class', 'atc_class', 'binding_site', 'biotherapeutic', 'cell_line', 'chembl_id_lookup', 'compound_record', 'compound_structural_alert', 'description', 'document', 'document_similarity', 'drug', 'drug_indication', 'drug_warning', 'go_slim', 'image', 'mechanism', 'metabolism', 'molecule', 'molecule_form', 'official', 'organism', 'protein_classification', 'similarity', 'source', 'substructure', 'target', 'target_component', 'target_relation', 'tissue', 'xref_source']
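As a side note, one minimal way to do that digging is to pull a single record and list its keys. The sketch below is a hypothetical example using the activity resource; it assumes the target ChEMBL ID for acetylcholinesterase (CHEMBL220), which is only properly fetched later in this post.

# Sketch only: inspect the fields available on one activity record,
# e.g. to discover "data_validity_comment" (CHEMBL220 assumed here)
example_activity = new_client.activity.filter(target_chembl_id = "CHEMBL220")[0]
print(sorted(example_activity.keys()))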
Resource objects were created to enable API access as suggested by the talktorial.
# For targets (proteins)
targets_api = new_client.target

# For bioactivities
bioact_api = new_client.activity

# For assays
assay_api = new_client.assay

# For compounds
cpd_api = new_client.molecule
Checked object type for one of these API objects (e.g. bioactivity API object).
type(bioact_api)
chembl_webresource_client.query_set.QuerySet
Fetching target data
A protein target, e.g. acetylcholinesterase, was randomly chosen, with UniProt used to look up the protein's UniProt ID.
# Specify UniProt ID for acetylcholinesterase
uniprot_id = "P22303"

# Get info from ChEMBL about this protein target,
# with selected features only
targets = targets_api.get(target_components__accession = uniprot_id).only(
    "target_chembl_id",
    "organism",
    "pref_name",
    "target_type"
)
The query results were stored in a "targets" object, which is a QuerySet with lazy data evaluation, meaning it only acts when there is a request for the data. Therefore, to see the results, the "targets" object was read into a Pandas DataFrame.
# Read "targets" with Pandas
= pd.DataFrame.from_records(targets)
targets targets
|   | organism | pref_name | target_chembl_id | target_type |
|---|---|---|---|---|
| 0 | Homo sapiens | Acetylcholinesterase | CHEMBL220 | SINGLE PROTEIN |
| 1 | Homo sapiens | Acetylcholinesterase | CHEMBL220 | SINGLE PROTEIN |
| 2 | Homo sapiens | Cholinesterases; ACHE & BCHE | CHEMBL2095233 | SELECTIVITY GROUP |
Selected the first protein target from this dataframe.
# Save the first protein in the dataframe
select_target = targets.iloc[0]
select_target
organism Homo sapiens
pref_name Acetylcholinesterase
target_chembl_id CHEMBL220
target_type SINGLE PROTEIN
Name: 0, dtype: object
Then saved the selected ChEMBL ID for the first protein (to be used later).
chembl_id = select_target.target_chembl_id

# Check it's saved
print(chembl_id)
CHEMBL220
Fetching bioactivity data
Obtaining bioactivity data for the selected target.
bioact = bioact_api.filter(
    # Use the previously saved target ChEMBL ID
    target_chembl_id = chembl_id,
    # Selecting for Ki
    standard_type = "Ki",
    # Requesting exact measurements
    relation = "=",
    # Binding data as "B"
    assay_type = "B",
).only(
    "activity_id",
    "data_validity_comment",
    "assay_chembl_id",
    "assay_description",
    "assay_type",
    "molecule_chembl_id",
    "standard_units",
    "standard_type",
    "relation",
    "standard_value",
    "target_chembl_id",
    "target_organism"
)
# Check the length and type of bioactivities object
print(len(bioact), type(bioact))
706 <class 'chembl_webresource_client.query_set.QuerySet'>
To have a quick look at the data held inside each entry of the bioactivity dataset, e.g. the first entry.
print(len(bioact[0]), type(bioact[0]))
bioact[0]
15 <class 'dict'>
{'activity_id': 111024,
'assay_chembl_id': 'CHEMBL641011',
'assay_description': 'Inhibition constant determined against Acetylcholinesterase (AChE) receptor.',
'assay_type': 'B',
'data_validity_comment': 'Potential transcription error',
'molecule_chembl_id': 'CHEMBL11805',
'relation': '=',
'standard_type': 'Ki',
'standard_units': 'nM',
'standard_value': '0.104',
'target_chembl_id': 'CHEMBL220',
'target_organism': 'Homo sapiens',
'type': 'Ki',
'units': 'nM',
'value': '0.104'}
The next step might take a few minutes - downloading the QuerySet as a Pandas DataFrame.
bioact_df = pd.DataFrame.from_dict(bioact)

bioact_df.head(3)
|   | activity_id | assay_chembl_id | assay_description | assay_type | data_validity_comment | molecule_chembl_id | relation | standard_type | standard_units | standard_value | target_chembl_id | target_organism | type | units | value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 111024 | CHEMBL641011 | Inhibition constant determined against Acetylc... | B | Potential transcription error | CHEMBL11805 | = | Ki | nM | 0.104 | CHEMBL220 | Homo sapiens | Ki | nM | 0.104 |
| 1 | 118575 | CHEMBL641012 | Inhibitory activity against human AChE | B | None | CHEMBL208599 | = | Ki | nM | 0.026 | CHEMBL220 | Homo sapiens | Ki | nM | 0.026 |
| 2 | 125075 | CHEMBL641011 | Inhibition constant determined against Acetylc... | B | None | CHEMBL60745 | = | Ki | nM | 1.63 | CHEMBL220 | Homo sapiens | Ki | nM | 1.63 |
Checked total rows and columns in the bioactivities dataframe.
bioact_df.shape
(706, 15)
Preprocess bioactivity data
When I reached the second half of data preprocessing, an alarm bell went off regarding using half maximal inhibitory concentration (IC50) values in ChEMBL. I remembered reading recent blog posts by Greg Landrum about using IC50 and inhibition constant (Ki) values from ChEMBL. A useful open-access paper (Kalliokoski et al. 2013) from 2013 also looked into this issue about using mixed IC50 data in ChEMBL, and provided a thorough overview about how to deal with situations like this. There was also another paper (Kramer et al. 2012) on mixed Ki data from the same author group in 2012 that touched on similar issues.
To summarise both the IC50 paper and the blog posts mentioned above:

- it would be best to check the details of the assays used to test the compounds to ensure they are aligned and not extremely heterogeneous; IC50 values are very assay-specific, and since these values were extracted from different papers from different labs all over the world, mixing them blindly is definitely not a good idea
- the slightly better news is that it is more likely okay to combine Ki values for the same protein target from ChEMBL, as they were found to add less noise to the data (although ideally similar caution should still apply)
- it is also possible to mix Ki values with IC50 values, but the data would need to be corrected by using a conversion factor of 2.0 to convert Ki values to IC50 values, as sketched below (note: I also wondered if this needs to be revisited since that paper was published 10 years ago…)
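To make that last point concrete, here is a minimal sketch of the correction, assuming the IC50 ≈ 2 × Ki relationship reported by Kalliokoski et al. (2013); the small dataframe is hypothetical, and this conversion is not actually applied in this post.

import pandas as pd

# Hypothetical mixed dataset with Ki and IC50 rows (values in nM)
mixed_df = pd.DataFrame({
    "standard_type": ["Ki", "IC50", "Ki"],
    "standard_value": [0.104, 5.2, 1.63],
})

# Convert the Ki rows to approximate IC50 values via the factor of 2.0
is_ki = mixed_df["standard_type"] == "Ki"
mixed_df.loc[is_ki, "standard_value"] *= 2.0
mixed_df.loc[is_ki, "standard_type"] = "IC50"
print(mixed_df)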
Because of this, I decided to stick with Ki values only for now before adding more complexity, as I wasn't entirely confident about mixing IC50 values with Ki values yet. Firstly, I checked all the types of units being used in bioact_df. There were numerous different units and formats, which meant they would need to be standardised to nanomolar (nM) units first (a small conversion sketch follows the unit check below).
"units"].unique() bioact_df[
array(['nM', 'M', 'uM', None, 'pM', "10'-9M", "10'-3M", "10'-6M",
"10'-10M", '/min/M', "10'5/M/min", "10'2/M/min", "10'3/M/min",
"10'8/M/min", "10'7/M/min", 'microM/L', 'umol/L', 'mM',
"10'4/M/min", "10'6/M/min", 'mM/min', '10^8M'], dtype=object)
Checking again that I’ve fetched Ki values only.
"standard_type"].unique() bioact_df[
array(['Ki'], dtype=object)
It looked like the "units" and "value" columns duplicated the "standard_units" and "standard_value" columns, so the "units" and "value" columns were removed and the "standard_units" and "standard_value" columns were kept instead. The "type" column was also dropped, as there was already a "standard_type" column.
"units", "value", "type"], axis = 1, inplace = True)
bioact_df.drop([# Re-check df
3) bioact_df.head(
|   | activity_id | assay_chembl_id | assay_description | assay_type | data_validity_comment | molecule_chembl_id | relation | standard_type | standard_units | standard_value | target_chembl_id | target_organism |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 111024 | CHEMBL641011 | Inhibition constant determined against Acetylc... | B | Potential transcription error | CHEMBL11805 | = | Ki | nM | 0.104 | CHEMBL220 | Homo sapiens |
| 1 | 118575 | CHEMBL641012 | Inhibitory activity against human AChE | B | None | CHEMBL208599 | = | Ki | nM | 0.026 | CHEMBL220 | Homo sapiens |
| 2 | 125075 | CHEMBL641011 | Inhibition constant determined against Acetylc... | B | None | CHEMBL60745 | = | Ki | nM | 1.63 | CHEMBL220 | Homo sapiens |
bioact_df.dtypes
activity_id int64
assay_chembl_id object
assay_description object
assay_type object
data_validity_comment object
molecule_chembl_id object
relation object
standard_type object
standard_units object
standard_value object
target_chembl_id object
target_organism object
dtype: object
The "standard_value" column was converted from "object" to "float64" so the Ki values could be used for calculations later.
bioact_df = bioact_df.astype({"standard_value": "float64"})

# Check column data types again
bioact_df.dtypes
activity_id int64
assay_chembl_id object
assay_description object
assay_type object
data_validity_comment object
molecule_chembl_id object
relation object
standard_type object
standard_units object
standard_value float64
target_chembl_id object
target_organism object
dtype: object
The next step was to take care of any missing entries by removing them. I excluded the "data_validity_comment" column here, as it was needed to check whether there was any unusual activity data, e.g. excessively low or high Ki values. Many compounds had empty or "None" entries in this column, which simply meant there were no particular alarm bells for their extracted bioactivity data.
= ["activity_id", "assay_chembl_id", "assay_description", "assay_type", "molecule_chembl_id", "relation", "standard_type", "standard_units", "standard_value", "target_chembl_id", "target_organism"], axis = 0, how = "any", inplace = True)
bioact_df.dropna(subset # Check number of rows and columns again (in this case, there appeared to be no change for rows)
bioact_df.shape
(706, 12)
Since all the unique units inside the old "units" and "value" columns were checked previously, I did the same for the "standard_units" column to see what was recorded in it.
"standard_units"].unique() bioact_df[
array(['nM', '/min/M', "10'5/M/min", "10'2/M/min", "10'3/M/min",
"10'8/M/min", "10'7/M/min", "10'4/M/min", "10'6/M/min", 'mM/min',
'10^8M'], dtype=object)
There was still a mixture of different units.
# Check for number of non-nM units
bioact_df[bioact_df["standard_units"] != "nM"].shape[0]
61
There appeared to be 61 non-nM values inside the fetched bioactivity data.
I then narrowed the results to "nM" only, and checked the dataframe again to see what units were left.

bioact_df = bioact_df[bioact_df["standard_units"] == "nM"]
# Check there were only nM
bioact_df["standard_units"].unique()
array(['nM'], dtype=object)
So the filtering worked, and the number of rows was reduced (with the number of columns unchanged).
# Check df rows & columns
bioact_df.shape
(645, 12)
The next part was to remove duplicates in the dataframe, especially where there were duplicate tests for the same compound.
"molecule_chembl_id", keep = "first", inplace = True) bioact_df.drop_duplicates(
Renamed the “standard_value” and “standard_units” columns to “Ki” and “units” respectively.
bioact_df.rename(
    columns = {
        "standard_value": "Ki",
        "standard_units": "units"
    }, inplace = True
)

# Check df to ensure name change
bioact_df.head(3)
|   | activity_id | assay_chembl_id | assay_description | assay_type | data_validity_comment | molecule_chembl_id | relation | standard_type | units | Ki | target_chembl_id | target_organism |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 111024 | CHEMBL641011 | Inhibition constant determined against Acetylc... | B | Potential transcription error | CHEMBL11805 | = | Ki | nM | 0.104 | CHEMBL220 | Homo sapiens |
| 1 | 118575 | CHEMBL641012 | Inhibitory activity against human AChE | B | None | CHEMBL208599 | = | Ki | nM | 0.026 | CHEMBL220 | Homo sapiens |
| 2 | 125075 | CHEMBL641011 | Inhibition constant determined against Acetylc... | B | None | CHEMBL60745 | = | Ki | nM | 1.630 | CHEMBL220 | Homo sapiens |
Lastly, the index of the dataframe was reset.
bioact_df.reset_index(drop = True, inplace = True)
bioact_df.head(3)
|   | activity_id | assay_chembl_id | assay_description | assay_type | data_validity_comment | molecule_chembl_id | relation | standard_type | units | Ki | target_chembl_id | target_organism |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 111024 | CHEMBL641011 | Inhibition constant determined against Acetylc... | B | Potential transcription error | CHEMBL11805 | = | Ki | nM | 0.104 | CHEMBL220 | Homo sapiens |
| 1 | 118575 | CHEMBL641012 | Inhibitory activity against human AChE | B | None | CHEMBL208599 | = | Ki | nM | 0.026 | CHEMBL220 | Homo sapiens |
| 2 | 125075 | CHEMBL641011 | Inhibition constant determined against Acetylc... | B | None | CHEMBL60745 | = | Ki | nM | 1.630 | CHEMBL220 | Homo sapiens |
One final check on the number of columns and rows after preprocessing the bioactivity dataframe.
bioact_df.shape
(540, 12)
There were a total of 12 columns with 540 rows of data left in the bioactivity dataframe.
Fetching assay data
The assay data was added after I had gone through the rest of the data preprocessing, when I remembered to check the confidence scores of the assays used in the final data collected (to somewhat assess the assay-to-target relationships). This link from ChEMBL explains what the confidence score means.
assays = assay_api.filter(
    # Use the previously saved target ChEMBL ID
    target_chembl_id = chembl_id,
    # Binding assays only as before
    assay_type = "B"
).only(
    "assay_chembl_id",
    "confidence_score"
)
Placing the fetched assay data into a Pandas DataFrame.
assays_df = pd.DataFrame.from_dict(assays)

print(assays_df.shape)
assays_df.head(3)
(2044, 2)
|   | assay_chembl_id | confidence_score |
|---|---|---|
| 0 | CHEMBL634034 | 8 |
| 1 | CHEMBL642512 | 8 |
| 2 | CHEMBL642513 | 8 |
assays_df.describe()
|       | confidence_score |
|---|---|
| count | 2044.000000 |
| mean | 8.778865 |
| std | 0.415113 |
| min | 8.000000 |
| 25% | 9.000000 |
| 50% | 9.000000 |
| 75% | 9.000000 |
| max | 9.000000 |
It looked like the lowest confidence score for this particular protein target in binding assays was at 8, with others sitting at 9 (the highest). There were 452 assays with confidence score of 8.
# Some had score of 8 - find out which ones
assays_df[assays_df["confidence_score"] == 8]
|   | assay_chembl_id | confidence_score |
|---|---|---|
| 0 | CHEMBL634034 | 8 |
| 1 | CHEMBL642512 | 8 |
| 2 | CHEMBL642513 | 8 |
| 3 | CHEMBL642514 | 8 |
| 4 | CHEMBL642515 | 8 |
| ... | ... | ... |
| 1141 | CHEMBL3887379 | 8 |
| 1142 | CHEMBL3887855 | 8 |
| 1143 | CHEMBL3887947 | 8 |
| 1144 | CHEMBL3888161 | 8 |
| 1874 | CHEMBL5058677 | 8 |

452 rows × 2 columns
Combining bioactivity & assay data
The key was to combine the bioactivity and assay data along the “assay_chembl_id” column.
bioact_assay_df = pd.merge(
    bioact_df[["assay_chembl_id", "molecule_chembl_id", "Ki", "units", "data_validity_comment"]],
    assays_df,
    on = "assay_chembl_id",
)

print(bioact_assay_df.shape)
bioact_assay_df.head(3)
(540, 6)
|   | assay_chembl_id | molecule_chembl_id | Ki | units | data_validity_comment | confidence_score |
|---|---|---|---|---|---|---|
| 0 | CHEMBL641011 | CHEMBL11805 | 0.104 | nM | Potential transcription error | 8 |
| 1 | CHEMBL641011 | CHEMBL60745 | 1.630 | nM | None | 8 |
| 2 | CHEMBL641012 | CHEMBL208599 | 0.026 | nM | None | 8 |
I actually came back to this step to relax the confidence score limit so that assays scored 8 were included as well as the 9s (previously I had tried using only assays with a score of 9). This allowed donepezil and galantamine to be included in the dataset (the purpose of this will become clearer in post 3 when building the model).
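For reference, here is a minimal sketch of the stricter filter I had tried first (an assumption reconstructed from the merged dataframe above, not a step kept in the final workflow):

# Keep only rows from assays with the highest confidence score of 9
bioact_assay_df_strict = bioact_assay_df[bioact_assay_df["confidence_score"] == 9]
print(bioact_assay_df_strict.shape)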
Fetching compound data
Having identified the protein target and obtained the bioactivity and assay data, the next step was to fetch the compound data. This could be done using the molecule ChEMBL IDs available in the bioactivity dataset.
cpds = cpd_api.filter(
    molecule_chembl_id__in = list(bioact_df["molecule_chembl_id"])
).only(
    "molecule_chembl_id",
    "molecule_structures",
    "max_phase"
)
Here, the same step was applied, where the compound QuerySet object was converted into a Pandas DataFrame. However, extracting the compound data might take longer than the bioactivity data. One way to monitor progress is to use the tqdm package.
compds = list(tqdm(cpds))
0%| | 0/540 [00:00<?, ?it/s]
93%|█████████▎| 501/540 [00:00<00:00, 4845.36it/s]
100%|██████████| 540/540 [00:00<00:00, 5020.27it/s]
Converting the retrieved compound QuerySet into a Pandas DataFrame.
cpds_df = pd.DataFrame.from_records(compds)

print(cpds_df.shape)
cpds_df.head(3)
(540, 3)
|   | max_phase | molecule_chembl_id | molecule_structures |
|---|---|---|---|
| 0 | None | CHEMBL28 | {'canonical_smiles': 'O=c1cc(-c2ccc(O)cc2)oc2c... |
| 1 | 3.0 | CHEMBL50 | {'canonical_smiles': 'O=c1c(O)c(-c2ccc(O)c(O)c... |
| 2 | None | CHEMBL8320 | {'canonical_smiles': 'O=C1C=CC(=O)C=C1', 'molf... |
Preprocess compound data
Removing any missing entries in the compound data, excluding the "max_phase" column, as it was needed for the model training/testing part in post 3 (note: "None" entries there mean the molecules are preclinical, so they have not been assigned a max phase yet).
= ["molecule_chembl_id", "molecule_structures"], axis = 0, how = "any", inplace = True)
cpds_df.dropna(subset
# Check columns & rows in df
cpds_df.shape
(540, 3)
Removing any duplicates in the compound data.
"molecule_chembl_id", keep = "first", inplace = True)
cpds_df.drop_duplicates(
# Check columns & rows again
cpds_df.shape
(540, 3)
Ideally, only the compounds with canonical SMILES would be kept. Checking for the types of molecular representations used in the “molecule_structures” column of the compound dataset.
# Randomly choosing the 2nd entry as example
cpds_df.iloc[1].molecule_structures.keys()
dict_keys(['canonical_smiles', 'molfile', 'standard_inchi', 'standard_inchi_key'])
There were 4 types: “canonical_smiles”, “molfile”, “standard_inchi” and “standard_inchi_key”.
# Create an empty list to store the canonical smiles
can_smiles = []

# Create a for loop to loop over each row of data,
# searching for only canonical_smiles to append to the created list
for i, cpd in cpds_df.iterrows():
    try:
        can_smiles.append(cpd["molecule_structures"]["canonical_smiles"])
    except KeyError:
        can_smiles.append(None)

# Create a new df column named "smiles",
# which will store all the canonical smiles collected from the list above
cpds_df["smiles"] = can_smiles
Check the compound dataframe quickly to see if a new column for SMILES has been created.
cpds_df.head(3)
|   | max_phase | molecule_chembl_id | molecule_structures | smiles |
|---|---|---|---|---|
| 0 | None | CHEMBL28 | {'canonical_smiles': 'O=c1cc(-c2ccc(O)cc2)oc2c... | O=c1cc(-c2ccc(O)cc2)oc2cc(O)cc(O)c12 |
| 1 | 3.0 | CHEMBL50 | {'canonical_smiles': 'O=c1c(O)c(-c2ccc(O)c(O)c... | O=c1c(O)c(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12 |
| 2 | None | CHEMBL8320 | {'canonical_smiles': 'O=C1C=CC(=O)C=C1', 'molf... | O=C1C=CC(=O)C=C1 |
Once confirmed, the old “molecule_structures” column was then removed.
"molecule_structures", axis = 1, inplace = True) cpds_df.drop(
Finally, adding another step to ensure all missing entries or entries without canonical SMILES strings were removed from the compound dataset.
= ["smiles"], axis = 0, how = "any", inplace = True)
cpds_df.dropna(subset
print(cpds_df.shape)
(540, 3)
Final look at the compound dataset, which should only include max phase, compound ChEMBL IDs and SMILES columns.
cpds_df.head(3)
|   | max_phase | molecule_chembl_id | smiles |
|---|---|---|---|
| 0 | None | CHEMBL28 | O=c1cc(-c2ccc(O)cc2)oc2cc(O)cc(O)c12 |
| 1 | 3.0 | CHEMBL50 | O=c1c(O)c(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12 |
| 2 | None | CHEMBL8320 | O=C1C=CC(=O)C=C1 |
Combining bioactivity and compound data
To combine both datasets, the key was to look for a common column (similar to a SQL "join" query) between the two datasets.
Listing all the column names for both datasets.
bioact_assay_df.columns
Index(['assay_chembl_id', 'molecule_chembl_id', 'Ki', 'units',
'data_validity_comment', 'confidence_score'],
dtype='object')
cpds_df.columns
Index(['max_phase', 'molecule_chembl_id', 'smiles'], dtype='object')
Clearly, the column that existed in both dataframes was the “molecule_chembl_id” column.
The next step was to combine or merge both datasets.
# Create a final dataframe that will contain both bioactivity and compound data
dtree_df = pd.merge(
    bioact_assay_df[["molecule_chembl_id", "Ki", "units", "data_validity_comment"]],
    cpds_df,
    on = "molecule_chembl_id",
)

dtree_df.head(3)
|   | molecule_chembl_id | Ki | units | data_validity_comment | max_phase | smiles |
|---|---|---|---|---|---|---|
| 0 | CHEMBL11805 | 0.104 | nM | Potential transcription error | None | COc1ccccc1CN(C)CCCCCC(=O)N(C)CCCCCCCCN(C)C(=O)... |
| 1 | CHEMBL60745 | 1.630 | nM | None | None | CC[N+](C)(C)c1cccc(O)c1.[Br-] |
| 2 | CHEMBL208599 | 0.026 | nM | None | None | CCC1=CC2Cc3nc4cc(Cl)ccc4c(N)c3[C@@H](C1)C2 |
Shape of the final dataframe was checked.
print(dtree_df.shape)
(540, 6)
A copy of the merged dataframe was saved at this point, to avoid re-running the previous code repeatedly, and to be ready for the second half of the data preprocessing work, which will be in post 2.
"ache_chembl.csv") dtree_df.to_csv(