How it all begins
I have always wondered if there’s a more succinct, straightforward or better way to look at ADRs. I’m also aware that ADRs as a subject matter is complicated enough in research, but I also can’t help to think that it is also how a newly initiated drug commonly fails in clinical trials (with tons of resources going down the drain after that), and also seeing people experiencing ADRs in real-life previously, it will be unwise to avoid it in the drug discovery and development landscape. So here’s my very tiny attempt at this tremendous task of sorting out how to minimise ADRs by starting at the very beginning, at the preclinical and computational stage.
My very early idea is to integrate ADRs with cytochrome P450 (CYP) as they’re closely linked by the effects of CYP inhibitors and inducers. During my previous role, there are a lot of well-known drug reference sources being used frequently on a daily basis, e.g. Martindale, Micromedex, American society of health-system pharmacists’ (ASHP) drug information monographs and so on, and I thought they may be another great sources of drug data, which are known to be absolutely critical in machine learning (ML) tasks (with models being the other crucial component).
I then started by having a look at SIDER database, which is also another great resource for ADRs and seems very detailed and useful at the first glance. However upon more closer looks, I’ve noted a few things that may need to be improved in order for it to be even more useful. I’ve noted that it is only last updated in 2015 (checked on 16th October 2024), so it’s a bit outdated and appears to be no longer maintained by anyone by the look of a few unanswered issues in its repository (well, I’ve checked back at the SIDER database web link again on 17th April 2025, it seems there’s no further fundings to maintain the project so that’s why…). There are only 1430 drugs available but I guess that’s a good starting point. One of its drug examples, imatinib, has had malignant neoplasm listed as the ADR with highest incidence. I would personally categorise it more under carcinogenicity of the drug instead, and it’s dose-related so can be circumvented. Also imatinib is for leukaemia and other bone marrow cancers so it may be a bit misleading to list it as a “side effect” when its therapeutic indication is also for malignancy. At this point, my idea has evolved to see if it’s possible to actually curate a small, detailed and thorough dataset about ADRs, aiming for better data quality than quantity.
Other reasons for this work
It’s been known that combining results from in vitro biological assays is a major issue in ML predictions due to the wide variations in in vitro experiments running in different labs leading to noises in the ML-predicted outcomes. These variations can be from different equipments used (calibration differences), variations in measurements, transcription errors (possible human errors) and likely many others I haven’t mentioned here. So to mitigate this, my thought is to possibly treat clinical trials as live, in vivo biological assays and assemble data from clinical trials (which are what the pharmaceutical manufacturers’ data sheets usually contain). However, the possible underlying problem here is that there are genetics variations in human populations, age group and gender differences, and likely others as well. Another downside is that clinical trials are usually done in very restricted, highly selective and controlled environments so they won’t be very representative of the real patient populations. This is also the reason why I’m including postmarketing reports in the dataset I’ve been curating. The only thing that may possibly help with these issues is then to document data in details so that these specific factors can be taken into account when training ML models (e.g. when tuning hyperparameters or by using other strategies).
TL;DR: It may be good to add another layer of data complexity, e.g. the Flockhart cytochrome P450 drug-drug interaction table (Flockhart et al. 2021), to better reflect real-life drug effects in human uses. One more reason for this work which springs to my mind after viewing several different ADMET prediction applications are that most of them seem to be using in vitro-based data (a few other ones use PubChem, which can vary too from different sources e.g. journal papers or from drugbank etc.). I think if the Flockhart table can be added at the same time (to include in-vivo evidence in humans, although as I look through some of the citations for drugs in the table, it is also true that not all of them have in-vivo human-use evidence, however it should have a decent number of them in it). Overall, this may make ADMET research more well-covered and may improve data quality even more. Another thing is Flockhart table differentiates drugs with strong clinical evidence from the moderate ones, and also other ones pending reviews - this is probably something in-vitro data won’t be able to provide.
What have been done so far
Data sources
Details of data sources used and related information for the current ADRs dataset (file name: cyp_substrates_adrs.csv) will be documented in the ADRs data note. The same data note also applies to the earlier smaller dataset on ADRs for CYP3A4 substrates (file name: cyp3a4_substrates.csv) as well, which is used in the notebook mentioned below.
Data curation
Before immersing in the process of curating ADRs data, I’ve made a visit to PubMed briefly to search for papers regarding ADRs, trying to find out which drugs or therapeutic areas are commonly involved in ADRs in real-life. I’ve found 3 papers (Onakpoya, Heneghan, and Aronson 2016), (Insani et al. 2021), (Lisha et al. 2017) that have investigated within critical care settings in hospitals, primary care settings in the communities and worldwide drug withdrawals regarding ADRs. The key problematic therapeutic areas involved in ADRs are:
- antimicrobials (for critical medical setting)
- cardiovascular system/anticoagulants (for critical & primary settings)
- analgesics and sedatives (for critical surgical setting)
- hepatic issues (applies in general)
- central nervous system issues (applies in general)
While curating the ADRs dataset, I also have a feeling that this type of data may have already existed in proprietary settings, but probably not the open-source versions or if they do, it may have a low chance of being curated by a pharmacist/researcher (me here) who may have a slightly different angle or perspective on therapeutic drugs in general.
Currently, an introductory ADRs regressor notebook has been done and stored in this repository showing how ADRs can be represented as PyTorch tensors (with the idea originated from natural language processing (NLP)) in a simple two-layer deep neural network model after I’ve initially curated ADRs for CYP3A4 substrates only. From there, I’ve then attempted to predict therapeutic drug classes for a testing set of molecules within this dataset using shuffled Butina split. The dataset of course is way too small for a deep learning model, and is likely flawed, biased and prone to data leakage etc., but the main idea here is to show that we may be able to observe some patterns or links between drugs (or molecular fingerprints) and ADRs first.
After that, I’ve gone on to curate more CYP substrates-ADRs data based on the Flockhart table, which currently contains ADRs for CYP3A4, 2D6, 2C19, 2C9, 1A2, 2B6, 2E1 and 2C8 substrates. The ADRs documented in the dataset have focussed on ones that are more likely going to affect the qualities of lives of people administering the drugs (CYP substrates) and ones that are potentially going to be life-threatening.
ADRs flow chart
Below is a flow chart showing possible interactions between drugs and ADRs, with dotted lines representing possible or indirect relationships. This is also not meant to be comprehensive as I’m sure there are many other aspects I haven’t mentioned here, the purpose here is to really stimulate further research ideas about ADRs.
flowchart LR
A(drugs) -.-> B(activities)
A -- ? via other targets --> C(adverse effects)
D(clinical trials) --> C(ADRs)
E(postmarketing reports) --> C(ADRs)
F(biological target) <--> B(biological activities)
A <--> F(biological target)
F -.-> C(ADRs)
G(CYP3A4/5 substrates) --> C(ADRs)
A --> G(CYP450 substrates)
H(CYP450-drug metabolism) --> G(CYP450 substrates)
A --> H(CYP450 metabolism)
H --> I(CYP450 inhibitors)
I --> C(ADRs)
A --> D(clinical trials)
A --> E(postmarketing reports)
A --> I(CYP450 inhibitors)
I --> G(CYP450 substrates)
Possible future work
Here are my very early thoughts on how this ADRs dataset may be further used and extended in the future:
- Exploring the possibility of structure-ADRs relationships where we may be able to link ADRs via treating them as dense vectors of real numbers (early demonstration in the introductory ADRs regressor notebook) with two-dimensional (2D) drug molecular structures. This is coming from the commonly known structure-activity or property relationships used for small molecules in drug discovery or materials science, where we’re actively seeking the connections between drug or molecular activities with their 2D molecular structures. I’m just having a very naive thought here about why not looking at this from the ADRs perspective, rather than the usual activity/property part of the molecules. This may shed some lights about how ADRs relate to chemical structures of drugs, potentially may be useful when desgining drugs (a bit like a risk aversion way).
By focussing on specific therapeutic targets (within the same therapeutic drug classes), and trying to learn the patterns of ADRs for the same target. It may be possible to look at what common CYP enzymes are involved in their metabolisms and what common ADRs they all share - trying to link them back to drug structures and we may be able to set up ADRs profiles for each CYP enzyme that can be drug class-specific or CYP enzyme type-specific. From here, we may be able to draw out drug-ADRs lineages e.g. a parent compound is metabolised to a child compound where both may be biologically active and related in therapeutic uses with similar or different ADRs.
This work is opened to other ideas as well as I’m sure there are very likely other things I haven’t thought of yet!
References
Flockhart, DA., D. Thacker, C. McDonald, and Z. Desta. 2021.
“The Flockhart Cytochrome P450 Drug-Drug Interaction Table.” https://drug-interactions.medicine.iu.edu/.
Insani, WN., C. Whittlesea, H. Alwafi, KKC. Man, S. Chapman, and L. Wei. 2021.
“Prevalence of Adverse Drug Reactions in the Primary Care Setting: A Systematic Review and Meta-Analysis.” PLoS ONE 16 (5).
https://doi.org/10.1371/journal.pone.0252161.
Lisha, J., V. Annalakshmi, J. Maria, and D. Padmini. 2017.
“Adverse Drug Reactions in Critical Care Settings: A Systematic Review.” Current Drug Safety 12 (3): 147–61.
https://doi.org/10.2174/1574886312666170710192409.
Onakpoya, IJ., CJ. Heneghan, and JK. Aronson. 2016.
“Worldwide Withdrawal of Medicinal Products Because of Adverse Drug Reactions: A Systematic Review and Analysis.” Critical Reviews in Toxicology 46 (6): 477–89.
https://doi.org/10.3109/10408444.2016.1149452.