Random forest

Series 2.2 - model building, imbalanced dataset, feature importances & hyperparameter tuning

Machine learning projects
Tree models
Pandas
Scikit-learn
ChEMBL database
Python
Author

Jennifer HY Lin

Published

November 22, 2023

Post updated on 3rd May 2024 - Added comment regarding ImbalancedLearningRegression package (installation tip) & Jupyter notebook link of this post

Quick overview of this post
  • Short introduction of random forest
  • Random forest methods or classes in scikit-learn
  • Random forest regressor model in scikit-learn
  • Training and testing data splits
    • ChEMBL-assigned max phase splits
    • Imbalanced learning regression and max phase splits
  • Scoring metrics of trained models
  • Feature importances in dataset
    • feature_importances_ attribute in scikit-learn
    • permutation_importance function in scikit-learn
    • SHAP approach
  • Hyperparameter tuning on number of trees


What is a random forest?

The decision tree model built last time was based on one model on its own, which often might not be as accurate or as reflective of real-life data. To improve on this, the average outcome from multiple models (Breiman 1998) should be considered to see if it would provide a more realistic picture. This model averaging approach is also commonly used in daily life, for example, taking a majority vote during decision-making.

The same model averaging concept was also used in random forest (Breiman 2001), which, as the name suggested, was composed of many decision trees (models) forming a forest. Each tree would make its own prediction, and averaging these predictions across all the trees produced one single result in the end. The advantage of this was improved prediction accuracy through reduced variance, and it also minimised the problem of overfitting that could arise from relying on a single model only (more details in section 1.11.2.1. Random Forests from scikit-learn).

The “random” part of the random forest was introduced in two ways. The first one was via bootstrap samples, also known as bagging or bootstrap aggregating (Bruce, Bruce, and Gedeck 2020), where samples were drawn with replacement from the training dataset for each tree built in the ensemble (also known as the perturb-and-combine technique (Breiman 1998)), so that each tree saw a slightly different, randomised version of the training data. The second way randomness was introduced was by using a random subset of features for splitting at the nodes (a full set of features could also be used, although this was generally not recommended). The main goal here was to achieve the best split at each node.
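As a quick illustration only (a minimal sketch, not part of the workflow in this post), these two sources of randomness map directly onto parameters of scikit-learn’s random forest estimators:

# Sketch: where the two sources of randomness appear as estimator parameters
from sklearn.ensemble import RandomForestRegressor

rf_sketch = RandomForestRegressor(
    n_estimators=100,   # number of trees in the forest
    bootstrap=True,     # each tree trained on a bootstrap sample (drawn with replacement)
    max_features=0.3,   # fraction of features randomly considered at each split
    random_state=1      # fix the randomness for reproducibility
)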


Random forest in scikit-learn

Scikit-learn had two main types of random forest classes - ensemble.RandomForestClassifier() and ensemble.RandomForestRegressor(). Which class to use depended on the target values: class labels (binary or categorical variables, e.g. yes or no, or other categories to be assigned) called for the classifier, while continuous (numerical) variables called for the regressor. Since I was continuing with the same dataset from the decision tree series, the target here was a continuous variable, pKi, the inhibition constant.

There were also two other alternative random forest methods in scikit-learn - ensemble.RandomTreesEmbedding() and ensemble.ExtraTreesClassifier() or ensemble.ExtraTreesRegressor(). RandomTreesEmbedding() differed in that it was an unsupervised method that used data transformations (more details in section 1.11.2.6. on “Totally Random Trees Embedding” in scikit-learn). On the other hand, ExtraTreesClassifier() or ExtraTreesRegressor() could be used to generate extremely randomised trees that went one level up in randomness (more details in section 1.11.2.2. on Extremely Randomized Trees from scikit-learn). The main difference for this type of random forest was that, on top of the random subset of features already selected at each node (with the intention of picking the most discerning features), the splitting thresholds themselves were drawn at random rather than optimised. The advantage of this method was that it reduced the variance of the model a bit further, but the downside was a slight increase in bias within the model.
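To make the class choices above a bit more concrete, below was a small sketch (for illustration only, not used elsewhere in this post) showing how each estimator could be created, with RandomForestRegressor() being the one suited to a continuous target such as pKi:

# Sketch: the main random forest-related classes in scikit-learn's ensemble module
from sklearn.ensemble import (
    RandomForestClassifier,   # class labels e.g. active/inactive
    RandomForestRegressor,    # continuous target e.g. pKi (used in this post)
    ExtraTreesRegressor,      # extremely randomised trees - random split thresholds
    RandomTreesEmbedding      # unsupervised, sparse tree-based data transformation
)

rf_clf = RandomForestClassifier(n_estimators=100, random_state=1)
rf_reg = RandomForestRegressor(n_estimators=100, random_state=1)
extra_reg = ExtraTreesRegressor(n_estimators=100, random_state=1)
embedding = RandomTreesEmbedding(n_estimators=100, random_state=1)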


Building a random forest regressor model using scikit-learn

As usual, all the required libraries were imported first.

import pandas as pd
import sklearn
from sklearn.ensemble import RandomForestRegressor

# For imbalanced datasets in regression 
# May need to set env variable (SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True) when installing
# due to package dependency on older sklearn version
import ImbalancedLearningRegression as iblr

# Plots
import matplotlib.pyplot as plt
import seaborn as sns

# Metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Feature importances
# Permutation_importance
from sklearn.inspection import permutation_importance
# SHAP values
import shap

# Hyperparameter tuning
from sklearn.model_selection import cross_val_score, RepeatedKFold

from numpy import mean, std
from natsort import index_natsorted
import numpy as np

# Showing version of scikit-learn used
print(sklearn.__version__)
1.3.2

Importing dataset that was preprocessed from last time - link to data source: first decision tree post.

data = pd.read_csv("ache_2d_chembl.csv")
data.drop(columns = ["Unnamed: 0"], inplace=True)
# Preparing data for compounds with max phase with "NaN" by re-labelling to "null"
data["max_phase"].fillna("null", inplace=True)
data.head()
Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value 'null' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
molecule_chembl_id pKi max_phase mw fsp3 n_lipinski_hba n_lipinski_hbd n_rings n_hetero_atoms n_heavy_atoms ... sas n_aliphatic_carbocycles n_aliphatic_heterocyles n_aliphatic_rings n_aromatic_carbocycles n_aromatic_heterocyles n_aromatic_rings n_saturated_carbocycles n_saturated_heterocyles n_saturated_rings
0 CHEMBL60745 8.787812 null 245.041526 0.400000 2 1 1 3 13 ... 3.185866 0 0 0 1 0 1 0 0 0
1 CHEMBL208599 10.585027 null 298.123676 0.388889 2 2 4 3 21 ... 4.331775 2 0 2 1 1 2 0 0 0
2 CHEMBL95 6.821023 4.0 198.115698 0.307692 2 2 3 2 15 ... 2.014719 1 0 1 1 1 2 0 0 0
3 CHEMBL173309 7.913640 null 694.539707 0.666667 8 0 2 8 50 ... 2.803680 0 0 0 2 0 2 0 0 0
4 CHEMBL1128 6.698970 4.0 201.092042 0.400000 2 1 1 3 13 ... 3.185866 0 0 0 1 0 1 0 0 0

5 rows × 25 columns


Training/testing splits

Two approaches were used, where one was based purely on max phase split (between max phases null and 4), which was used last time in the decision tree series, and the other one was using the same max phase split but with an ImbalancedLearningRegression method added on top of it.


Preparing training data using max phase split

X variable was set up first from the dataframe, and then converted into a NumPy array, which consisted of the number of samples and number of features. This was kept the same as how it was in the decision tree posts.

Note

It’s usually recommended to copy the original data or dataframe before doing any data manipulations to avoid unnecessary changes to the original dataset (this was not used in the decision tree posts, but since I’m going to use the same set of data again I’m doing it here.)

# X variables (molecular features)
# Make a copy of the original dataframe first
data_mp4 = data.copy()
# Selecting all max phase 4 compounds
data_mp4 = data_mp4[data_mp4["max_phase"] == 4]
print(data_mp4.shape)
data_mp4.head()
(10, 25)
molecule_chembl_id pKi max_phase mw fsp3 n_lipinski_hba n_lipinski_hbd n_rings n_hetero_atoms n_heavy_atoms ... sas n_aliphatic_carbocycles n_aliphatic_heterocyles n_aliphatic_rings n_aromatic_carbocycles n_aromatic_heterocyles n_aromatic_rings n_saturated_carbocycles n_saturated_heterocyles n_saturated_rings
2 CHEMBL95 6.821023 4.0 198.115698 0.307692 2 2 3 2 15 ... 2.014719 1 0 1 1 1 2 0 0 0
4 CHEMBL1128 6.698970 4.0 201.092042 0.400000 2 1 1 3 13 ... 3.185866 0 0 0 1 0 1 0 0 0
6 CHEMBL640 6.000000 4.0 235.168462 0.461538 4 3 1 4 17 ... 1.791687 0 0 0 1 0 1 0 0 0
9 CHEMBL502 7.688246 4.0 379.214744 0.458333 4 0 4 4 28 ... 2.677222 1 1 2 2 0 2 0 1 1
131 CHEMBL481 7.296709 4.0 586.279135 0.515152 10 1 7 10 43 ... 3.632560 0 4 4 1 2 3 0 2 2

5 rows × 25 columns

# Select molecular features for X array (n_samples, n_features)
X_mp4_df = data_mp4[['mw', 'fsp3', 'n_lipinski_hba', 'n_lipinski_hbd', 'n_rings', 'n_hetero_atoms', 'n_heavy_atoms', 'n_rotatable_bonds', 'n_radical_electrons', 'tpsa', 'qed', 'clogp', 'sas', 'n_aliphatic_carbocycles', 'n_aliphatic_heterocyles', 'n_aliphatic_rings', 'n_aromatic_carbocycles', 'n_aromatic_heterocyles', 'n_aromatic_rings', 'n_saturated_carbocycles', 'n_saturated_heterocyles', 'n_saturated_rings']]

print(X_mp4_df.shape)
X_mp4_df.head()
(10, 22)
mw fsp3 n_lipinski_hba n_lipinski_hbd n_rings n_hetero_atoms n_heavy_atoms n_rotatable_bonds n_radical_electrons tpsa ... sas n_aliphatic_carbocycles n_aliphatic_heterocyles n_aliphatic_rings n_aromatic_carbocycles n_aromatic_heterocyles n_aromatic_rings n_saturated_carbocycles n_saturated_heterocyles n_saturated_rings
2 198.115698 0.307692 2 2 3 2 15 0 0 38.91 ... 2.014719 1 0 1 1 1 2 0 0 0
4 201.092042 0.400000 2 1 1 3 13 2 0 20.23 ... 3.185866 0 0 0 1 0 1 0 0 0
6 235.168462 0.461538 4 3 1 4 17 6 0 58.36 ... 1.791687 0 0 0 1 0 1 0 0 0
9 379.214744 0.458333 4 0 4 4 28 6 0 38.77 ... 2.677222 1 1 2 2 0 2 0 1 1
131 586.279135 0.515152 10 1 7 10 43 4 0 114.20 ... 3.632560 0 4 4 1 2 3 0 2 2

5 rows × 22 columns

# Convert X_mp4_df to numpy array
X_mp4 = X_mp4_df.to_numpy()
X_mp4
array([[ 1.98115698e+02,  3.07692308e-01,  2.00000000e+00,
         2.00000000e+00,  3.00000000e+00,  2.00000000e+00,
         1.50000000e+01,  0.00000000e+00,  0.00000000e+00,
         3.89100000e+01,  7.06488238e-01,  2.69580000e+00,
         2.01471913e+00,  1.00000000e+00,  0.00000000e+00,
         1.00000000e+00,  1.00000000e+00,  1.00000000e+00,
         2.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [ 2.01092042e+02,  4.00000000e-01,  2.00000000e+00,
         1.00000000e+00,  1.00000000e+00,  3.00000000e+00,
         1.30000000e+01,  2.00000000e+00,  0.00000000e+00,
         2.02300000e+01,  6.08112327e-01, -1.01700000e+00,
         3.18586632e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  1.00000000e+00,  0.00000000e+00,
         1.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [ 2.35168462e+02,  4.61538462e-01,  4.00000000e+00,
         3.00000000e+00,  1.00000000e+00,  4.00000000e+00,
         1.70000000e+01,  6.00000000e+00,  0.00000000e+00,
         5.83600000e+01,  7.31539693e-01,  1.34040000e+00,
         1.79168720e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  1.00000000e+00,  0.00000000e+00,
         1.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [ 3.79214744e+02,  4.58333333e-01,  4.00000000e+00,
         0.00000000e+00,  4.00000000e+00,  4.00000000e+00,
         2.80000000e+01,  6.00000000e+00,  0.00000000e+00,
         3.87700000e+01,  7.47461492e-01,  4.36110000e+00,
         2.67722173e+00,  1.00000000e+00,  1.00000000e+00,
         2.00000000e+00,  2.00000000e+00,  0.00000000e+00,
         2.00000000e+00,  0.00000000e+00,  1.00000000e+00,
         1.00000000e+00],
       [ 5.86279135e+02,  5.15151515e-01,  1.00000000e+01,
         1.00000000e+00,  7.00000000e+00,  1.00000000e+01,
         4.30000000e+01,  4.00000000e+00,  0.00000000e+00,
         1.14200000e+02,  3.55955569e-01,  4.09110000e+00,
         3.63256044e+00,  0.00000000e+00,  4.00000000e+00,
         4.00000000e+00,  1.00000000e+00,  2.00000000e+00,
         3.00000000e+00,  0.00000000e+00,  2.00000000e+00,
         2.00000000e+00],
       [ 5.10461822e+02,  8.00000000e-01,  6.00000000e+00,
         0.00000000e+00,  1.00000000e+00,  6.00000000e+00,
         3.60000000e+01,  2.10000000e+01,  0.00000000e+00,
         2.76900000e+01,  2.05822189e-01,  5.45250000e+00,
         3.25765349e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  1.00000000e+00,  0.00000000e+00,
         1.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [ 1.84066459e+02,  1.00000000e+00,  3.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  5.00000000e+00,
         1.10000000e+01,  4.00000000e+00,  0.00000000e+00,
         3.55300000e+01,  6.29869319e-01,  2.91400000e+00,
         3.34514393e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [ 2.87152144e+02,  5.29411765e-01,  4.00000000e+00,
         1.00000000e+00,  4.00000000e+00,  4.00000000e+00,
         2.10000000e+01,  1.00000000e+00,  0.00000000e+00,
         4.19300000e+01,  8.00524269e-01,  1.85030000e+00,
         4.22684283e+00,  1.00000000e+00,  2.00000000e+00,
         3.00000000e+00,  1.00000000e+00,  0.00000000e+00,
         1.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [ 3.48142697e+02,  3.68421053e-01,  2.00000000e+00,
         0.00000000e+00,  3.00000000e+00,  4.00000000e+00,
         2.30000000e+01,  5.00000000e+00,  0.00000000e+00,
         6.48000000e+00,  7.09785317e-01,  5.44140000e+00,
         4.22359068e+00,  0.00000000e+00,  1.00000000e+00,
         1.00000000e+00,  2.00000000e+00,  0.00000000e+00,
         2.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [ 2.34092376e+02,  3.07692308e-01,  2.00000000e+00,
         2.00000000e+00,  3.00000000e+00,  3.00000000e+00,
         1.60000000e+01,  0.00000000e+00,  0.00000000e+00,
         3.89100000e+01,  7.60853221e-01,  3.11760000e+00,
         3.21871482e+00,  1.00000000e+00,  0.00000000e+00,
         1.00000000e+00,  1.00000000e+00,  1.00000000e+00,
         2.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00]])

Again, y variable was arranged via the dataframe as well, and converted into a NumPy array. It consisted of the number of samples only as this was the target variable.

# y array (n_samples) - target outcome pKi
y_mp4_df = data_mp4["pKi"]
y_mp4_df
2      6.821023
4      6.698970
6      6.000000
9      7.688246
131    7.296709
133    4.431798
160    5.221849
171    6.522879
180    4.607303
195    6.995679
Name: pKi, dtype: float64
# Convert y_mp4_df to numpy array
y_mp4 = y_mp4_df.to_numpy()
y_mp4
array([6.82102305, 6.69897   , 6.        , 7.68824614, 7.29670862,
       4.43179828, 5.22184875, 6.52287875, 4.60730305, 6.99567863])


Training model using max phase split only

Both X and y variables were used to fit the RandomForestRegressor() estimator.

# n_estimators = 100 by default
# note: if wanting to use whole dataset - switch off "bootstrap" parameter by using "False"
rfreg = RandomForestRegressor(max_depth=3, random_state=1, max_features=0.3)
rfreg.fit(X_mp4, y_mp4)
RandomForestRegressor(max_depth=3, max_features=0.3, random_state=1)


Preparing testing data using max phase split only

Testing data was based on the compounds with max phase originally recorded as “NaN”, which I re-labelled as “null” above.

data_mp_null = data.copy()
# Selecting all max phase "null" compounds
data_mp_null = data_mp_null[data_mp_null["max_phase"] == "null"]
print(data_mp_null.shape)
data_mp_null.head() 
(466, 25)
molecule_chembl_id pKi max_phase mw fsp3 n_lipinski_hba n_lipinski_hbd n_rings n_hetero_atoms n_heavy_atoms ... sas n_aliphatic_carbocycles n_aliphatic_heterocyles n_aliphatic_rings n_aromatic_carbocycles n_aromatic_heterocyles n_aromatic_rings n_saturated_carbocycles n_saturated_heterocyles n_saturated_rings
0 CHEMBL60745 8.787812 null 245.041526 0.400000 2 1 1 3 13 ... 3.185866 0 0 0 1 0 1 0 0 0
1 CHEMBL208599 10.585027 null 298.123676 0.388889 2 2 4 3 21 ... 4.331775 2 0 2 1 1 2 0 0 0
3 CHEMBL173309 7.913640 null 694.539707 0.666667 8 0 2 8 50 ... 2.803680 0 0 0 2 0 2 0 0 0
5 CHEMBL102226 4.698970 null 297.152928 0.923077 3 0 0 5 18 ... 2.965170 0 0 0 0 0 0 0 0 0
7 CHEMBL103873 5.698970 null 269.121628 0.909091 3 0 0 5 16 ... 3.097106 0 0 0 0 0 0 0 0 0

5 rows × 25 columns

# Set up X test variable with the same molecular features
X_mp_test_df = data_mp_null[['mw', 'fsp3', 'n_lipinski_hba', 'n_lipinski_hbd', 'n_rings', 'n_hetero_atoms', 'n_heavy_atoms', 'n_rotatable_bonds', 'n_radical_electrons', 'tpsa', 'qed', 'clogp', 'sas', 'n_aliphatic_carbocycles', 'n_aliphatic_heterocyles', 'n_aliphatic_rings', 'n_aromatic_carbocycles', 'n_aromatic_heterocyles', 'n_aromatic_rings', 'n_saturated_carbocycles', 'n_saturated_heterocyles', 'n_saturated_rings']]

# Convert X test variables from df to arrays
X_mp_test = X_mp_test_df.to_numpy()

X_mp_test
array([[2.45041526e+02, 4.00000000e-01, 2.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [2.98123676e+02, 3.88888889e-01, 2.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [6.94539707e+02, 6.66666667e-01, 8.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [3.11152144e+02, 3.15789474e-01, 4.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [3.68096076e+02, 9.23076923e-01, 4.00000000e+00, ...,
        0.00000000e+00, 2.00000000e+00, 2.00000000e+00],
       [2.46136828e+02, 5.00000000e-01, 4.00000000e+00, ...,
        0.00000000e+00, 3.00000000e+00, 3.00000000e+00]])


Training/testing splits using ImbalancedLearningRegression and max phase splits

I didn’t really pay a lot of attention to data splitting in the decision tree series, as my main focus was on building a single tree in order to fully understand and see what could be derived from just one tree. Now that I’ve reached this series on random forest, I realised I forgot to mention in the last series that data splitting was actually very crucial for model performance and could influence outcome predictions. It could also become quite complicated, as many approaches were available to split the data, and the way the data was split could produce different outcomes.

After I’d split the same dataset based on the compounds’ max phase assignments in ChEMBL and fitted the training data to the random forest regressor, I went back and noticed that the training and testing data were very imbalanced, and that I should probably do something about it before fitting them onto another model.

At this stage, I went further to look into whether imbalanced datasets should be addressed in regression tasks, and did a surface search online. Based on common ML consensus, addressing imbalanced datasets was more applicable to classification tasks (e.g. binary or multi-class labels) than to regression problems. However, recent ML research has looked into the issue of imbalanced datasets in regression. This blog post mentioned a few studies on this type of problem, and I thought they were very interesting and worth a mention at least. One of them that I’ve looked into was SMOTER, which was based on the synthetic minority over-sampling technique (SMOTE) (Chawla et al. 2002), and was named this way because it was basically SMOTE for regression (hence SMOTER) (Torgo et al. 2013). Synthetic minority over-sampling technique for regression with Gaussian noise (SMOGN) (Kunz 2020) was another technique that was built upon SMOTER, but with Gaussian noise added. This subsequently led me to the ImbalancedLearningRegression library (Wu, Kunz, and Branco 2022), which was a variation of SMOGN. This was the one used on my imbalanced dataset, as shown in the section below.

A simple flow diagram was drawn below showing the evolution of different techniques when dealing with imbalanced datasets in classification (SMOTE) and regression (SMOTER, SMOGN and ImbalancedLearningRegression):

flowchart LR
  A(SMOTE) --> B(SMOTER)
  B --> C(SMOGN)
  C --> D(ImbalancedLearningRegression)

GitHub repository for ImbalancedLearningRegression package is available here, with its documentation available here.

Also, I just wanted to mention that these were not the only techniques available for treating imbalanced datasets in regression, as there were other ones in the literature and most likely more are being developed currently, but I only had time to cover these here for now.

I’d also like to mention another really useful open-source resource for treating imbalanced datasets in classification - the imbalanced-learn library - although I did not use it in this post since the problem here was more of a regression one than a classification one.

# Original dataset - checking shape again
print(data.shape)
data.head()
(481, 25)
molecule_chembl_id pKi max_phase mw fsp3 n_lipinski_hba n_lipinski_hbd n_rings n_hetero_atoms n_heavy_atoms ... sas n_aliphatic_carbocycles n_aliphatic_heterocyles n_aliphatic_rings n_aromatic_carbocycles n_aromatic_heterocyles n_aromatic_rings n_saturated_carbocycles n_saturated_heterocyles n_saturated_rings
0 CHEMBL60745 8.787812 null 245.041526 0.400000 2 1 1 3 13 ... 3.185866 0 0 0 1 0 1 0 0 0
1 CHEMBL208599 10.585027 null 298.123676 0.388889 2 2 4 3 21 ... 4.331775 2 0 2 1 1 2 0 0 0
2 CHEMBL95 6.821023 4.0 198.115698 0.307692 2 2 3 2 15 ... 2.014719 1 0 1 1 1 2 0 0 0
3 CHEMBL173309 7.913640 null 694.539707 0.666667 8 0 2 8 50 ... 2.803680 0 0 0 2 0 2 0 0 0
4 CHEMBL1128 6.698970 4.0 201.092042 0.400000 2 1 1 3 13 ... 3.185866 0 0 0 1 0 1 0 0 0

5 rows × 25 columns

So my little test of the ImbalancedLearningRegression package started below.

iblr_data = data.copy()

# Introducing Gaussian noise for data sampling
data_gn = iblr.gn(data = iblr_data, y = "pKi", pert = 1)
print(data_gn.shape)
(480, 25)
# Followed by max phase split, where max phase 4 = training dataset
data_gn_mp4 = data_gn[data_gn["max_phase"] == 4]
data_gn_mp4
print(data_gn_mp4.shape)
(7, 25)
# Also splitted max phase null compounds = testing dataset
data_gn_mp_null = data_gn[data_gn["max_phase"] == "null"]
data_gn_mp_null
print(data_gn_mp_null.shape)
(465, 25)

There were several different sampling techniques in the ImbalancedLearningRegression package. I’ve only tried random over-sampling, under-sampling and Gaussian noise, but there were also other ones such as SMOTE and ADASYN (over-sampling techniques) or condensed nearest neighbour, Tomek links and edited nearest neighbour (under-sampling techniques) that I haven’t used.

Random over-sampling actually oversampled the max phase null compounds (sample size increased), while keeping all 10 max phase 4 compounds. Under-sampling removed all of the max phase 4 compounds (which was most likely not the best option, since I was aiming to use them as training compounds), with the max phase null compounds also reduced in size. Due to post length, I did not show my code for random over-sampling and under-sampling, but for people who are interested, a rough sketch is included below and I think it would be interesting to test them out.
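For anyone wanting to try them, below was a rough sketch of what those two sampling calls might look like (not run in this post, and the ro() and random_under() function names were taken from the ImbalancedLearningRegression documentation rather than from my original code, so please double-check them against the package docs):

# Sketch only - random over-sampling & random under-sampling on the same dataset
# (function names assumed from the ImbalancedLearningRegression documentation)
import ImbalancedLearningRegression as iblr

# Random over-sampling on the target variable pKi
data_ro = iblr.ro(data = iblr_data, y = "pKi")
print(data_ro.shape)

# Random under-sampling on the target variable pKi
data_under = iblr.random_under(data = iblr_data, y = "pKi")
print(data_under.shape)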

I ended up using Gaussian noise sampling, which reduced the max phase 4 compounds slightly (from 10 to 7) while leaving the max phase null compounds almost unchanged (466 to 465), and this seemed to be the most balanced data sampling at the first try. (Note: as stated in the documentation for the ImbalancedLearningRegression package, missing values within features would be removed automatically; I’ve taken care of this in my last series of posts so no differences were observed here.)

The change in the distribution of pKi values for the Gaussian noise sampling method between the original and sample-modified datasets could be seen in the kernel density estimate plot below. The modified dataset had a flatter target density curve than the original density plot, which was more concentrated and peaked between pKi values of 6 and 8. The range of pKi values for the ten max phase 4 compounds collected was between 4 and 8.


# Quick look at how the pKi values differed 
# after applying Gaussian noise sampling to dataset
# Plot target variable, pKi distributions
sns.kdeplot(data["pKi"], label = "Original")
sns.kdeplot(data_gn["pKi"], label = "Modified")
plt.legend(labels = ["Original", "Modified"])
<matplotlib.legend.Legend at 0x13b87ae60>

Next, the modified ImbalancedLearningRegression-Gaussian noise (iblr-gn) training data was converted into a NumPy array.

# Select molecular features for X variable
X_mp4_gn_df = data_gn_mp4[['mw', 'fsp3', 'n_lipinski_hba', 'n_lipinski_hbd', 'n_rings', 'n_hetero_atoms', 'n_heavy_atoms', 'n_rotatable_bonds', 'n_radical_electrons', 'tpsa', 'qed', 'clogp', 'sas', 'n_aliphatic_carbocycles', 'n_aliphatic_heterocyles', 'n_aliphatic_rings', 'n_aromatic_carbocycles', 'n_aromatic_heterocyles', 'n_aromatic_rings', 'n_saturated_carbocycles', 'n_saturated_heterocyles', 'n_saturated_rings']]

print(X_mp4_gn_df.shape)
X_mp4_gn_df.head()
(7, 22)
mw fsp3 n_lipinski_hba n_lipinski_hbd n_rings n_hetero_atoms n_heavy_atoms n_rotatable_bonds n_radical_electrons tpsa ... sas n_aliphatic_carbocycles n_aliphatic_heterocyles n_aliphatic_rings n_aromatic_carbocycles n_aromatic_heterocyles n_aromatic_rings n_saturated_carbocycles n_saturated_heterocyles n_saturated_rings
244 348.142697 0.368421 2.0 0.0 3.0 4.0 23.0 5.0 0 6.48 ... 4.223591 0.0 1.0 1.0 2.0 0.0 2.0 0.0 0.0 0.0
247 201.092042 0.400000 2.0 1.0 1.0 3.0 13.0 2.0 0 20.23 ... 3.185866 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
266 184.066459 1.000000 3.0 0.0 0.0 5.0 11.0 4.0 0 35.53 ... 3.345144 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
365 198.115698 0.307692 2.0 2.0 3.0 2.0 15.0 0.0 0 38.91 ... 2.014719 1.0 0.0 1.0 1.0 1.0 2.0 0.0 0.0 0.0
385 235.168462 0.461538 4.0 3.0 1.0 4.0 17.0 6.0 0 58.36 ... 1.791687 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0

5 rows × 22 columns

X_mp4_gn = X_mp4_gn_df.to_numpy()

Similarly, this was also applied to the target y variable in the iblr-gn dataset.

# y variable (target outcome - pKi)
y_mp4_gn_df = data_gn_mp4["pKi"]

y_mp4_gn = y_mp4_gn_df.to_numpy()
y_mp4_gn
array([4.60730305, 6.69897   , 5.22184875, 6.82102305, 6.        ,
       7.68824614, 6.99567863])

Then the iblr-gn training data were fitted onto another random forest regressor model.

# n_estimators = 100 by default
# note: if wanting to use whole dataset - switch off "bootstrap" parameter by using "False"
rfreg_gn = RandomForestRegressor(max_depth=3, random_state=1, max_features=0.3)
rfreg_gn.fit(X_mp4_gn, y_mp4_gn)
RandomForestRegressor(max_depth=3, max_features=0.3, random_state=1)

Modified iblr-gn testing data were also prepared and converted into a NumPy array.

# Set up X test variable with the same molecular features
X_mp_gn_test_df = data_gn_mp_null[['mw', 'fsp3', 'n_lipinski_hba', 'n_lipinski_hbd', 'n_rings', 'n_hetero_atoms', 'n_heavy_atoms', 'n_rotatable_bonds', 'n_radical_electrons', 'tpsa', 'qed', 'clogp', 'sas', 'n_aliphatic_carbocycles', 'n_aliphatic_heterocyles', 'n_aliphatic_rings', 'n_aromatic_carbocycles', 'n_aromatic_heterocyles', 'n_aromatic_rings', 'n_saturated_carbocycles', 'n_saturated_heterocyles', 'n_saturated_rings']]

# Convert X test variables from df to arrays
X_mp_gn_test = X_mp_gn_test_df.to_numpy()

X_mp_gn_test
array([[5.80224119e+02, 2.64705882e-01, 7.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [3.77210327e+02, 3.91304348e-01, 5.00000000e+00, ...,
        0.00000000e+00, 1.00000000e+00, 1.00000000e+00],
       [5.24297368e+02, 4.54545455e-01, 4.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [3.37167794e+02, 2.85714286e-01, 4.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [5.09129751e+02, 4.16666667e-01, 9.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [3.40189926e+02, 4.21052632e-01, 6.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])


Using trained model for prediction on testing data

Predicting max phase-split data only.

# Predict pKi values for the compounds with "null" max phase
# using the training model rfreg 
# Uncomment code below to print prediction result
#print(rfreg.predict(X_mp_test))

# or use:
y_mp_test = rfreg.predict(X_mp_test)

Predicting iblr-gn data with max phase splits.

y_mp_gn_test = rfreg_gn.predict(X_mp_gn_test)


Scoring and metrics of trained models

Checking model accuracy for both the training and testing datasets was recommended before moving on to looking at feature importances. A scikit-learn explanation for this could be found in the section on “Permutation feature importance”. The accuracy scores for the model were shown below.

# Training set accuracy
print(f"Random forest regressor training accuracy: {rfreg.score(X_mp4, y_mp4):.2f}")

# Testing set accuracy
print(f"Random forest regressor testing accuracy: {rfreg.score(X_mp_test, y_mp_test):.2f}")
Random forest regressor training accuracy: 0.86
Random forest regressor testing accuracy: 1.00

Both the training and testing accuracies for the random forest regressor model (rfreg) looked quite high, suggesting the model had learned the molecular features well from the tiny training set of 10 compounds. One caveat was that the testing accuracy of 1.00 was expected by construction here, since y_mp_test held the model’s own predictions rather than actual pKi values (the performance against the actual values of the 466 max phase null compounds was assessed with MSE and R2 further below). Still, this confirmed that the model was indeed producing predictions on the testing set, so it was worth moving on to the feature importances to fill in the bigger picture, i.e. which features were more pivotal towards influencing pKi values of approved drugs targeting acetylcholinesterase (AChE).

Similar model accuracy scores were also generated for the iblr-gn modified dataset, which appeared to follow a similar pattern to the max phase-split dataset.

# iblr-Gaussian noise & max phase splitted data
# Training set accuracy
print(f"Random forest regressor training accuracy: {rfreg_gn.score(X_mp4_gn, y_mp4_gn):.2f}")

# Testing set accuracy
print(f"Random forest regressor testing accuracy: {rfreg_gn.score(X_mp_gn_test, y_mp_gn_test):.2f}")
Random forest regressor training accuracy: 0.79
Random forest regressor testing accuracy: 1.00

Now, setting up y_true, which contained the actual pKi values of the testing set, also converted into a NumPy array.

y_true = data_mp_null["pKi"]
y_true = y_true.to_numpy(copy=True)

I also worked out the mean squared error (MSE) between y_true (the actual pKi values of the max phase null compounds) and the predicted values (y_mp_test). The closer the MSE was to zero, the better the model, meaning fewer errors were present.

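To make the metric a bit more concrete, below was a tiny sketch (using made-up numbers, purely for illustration) showing how MSE could be computed by hand and checked against scikit-learn:

# Sketch: MSE = mean of the squared differences between actual & predicted values
import numpy as np
from sklearn.metrics import mean_squared_error

y_actual_eg = np.array([6.8, 7.0, 5.5])      # hypothetical actual pKi values
y_predicted_eg = np.array([6.5, 7.4, 5.0])   # hypothetical predicted pKi values

print(np.mean((y_actual_eg - y_predicted_eg) ** 2))
print(mean_squared_error(y_actual_eg, y_predicted_eg))   # same result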

# For max phase splitted dataset only
mean_squared_error(y_true, y_mp_test)
2.3988097789702505

The closer R2 (the coefficient of determination) was to 1, the better the model, with a usual range between 0 and 1 (Bruce, Bruce, and Gedeck 2020). If it was negative, the model might not be performing as well as expected. However, there could be exceptions, as other model evaluation metrics should be interpreted together with R2 (a poor R2 on its own does not necessarily indicate a poor model).

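Again as a tiny sketch (same made-up numbers as the MSE example above, purely for illustration), R2 could be computed by hand and checked against scikit-learn:

# Sketch: R2 = 1 - (residual sum of squares / total sum of squares)
import numpy as np
from sklearn.metrics import r2_score

y_actual_eg = np.array([6.8, 7.0, 5.5])      # hypothetical actual pKi values
y_predicted_eg = np.array([6.5, 7.4, 5.0])   # hypothetical predicted pKi values

ss_res = np.sum((y_actual_eg - y_predicted_eg) ** 2)
ss_tot = np.sum((y_actual_eg - np.mean(y_actual_eg)) ** 2)
print(1 - ss_res / ss_tot)
print(r2_score(y_actual_eg, y_predicted_eg))   # same result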

# For max phase splitted dataset only
r2_score(y_true, y_mp_test)
-0.16228227953132635

Because the data was re-sampled with iblr-gn, the array size differed from the original dataset, so here I’ve specifically grabbed the actual pKi values for the max phase null compounds from the iblr-gn modified data.

y_true_gn = data_gn_mp_null["pKi"]
y_true_gn = y_true_gn.to_numpy(copy=True)
# MSE for iblr-gn data
mean_squared_error(y_true_gn, y_mp_gn_test)
5.7895732090189185
# R squared for iblr-gn data
r2_score(y_true_gn, y_mp_gn_test)
-0.7425920410726885

Well, it appeared the iblr-gn dataset might not offer much advantage over the original max phase split method. The max phase split method wasn’t that great either, but it might still be interesting to find out which features were important in relation to the pKi values.


Feature importances

There were two types of feature importances available in scikit-learn, which I’ve described below. I’ve also added a Shapley additive explanations (SHAP) approach to this section as well to show different visualisation styles for feature importances on the same set of data.


feature_importances_ attribute from scikit-learn

The impurity-based feature importances (also known as Gini importance) were shown below.

# Compute feature importances on rfreg training model
feature_imp = rfreg.feature_importances_
# Check what feature_imp looks like (an array)
feature_imp
array([0.0396524 , 0.06608031, 0.01129602, 0.09707957, 0.03625411,
       0.04393835, 0.05287614, 0.03339396, 0.        , 0.14225505,
       0.03562854, 0.16945464, 0.04450665, 0.02810539, 0.01437198,
       0.0220527 , 0.02136604, 0.00803076, 0.04866529, 0.        ,
       0.044468  , 0.04052409])

I decided to write a function to convert a NumPy array into a plot below as this was also needed in the next section.

# Function to convert array to df leading to plots 
# - for use in feature_importances_ & permutation_importance

def feat_imp_plot(feat_imp_array, X_df):

    """
    Function to convert feature importance array into a dataframe, 
    which is then used to plot a bar graph 
    to show the feature importance ranking in the random forest model for the dataset used.

    feat_imp_array is the array obtained from the feature_importances_ attribute, 
    after having an estimator/model fitted.

    X_df is the dataframe for the X variable, 
    where the feature column names will be used in the plot.
    """

    # Convert the feat_imp array into dataframe
    feat_imp_df = pd.DataFrame(feat_imp_array)
    
    # Obtain feature names via column names of dataframe
    # Rename the index as "features"
    feature = X_df.columns.rename("features")

    # Convert the index to dataframe
    feature_name_df = feature.to_frame(index = False)

    # Concatenate feature_imp_df & feature_name_df
    feature_df = pd.concat(
        [feat_imp_df, feature_name_df], 
        axis=1
        ).rename(
            # Rename the column for feature importances
            columns = {0: "feature_importances"}
            ).sort_values(
                # Sort values of feature importances in descending order
                "feature_importances", ascending=False
                )
    
    # Seaborn bar plot
    sns.barplot(
        feature_df, 
        x = "feature_importances", 
        y = "features")
# Testing feat_imp_plot function
feat_imp_plot(feature_imp, X_mp4_df)

An alternative way to plot was via Matplotlib directly (note: Seaborn was built on top of Matplotlib, so the plots were pretty similar). The code below was probably a bit more straightforward, but without the axes named and with the values unsorted (it’s only shown as an example; more code could be added to do this).

# Matplotlib plot
from matplotlib import pyplot as plt
plt.barh(X_mp4_df.columns, rfreg.feature_importances_)
<BarContainer object of 22 artists>


permutation_importance function from scikit-learn

There were known issues with the built-in feature_importances_ attribute in scikit-learn. As quoted from scikit-learn on feature importance evaluation:

… The impurity-based feature importances computed on tree-based models suffer from two flaws that can lead to misleading conclusions. First they are computed on statistics derived from the training dataset and therefore do not necessarily inform us on which features are most important to make good predictions on held-out dataset. Secondly, they favor high cardinality features, that is features with many unique values. Permutation feature importance is an alternative to impurity-based feature importance that does not suffer from these flaws. …

So I’ve also tried the permutation_importance function (a model-agnostic method).

perm_result = permutation_importance(rfreg, X_mp_test, y_mp_test, n_repeats=10, random_state=1, n_jobs=2)

# Checking data type of perm_result
type(perm_result)
sklearn.utils._bunch.Bunch

It normally returns a dictionary-like object (a Bunch) with the following 3 attributes:

  • importances_mean (mean of feature importances)
  • importances_std (standard deviation of feature importances)
  • importances (raw permutation/feature importances scores)

For details on these attributes, this scikit-learn link will add a bit more explanations.
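As a quick sketch of how all three attributes could be looked at together (not used further below), each feature’s mean importance could be printed alongside its standard deviation, sorted from most to least important:

# Sketch: mean permutation importance +/- standard deviation for each feature
for i in perm_result.importances_mean.argsort()[::-1]:
    print(f"{X_mp4_df.columns[i]}: "
          f"{perm_result.importances_mean[i]:.3f} "
          f"+/- {perm_result.importances_std[i]:.3f}")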

I decided to only use importances_mean for now.

perm_imp = perm_result.importances_mean

# Confirm it produces an array
type(perm_imp)
numpy.ndarray
# Using the function feat_imp_plot() on perm_imp result to show plot
feat_imp_plot(perm_imp, X_mp4_df)

It generated a different feature importance ranking (looking at the top 6 features), although it was somewhat similar to the previous one.


SHAP approach

SHAP values (Lundberg et al. 2020), (Shapley et al. 1953) were used here to provide another way to figure out feature importances. The GitHub repository for this SHAP approach could be accessed here.

SHAP’s TreeExplainer() was based on Tree SHAP algorithms (Lundberg et al. 2020), and was used to show and explain feature importances within tree models. It could also be applied to boosted tree models such as LightGBM and XGBoost, as well as other tree models (as explained in the GitHub repository README.md and the documentation link provided). While TreeExplainer() itself was specific to tree-based models, the wider SHAP framework also offered model-agnostic explainers (e.g. KernelExplainer()), which could be quite handy.


shap_explainer = shap.TreeExplainer(rfreg)

# X_test needs to be a dataframe (not numpy array)
# otherwise feature names won't show in plot
shap_values = shap_explainer.shap_values(X_mp_test_df)

# Horizontal bar plot
shap.summary_plot(shap_values, X_mp_test_df, plot_type = "bar")

Dot plot version:

shap.summary_plot(shap_values, X_mp_test_df)

Violin plot:

shap.summary_plot(shap_values, X_mp_test_df, plot_type = "violin")

# Alternative plot option: "layered_violin"


Hyperparameter tuning

An example was shown below on tuning the number of trees (n_estimators) used in the random forest model.

# Function code adapted with thanks from ML Mastery 
# https://machinelearningmastery.com/random-forest-ensemble-in-python/

# ---Evaluate a list of models with different number of trees---

# Define dataset by using the same training dataset as above
X, y = X_mp4, y_mp4

# Define function to generate a list of models with different no. of trees
def models():
    # Create empty dictionary (key, value pairs) for models
    models = dict()
    # Test different number of trees to evaluate
    no_trees = [50, 100, 250, 500, 1000]
    for n in no_trees:
        models[str(n)] = RandomForestRegressor(n_estimators=n)
    return models


# Define function to evaluate a single model using cross-validation
def evaluate(model, X, y):

    # RepeatedStratifiedKFold is usually for binary or multi-class labels
    # - ref link: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold
    # so using RepeatedKFold instead
    cross_val = RepeatedKFold(n_splits=10, n_repeats=15, random_state=1)
    # Run evaluation process & collect cv scores
    # Since the estimator is a regressor (RandomForestRegressor),
    # using neg_mean_squared_error metric
    # n_jobs = -1 meaning using all processors to run jobs in parallel
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=cross_val, n_jobs=-1)
    return scores


# Evaluate results
# Run models with different RepeatedKFold & different no. of trees
# with results shown as diff. trees with calculated mean cv scores & std

# Obtain diff. models with diff. trees via models function
models = models()

# Create empty lists for results & names
results, names = list(), list()

# Create a for loop to iterate through the list of diff. models
for name, model in models.items():
    # Run the cross validation scores via evaluate function
    scores = evaluate(model, X, y)
    # Collect results
    results.append(scores)
    # Collect names (different no. of trees)
    names.append(name)
    # Show the average mean squared errors and corresponding standard deviations 
    # for each model with diff. no. of trees
    print((name, mean(scores), std(scores)))
('50', -1.6470594650953017, 1.6444082604560304)
('100', -1.6995136024743887, 1.6797340671624852)
('250', -1.6716290617106646, 1.6236808789148038)
('500', -1.645981936868625, 1.615445700037851)
('1000', -1.6532678610618743, 1.604259597928101)

The negated version of the mean squared error (neg_mean_squared_error) came from how the scoring parameter was written in the scikit-learn source code - scorers followed the convention that higher return values were better than lower ones, so error (loss) metrics were negated to fit both scoring and loss functions under the same convention (links provided below for further explanations). All scoring metrics for scikit-learn could be accessed here.

Reference links to help with understanding neg_mean_squared_error:

  1. scikit-learn source code

  2. StackOverflow answer

Also, the random forest algorithm was stochastic in nature, meaning that every time hyperparameter tuning took place, it would generate different scores due to the random bootstrap sampling. The best approach to evaluate model performance during the cross-validation process was to use the average outcome from several runs of cross-validation and then fit a final model with the chosen hyperparameters, or alternatively to prepare several final models and average their predictions instead.

Below was a boxplot plotted using Matplotlib, showing the differences in the distributions of the cross-validation scores (negated mean squared errors) between the different numbers of trees.

plt.boxplot(results, labels=names, showmeans=True)
plt.show()

To plot this in Seaborn, I had to prepare the data slightly differently to achieve a different version of the boxplot. Matplotlib was a bit more straightforward to use without these steps.

I also used natural sort to sort the numerical values (GitHub repository). Otherwise, using sort_values() alone would sort the numbers in lexicographical order (i.e. as strings, character by character), which would not show the tree numbers in ascending order.

# Combine results & names lists into dataframe
cv_results = pd.DataFrame(results, index = [names])
# Reset index and rename the number of trees column
cv_results = cv_results.reset_index().rename(columns={"level_0": "Number_of_trees"})
# Melt the dataframe by number of trees column
cv_results = cv_results.melt(id_vars="Number_of_trees")
# Sort by the number of trees column
cv_results = cv_results.sort_values(
    by="Number_of_trees",
    key=lambda x: np.argsort(index_natsorted(cv_results["Number_of_trees"]))
)
# Seaborn boxplot
sns.boxplot(cv_results, x="Number_of_trees", y="value", showmeans=True)
<Axes: xlabel='Number_of_trees', ylabel='value'>

The Seaborn boxplot shown should be very similar to the Matplotlib one.

Other hyperparameters that could be tuned included:

  • tree depths (max_depth)

  • number of samples (max_samples)

  • number of features (max_features) - I didn’t use RDKit to generate the full set of molecular features for this post (the Datamol version was used instead, to keep the post at a readable length), which would have provided at least around 209 features, but I think that might be a better option when doing cross-validations for model evaluations

  • number of leaf nodes (max_leaf_nodes)

I’ve decided not to code for these other hyperparameters in the cross-validation step due to the length of this post (the function code used in the cross-validation above could be further adapted to cater for the other hyperparameters mentioned here - a small sketch is shown below), but they should be looked into if doing full-scale and comprehensive ML using the ensemble random forest algorithm.
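As a small sketch of how the cross-validation scaffold above could be adapted (re-using the evaluate() function, X and y defined earlier; the models_depth() helper below was only added here for illustration), tree depth could be varied instead of the number of trees:

# Sketch: tuning max_depth instead of n_estimators, re-using evaluate(), X & y from above
def models_depth():
    models_d = dict()
    # None = no depth limit (trees grown until leaves are pure or very small)
    for depth in [2, 3, 5, 10, None]:
        models_d[str(depth)] = RandomForestRegressor(max_depth=depth)
    return models_d

for name, model in models_depth().items():
    scores = evaluate(model, X, y)
    print((name, mean(scores), std(scores)))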


Final words

Random forest was known to be a black-box ML algorithm (Bruce, Bruce, and Gedeck 2020), which was completely different from the white-box ML style revealed in decision tree graphs. Feature importances were therefore crucial to shed some light on, and peel back some layers of, the black-box nature of random forest by ranking the features that contributed towards model accuracy. Cross-validation was also vital to avoid over-fitting (more applicable to tree depth), although in some other cases (e.g. the number of trees) it was mentioned that the model was unlikely to overfit. Other options available in scikit-learn’s ensemble methods that I didn’t get time to try were voting classifiers/regressors and stacking models to reduce bias, which might be very useful in other cases.

A few things I’ve thought of that could improve what I did here: I should really look for a different set of testing data, rather than using the max phase splits, which was not that ideal. However, as a lot of us are aware, good drug discovery data are hard to come by (a long-standing and complicated problem), so I’ll probably need some luck looking for a different set of drug discovery data later. Another approach I could try would be to use RandomForestClassifier() for max phase prediction of these small molecules, rather than making pKi value predictions. This might involve re-labelling the max phases of these compounds into binary or class labels, after which I could use the imbalanced-learn package to try and alleviate the imbalanced dataset problem. Nevertheless, I had some fun working on this post and learnt a lot while doing it, and I hope some readers might find it helpful or informative at least.
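As a parting sketch of the RandomForestClassifier() idea mentioned above (purely illustrative, with a hypothetical binary re-labelling of max phase 4 versus the rest, and not something actually run for this post), it might look something like this with the imbalanced-learn package:

# Sketch only - re-label max phase as binary classes & re-balance with imbalanced-learn
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier

data_clf = data.copy()
# Hypothetical re-labelling: max phase 4 -> 1, everything else (incl. "null") -> 0
data_clf["max_phase_label"] = (data_clf["max_phase"] == 4).astype(int)

X_clf = data_clf[X_mp4_df.columns]
y_clf = data_clf["max_phase_label"]

# Random over-sampling of the minority class (max phase 4 compounds)
X_res, y_res = RandomOverSampler(random_state=1).fit_resample(X_clf, y_clf)

rf_clf_mp = RandomForestClassifier(random_state=1).fit(X_res, y_res)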


Acknowledgement

I’d like to thank all the authors, developers and contributors who worked towards all of the open-source packages or libraries used in this post. I’d also like to thank all of the other senior cheminformatics and ML practitioners who were sharing their work and knowledge online.

References

Breiman, Leo. 1998. “Arcing Classifier (with Discussion and a Rejoinder by the Author).” The Annals of Statistics 26 (3). https://doi.org/10.1214/aos/1024691079.
———. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/a:1010933404324.
Bruce, Peter, Andrew Bruce, and Peter Gedeck. 2020. Practical Statistics for Data Scientists. https://www.oreilly.com/library/view/practical-statistics-for/9781492072935/.
Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. “SMOTE: Synthetic Minority over-Sampling Technique.” Journal of Artificial Intelligence Research 16 (June): 321–57. https://doi.org/10.1613/jair.953.
Kunz, Nicholas. 2020. SMOGN: Synthetic Minority over-Sampling Technique for Regression with Gaussian Noise (version v0.1.2). PyPI. https://pypi.org/project/smogn/.
Lundberg, Scott M., Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M. Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. 2020. “From Local Explanations to Global Understanding with Explainable AI for Trees.” Nature Machine Intelligence 2 (1): 2522–5839.
Shapley, Lloyd S et al. 1953. “A Value for n-Person Games.”
Torgo, Luís, Rita P. Ribeiro, Bernhard Pfahringer, and Paula Branco. 2013. “SMOTE for Regression.” In, 378–89. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-40669-0_33.
Wu, Wenglei, Nicholas Kunz, and Paula Branco. 2022. “ImbalancedLearningRegression-a Python Package to Tackle the Imbalanced Regression Problem.” In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 645–48. Springer.