Molecular visualisation (Molviz) web application

Interactive data table - part 1

Python
Datamol
Pandas
Polars
itables
Author

Jennifer HY Lin

Published

August 10, 2023

Introduction

This time I’m trying to build a web application in the hope of contributing my two cents towards narrowing the gap between the computational and laboratory sides in a drug discovery (or chemistry-related) setting. There are definitely many other useful web applications out there for similar uses, but I guess each one has its own unique features.

This app is mostly aimed at lab chemists who do not use any programming code in their work, who would like to quickly view compounds while working in the lab, and who want to highlight compound substructures during discussions or when brainstorming ideas for compound synthesis. Importantly, this app can exist outside a Jupyter notebook environment; only an internet connection is required to access it.

This is the first part, ahead of the next post which will showcase the actual app. This part involves some data cleaning, although it is not the major focus of this post. This is not to say that data cleaning is unimportant; on the contrary, it is fundamental to any work involving data, since it ensures reasonable data quality, which can in turn influence decisions or results. I have also collapsed the code sections below to make the post easier to read (to access code used for each section, click on the “Code” links).


Code and explanations

The first part, building the interactive table, was surprisingly simple when I did it. On a random day I came across a LinkedIn post about itables being integrated with Shiny for Python plus Quarto. It came at the right time because I was actually trying to build this app. I quickly thought about incorporating it into the rest of the app so that users could refer back to the data quickly while visualising compound images. The code and explanations for building an interactive table from Pandas and Polars dataframes are provided below.

To install itables, visit here for instructions and also for other information about supported notebook editors.

Code
# Import dataframe libraries
import pandas as pd
import polars as pl

# Import Datamol
import datamol as dm

# Import itables
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)


# Option 1: Reading df_ai.csv as a pandas dataframe
#df = pd.read_csv("df_ai.csv")
#df.head()

# Option 2: Reading df_ai.csv as a polars dataframe
df = pl.read_csv("df_ai.csv")
#df.head()


# Below was the code I used in my last post to fix the missing SMILES for neomycin 
# - the version below was edited due to recent updates in Polars
# Canonical SMILES for neomycin was extracted from PubChem 
# (https://pubchem.ncbi.nlm.nih.gov/compound/Neomycin)

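# Note: newer Polars releases renamed str.lengths() to str.len_bytes()/str.len_chars()
# and keep_name() to name.keep(); adjust the two calls below if they raise errors
df = df.with_columns([
    # Where the SMILES string is empty, fill in the neomycin SMILES
    pl.when(pl.col("Smiles").str.lengths() == 0)
    .then(pl.lit("C1C(C(C(C(C1N)OC2C(C(C(C(O2)CN)O)O)N)OC3C(C(C(O3)CO)OC4C(C(C(C(O4)CN)O)O)N)O)O)N"))
    .otherwise(pl.col("Smiles"))
    # Keep the original column name ("Smiles") instead of the literal's name
    .keep_name()
])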

#df.head()
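
For readers who went with Option 1 (Pandas) instead, a rough equivalent of the fix above might look like the sketch below. The toy dataframe and its two rows are made up for illustration (they are not the real df_ai.csv), and the check covers both empty strings and missing values, since Pandas may read an empty CSV field as NaN.

```python
import pandas as pd

# Toy stand-in for df_ai.csv (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    "Name": ["ETHANOL", "NEOMYCIN"],
    "Smiles": ["CCO", ""],
})

# Canonical SMILES for neomycin, as used in the Polars version above
neomycin_smiles = (
    "C1C(C(C(C(C1N)OC2C(C(C(C(O2)CN)O)O)N)"
    "OC3C(C(C(O3)CO)OC4C(C(C(C(O4)CN)O)O)N)O)O)N"
)

# Flag rows whose SMILES is either missing (NaN) or an empty string
missing = toy["Smiles"].isna() | (toy["Smiles"].str.len() == 0)

# Fill only the flagged rows, leaving the other SMILES untouched
toy.loc[missing, "Smiles"] = neomycin_smiles
```

This sketch assumes, as in the original data, that neomycin is the only compound with a missing SMILES; with several gaps, each row would need its own replacement value.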

The Polars dataframe library was designed without an index in mind (unlike Pandas), so the itables library did not work on my specific Polars dataframe, which needed an index column to be shown (note: all other Polars dataframes should work fine with itables without an index column!).

However, to show row counts in a Polars dataframe, we could use with_row_count(), which adds a row-number column starting from 0, and this would show up in a Jupyter environment as usual. A small code example is shown below.

Code
# Uncomment below to run
#df = df.with_row_count()

Then I converted the Polars dataframe into a Pandas one (this could be avoided completely by starting with Pandas in the first place).

Code
df = df.to_pandas()

Then I added the “_preprocess” function (adapted from Datamol) to convert SMILES[1] into other molecular representations such as standardised SMILES (pre-processed and cleaned SMILES), SELFIES[2], InChI[3] and InChI keys - just to provide extra information for further uses if needed. The standardised SMILES generated here would then be used for generating the molecule images later (in part 2).

Code
# Pre-process molecules using _preprocess function - adapted from datamol.io

smiles_column = "Smiles"

dm.disable_rdkit_log()

def _preprocess(row):
    mol = dm.to_mol(row[smiles_column], ordered=True)
    mol = dm.fix_mol(mol)
    mol = dm.sanitize_mol(mol, sanifix=True, charge_neutral=False)
    mol = dm.standardize_mol(
        mol,
        disconnect_metals=False,
        normalize=True,
        reionize=True,
        uncharge=False,
        stereo=True,
    )

    row["standard_smiles"] = dm.standardize_smiles(dm.to_smiles(mol))
    row["selfies"] = dm.to_selfies(mol)
    row["inchi"] = dm.to_inchi(mol)
    row["inchikey"] = dm.to_inchikey(mol)
    return row

# Apply the _preprocess function row by row to the prepared dataframe
df = df.apply(_preprocess, axis=1)
#df.head()
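
To show the mechanism behind the row-wise apply used above without needing Datamol or RDKit, here is a minimal sketch with a toy function. The dataframe, the "batch" column and the _add_length helper are all hypothetical; the point is only that a function which adds entries to each row and returns it ends up adding new columns to the dataframe.

```python
import pandas as pd

# Toy dataframe (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "batch": [1, 2],
    "Smiles": ["CCO", "c1ccccc1"],
})


def _add_length(row):
    # Each row arrives as a Series; a new entry becomes a new column
    row["smiles_length"] = len(row["Smiles"])
    return row


# axis=1 passes one row at a time to the function
toy = toy.apply(_add_length, axis=1)
```

The same pattern is what lets _preprocess attach standard_smiles, selfies, inchi and inchikey to every compound in one pass.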

The next step was to keep the index column of the Pandas dataframe as an actual column (there was a reason for this, mainly for the app).

Code
# Convert index of Pandas df into a column
df = df.reset_index()
#df.head()
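
A tiny sketch of what reset_index() does, using a made-up two-row dataframe: the default integer index is moved into a regular column named "index", which is then available like any other column.

```python
import pandas as pd

# Toy dataframe (hypothetical names, for illustration only)
toy = pd.DataFrame({"Name": ["A", "B"]})

# Move the default 0-based index into an ordinary column called "index"
toy = toy.reset_index()

print(toy.columns.tolist())
```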


Interactive data table
  • An interactive table of all the prescription-only antibiotics from ChEMBL is shown below

  • Scroll the table from left to right to see the SMILES, standardised SMILES, SELFIES, InChI, InChI keys for each compound

  • Use the search box to enter compound names to search for antibiotics and move between different pages when needed

Code
df
[Interactive table output; columns: index, Name, Smiles, standard_smiles, selfies, inchi, inchikey]


# Saving cleaned df_ai.csv as a new .csv file (for app_image_x.py - script to build the web app)
# df = pl.from_pandas(df)
# df.write_csv("df_ai_cleaned.csv", separator = ",")  # older Polars versions used sep instead of separator


Options for app deployment

Since I had a lot of fun deploying my previous app with Shinylive, I thought I might try the same this time: deploying the Molviz app as a Shinylive app in Quarto. However, it didn’t work as expected. The reason is that RDKit isn’t written in pure Python (it is written in Python and C++), so there is no pure Python wheel file available on PyPI - this link may provide some answers relating to this. Essentially, packages or libraries used in the app need to be compatible with Pyodide in order for the Shinylive app to work. So the most likely option for deploying this app now would be Shinyapps.io or HuggingFace, which I read about recently.
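
As a rough way to check this kind of compatibility at runtime, the sketch below uses only the standard library. It relies on Pyodide reporting "emscripten" as its platform; the two helper names are my own and not part of Shiny, Shinylive or RDKit.

```python
import importlib.util
import sys


def running_in_pyodide() -> bool:
    # Pyodide builds of CPython report "emscripten" as the platform
    return sys.platform == "emscripten"


def is_importable(package_name: str) -> bool:
    # True if the top-level package can be found in the current environment
    return importlib.util.find_spec(package_name) is not None


if running_in_pyodide() and not is_importable("rdkit"):
    print("rdkit cannot be imported in this Pyodide environment")
```

A check like this only reports whether a package is present; it does not make an incompatible package work in the browser.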


Next post

Code and explanations for the actual Molviz app will be detailed in the next post. To access the full code and files used so far, please visit this repository link.

Footnotes

  1. Simplified Molecular Input Line Entry System

  2. SELF-referencIng Embedded Strings

  3. International Chemical Identifier