Brief introduction
Since I had a lot of fun building a Shiny app in R last time, I decided to build another one, this time in Python. In this post, I’ll talk about the data wrangling needed to prepare the final dataset for a Shinylive app in Python. The deployment of the Shinylive app, and how to access it, will be shown in a separate post after this one.
Source of data
The dataset used for this Shiny app in Python was from PubChem (link here). There were a total of 631 compounds at the time I downloaded them as a .csv file, along with their relevant compound data. I picked this dataset more or less at random, as the focus was on app building, but it was nice to see an interactive web app being built and used for a domain such as pharmaceutical research.
Import Polars
The Polars dataframe library was used again this time.
import polars as pl
Reading .csv file
pc = pl.read_csv("pubchem.csv")
pc.head()
cid | cmpdname | cmpdsynonym | mw | mf | polararea | complexity | xlogp | heavycnt | hbonddonor | hbondacc | rotbonds | inchi | isosmiles | canonicalsmiles | inchikey | iupacname | exactmass | monoisotopicmass | charge | covalentunitcnt | isotopeatomcnt | totalatomstereocnt | definedatomstereocnt | undefinedatomstereocnt | totalbondstereocnt | definedbondstereocnt | undefinedbondstereocnt | pclidcnt | gpidcnt | meshheadings | annothits | annothitcnt | aids | cidcdate | sidsrcname | depcatg | annotation |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | f64 | str | f64 | f64 | str | i64 | i64 | i64 | i64 | str | str | str | str | str | f64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | str | str | i64 | str | i64 | str | str | str |
5280453 | "Calcitriol" | "calcitriol|322... | 416.6 | "C27H44O3" | 60.7 | 688.0 | "5.100" | 30 | 3 | 3 | 6 | "InChI=1S/C27H4... | "C[C@H](CCCC(C)... | "CC(CCCC(C)(C)O... | "GMRQFYUYWCNGIN... | "(1R,3S,5Z)-5-[... | 416.329 | 416.329 | 0 | 1 | 0 | 6 | 6 | 0 | 2 | 2 | 0 | 22311 | 46029 | "Calcitriol" | "Biological Tes... | 12 | "485|631|731|78... | 20040916 | "A2B Chem|AA BL... | "Chemical Vendo... | "COVID-19, COVI... |
9962735 | "Ubiquinol" | "ubiquinol|992-... | 865.4 | "C59H92O4" | 58.9 | 1600.0 | "20.200" | 63 | 2 | 4 | 31 | "InChI=1S/C59H9... | "CC1=C(C(=C(C(=... | "CC1=C(C(=C(C(=... | "QNTNKSLOFHEFPK... | "2-[(2E,6E,10E,... | 864.7 | 864.7 | 0 | 1 | 0 | 0 | 0 | 0 | 9 | 9 | 0 | 2732 | 21358 | "NULL" | "Chemical and P... | 7 | "NULL" | 20061025 | "001Chemical|A2... | "Chemical Vendo... | "COVID-19, COVI... |
5961 | "Glutamine" | "L-glutamine|gl... | 146.14 | "C5H10N2O3" | 106.0 | 146.0 | "-3.100" | 10 | 3 | 4 | 4 | "InChI=1S/C5H10... | "C(CC(=O)N)[C@@... | "C(CC(=O)N)C(C(... | "ZDXPYRJPNDTMRX... | "(2S)-2,5-diami... | 146.069 | 146.069 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 88218 | 399 | "Glutamine" | "Biological Tes... | 12 | "422|429|436|54... | 20040916 | "001Chemical|3B... | "Chemical Vendo... | "COVID-19, COVI... |
2244 | "Aspirin" | "aspirin|ACETYL... | 180.16 | "C9H8O4" | 63.6 | 212.0 | "1.200" | 13 | 1 | 4 | 3 | "InChI=1S/C9H8O... | "CC(=O)OC1=CC=C... | "CC(=O)OC1=CC=C... | "BSYNRYMUTXBXSQ... | "2-acetyloxyben... | 180.042 | 180.042 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 127012 | 364455 | "Aspirin" | "Biological Tes... | 12 | "1|3|9|15|19|21... | 20040916 | "001Chemical|3B... | "Chemical Vendo... | "COVID-19, COVI... |
457 | "1-Methylnicoti... | "1-methylnicoti... | 137.16 | "C7H9N2O+" | 47.0 | 136.0 | "-0.100" | 10 | 1 | 1 | 1 | "InChI=1S/C7H8N... | "C[N+]1=CC=CC(=... | "C[N+]1=CC=CC(=... | "LDHMAVIPBRSVRG... | "1-methylpyridi... | 137.071 | 137.071 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 310 | 674 | "NULL" | "Biological Tes... | 8 | "61001|61002|14... | 20040916 | "001Chemical|3B... | "Chemical Vendo... | "COVID-19, COVI... |
Quick look at the data
I decided to comment out the code below to keep the post at a reasonable length, but these commands were very handy for a quick glimpse of the data content.
# Quick overview of the variables in each column in the dataset
# Uncomment line below if needed to run
#print(pc.glimpse())
# Quick look at all column names
# Uncomment line below if needed to run
#pc.columns
Check for nulls in dataset
pc.null_count()
cid | cmpdname | cmpdsynonym | mw | mf | polararea | complexity | xlogp | heavycnt | hbonddonor | hbondacc | rotbonds | inchi | isosmiles | canonicalsmiles | inchikey | iupacname | exactmass | monoisotopicmass | charge | covalentunitcnt | isotopeatomcnt | totalatomstereocnt | definedatomstereocnt | undefinedatomstereocnt | totalbondstereocnt | definedbondstereocnt | undefinedbondstereocnt | pclidcnt | gpidcnt | meshheadings | annothits | annothitcnt | aids | cidcdate | sidsrcname | depcatg | annotation |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Change column names as needed
# Change column names
pc_cov = pc.rename(
{"cmpdname": "Compound name",
"cmpdsynonym": "Synonyms",
"mw": "Molecular weight",
"mf": "Molecular formula",
"polararea": "Polar surface area",
"complexity": "Complexity",
"xlogp": "Partition coefficients",
"heavycnt": "Heavy atom count",
"hbonddonor": "Hydrogen bond donor count",
"hbondacc": "Hydrogen bond acceptor count",
"rotbonds": "Rotatable bond count",
"exactmass": "Exact mass",
"monoisotopicmass": "Monoisotopic mass",
"charge": "Formal charge",
"covalentunitcnt": "Covalently-bonded unit count",
"isotopeatomcnt": "Isotope atom count",
"totalatomstereocnt": "Total atom stereocenter count",
"definedatomstereocnt": "Defined atom stereocenter count",
"undefinedatomstereocnt": "Undefined atoms stereocenter count",
"totalbondstereocnt": "Total bond stereocenter count",
"definedbondstereocnt": "Defined bond stereocenter count",
"undefinedbondstereocnt": "Undefined bond stereocenter count",
"meshheadings": "MeSH headings"
}
)
pc_cov.head()
cid | Compound name | Synonyms | Molecular weight | Molecular formula | Polar surface area | Complexity | Partition coefficients | Heavy atom count | Hydrogen bond donor count | Hydrogen bond acceptor count | Rotatable bond count | inchi | isosmiles | canonicalsmiles | inchikey | iupacname | Exact mass | Monoisotopic mass | Formal charge | Covalently-bonded unit count | Isotope atom count | Total atom stereocenter count | Defined atom stereocenter count | Undefined atoms stereocenter count | Total bond stereocenter count | Defined bond stereocenter count | Undefined bond stereocenter count | pclidcnt | gpidcnt | MeSH headings | annothits | annothitcnt | aids | cidcdate | sidsrcname | depcatg | annotation |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | f64 | str | f64 | f64 | str | i64 | i64 | i64 | i64 | str | str | str | str | str | f64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | str | str | i64 | str | i64 | str | str | str |
5280453 | "Calcitriol" | "calcitriol|322... | 416.6 | "C27H44O3" | 60.7 | 688.0 | "5.100" | 30 | 3 | 3 | 6 | "InChI=1S/C27H4... | "C[C@H](CCCC(C)... | "CC(CCCC(C)(C)O... | "GMRQFYUYWCNGIN... | "(1R,3S,5Z)-5-[... | 416.329 | 416.329 | 0 | 1 | 0 | 6 | 6 | 0 | 2 | 2 | 0 | 22311 | 46029 | "Calcitriol" | "Biological Tes... | 12 | "485|631|731|78... | 20040916 | "A2B Chem|AA BL... | "Chemical Vendo... | "COVID-19, COVI... |
9962735 | "Ubiquinol" | "ubiquinol|992-... | 865.4 | "C59H92O4" | 58.9 | 1600.0 | "20.200" | 63 | 2 | 4 | 31 | "InChI=1S/C59H9... | "CC1=C(C(=C(C(=... | "CC1=C(C(=C(C(=... | "QNTNKSLOFHEFPK... | "2-[(2E,6E,10E,... | 864.7 | 864.7 | 0 | 1 | 0 | 0 | 0 | 0 | 9 | 9 | 0 | 2732 | 21358 | "NULL" | "Chemical and P... | 7 | "NULL" | 20061025 | "001Chemical|A2... | "Chemical Vendo... | "COVID-19, COVI... |
5961 | "Glutamine" | "L-glutamine|gl... | 146.14 | "C5H10N2O3" | 106.0 | 146.0 | "-3.100" | 10 | 3 | 4 | 4 | "InChI=1S/C5H10... | "C(CC(=O)N)[C@@... | "C(CC(=O)N)C(C(... | "ZDXPYRJPNDTMRX... | "(2S)-2,5-diami... | 146.069 | 146.069 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 88218 | 399 | "Glutamine" | "Biological Tes... | 12 | "422|429|436|54... | 20040916 | "001Chemical|3B... | "Chemical Vendo... | "COVID-19, COVI... |
2244 | "Aspirin" | "aspirin|ACETYL... | 180.16 | "C9H8O4" | 63.6 | 212.0 | "1.200" | 13 | 1 | 4 | 3 | "InChI=1S/C9H8O... | "CC(=O)OC1=CC=C... | "CC(=O)OC1=CC=C... | "BSYNRYMUTXBXSQ... | "2-acetyloxyben... | 180.042 | 180.042 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 127012 | 364455 | "Aspirin" | "Biological Tes... | 12 | "1|3|9|15|19|21... | 20040916 | "001Chemical|3B... | "Chemical Vendo... | "COVID-19, COVI... |
457 | "1-Methylnicoti... | "1-methylnicoti... | 137.16 | "C7H9N2O+" | 47.0 | 136.0 | "-0.100" | 10 | 1 | 1 | 1 | "InChI=1S/C7H8N... | "C[N+]1=CC=CC(=... | "C[N+]1=CC=CC(=... | "LDHMAVIPBRSVRG... | "1-methylpyridi... | 137.071 | 137.071 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 310 | 674 | "NULL" | "Biological Tes... | 8 | "61001|61002|14... | 20040916 | "001Chemical|3B... | "Chemical Vendo... | "COVID-19, COVI... |
Definitions of molecular properties in this PubChem dataset
The definitions for some of the column names are shown below; they were mainly derived and adapted from PubChem:
Note: please refer to the PubChem documentation for full definitions
Molecular weight - molecular mass of compounds measured in daltons
Topological polar surface area - an estimate of the polar surface area of a molecule (i.e. the surface sum over polar atoms in a molecule), with units in angstrom squared (Å²)
Complexity - complexity rating for compounds, based on Bertz/Hendrickson/Ihlenfeldt formula as a rough estimation of how complex a compound was structurally
Partition coefficients (xlogp) - predicted octanol-water partition coefficient as a measure of the hydrophilicity or hydrophobicity of a molecule
Heavy atom count - number of heavy atoms (i.e. non-hydrogen atoms) in the compound
Hydrogen bond donor count - number of hydrogen bond donors in the compound
Hydrogen bond acceptor count - number of hydrogen bond acceptors in the compound
Rotatable bond count - number of rotatable bonds, where a rotatable bond was defined as any single-order non-ring bond whose atoms on either side were in turn bound to non-terminal heavy (i.e. non-hydrogen) atoms. Rotation around the bond axis would change the overall molecular shape and generate conformers which could be distinguished by standard spectroscopic methods
Exact mass - exact mass of an isotopic species, obtained by summing masses of individual isotopes of the molecule
Monoisotopic mass - sum of the masses of atoms in a molecule, using unbound, ground-state, rest mass of principal (or most abundant) isotope for each element instead of isotopic average mass
Formal charge - the difference between the number of valence electrons of each atom and the number of electrons the atom was associated with, assuming any shared electrons were equally shared between the two bonded atoms
Covalently-bonded unit count - number of covalently-bonded units in the compound, where a unit was a group of atoms connected by covalent bonds, ignoring other bond types (or a single atom without covalent bonds)
Isotope atom count - number of atoms that were isotopes other than the most abundant one for the corresponding chemical element. Isotopes were variants of a chemical element that differed in neutron number
Defined atom stereocenter count - an atom stereocenter (or chiral center) was where an atom was attached to 4 different types of atoms or groups of atoms in a tetrahedral arrangement, in either the (R)- or (S)- configuration. Some compounds, e.g. racemic mixtures, could have undefined atom stereocenters, where the (R/S)-configuration was not specifically defined. Defined atom stereocenter count was the number of atom stereocenters where configurations were specifically defined
Undefined atoms stereocenter count - number of atom stereocenters where configurations were not specifically defined
Defined bond stereocenter count - a bond stereocenter (or non-rotatable bond) was where two atoms could have different arrangements, e.g. in the cis- and trans- forms of butene around its double bond. Some compounds could have an undefined bond stereocenter (stereochemistry not specifically defined). Defined bond stereocenter count was the number of bond stereocenters where configurations were specifically defined
Undefined bond stereocenter count - number of bond stereocenters where configurations were not specifically defined
Convert data type for selected columns
# Convert data type - only for partition coefficients column (rest were okay)
# with_columns() is used here, as with_column() was removed in later Polars releases
pc_cov = pc_cov.with_columns(pl.col("Partition coefficients").cast(pl.Float64, strict = False))
pc_cov.head()
cid | Compound name | Synonyms | Molecular weight | Molecular formula | Polar surface area | Complexity | Partition coefficients | Heavy atom count | Hydrogen bond donor count | Hydrogen bond acceptor count | Rotatable bond count | inchi | isosmiles | canonicalsmiles | inchikey | iupacname | Exact mass | Monoisotopic mass | Formal charge | Covalently-bonded unit count | Isotope atom count | Total atom stereocenter count | Defined atom stereocenter count | Undefined atoms stereocenter count | Total bond stereocenter count | Defined bond stereocenter count | Undefined bond stereocenter count | pclidcnt | gpidcnt | MeSH headings | annothits | annothitcnt | aids | cidcdate | sidsrcname | depcatg | annotation |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | f64 | str | f64 | f64 | f64 | i64 | i64 | i64 | i64 | str | str | str | str | str | f64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | str | str | i64 | str | i64 | str | str | str |
5280453 | "Calcitriol" | "calcitriol|322... | 416.6 | "C27H44O3" | 60.7 | 688.0 | 5.1 | 30 | 3 | 3 | 6 | "InChI=1S/C27H4... | "C[C@H](CCCC(C)... | "CC(CCCC(C)(C)O... | "GMRQFYUYWCNGIN... | "(1R,3S,5Z)-5-[... | 416.329 | 416.329 | 0 | 1 | 0 | 6 | 6 | 0 | 2 | 2 | 0 | 22311 | 46029 | "Calcitriol" | "Biological Tes... | 12 | "485|631|731|78... | 20040916 | "A2B Chem|AA BL... | "Chemical Vendo... | "COVID-19, COVI... |
9962735 | "Ubiquinol" | "ubiquinol|992-... | 865.4 | "C59H92O4" | 58.9 | 1600.0 | 20.2 | 63 | 2 | 4 | 31 | "InChI=1S/C59H9... | "CC1=C(C(=C(C(=... | "CC1=C(C(=C(C(=... | "QNTNKSLOFHEFPK... | "2-[(2E,6E,10E,... | 864.7 | 864.7 | 0 | 1 | 0 | 0 | 0 | 0 | 9 | 9 | 0 | 2732 | 21358 | "NULL" | "Chemical and P... | 7 | "NULL" | 20061025 | "001Chemical|A2... | "Chemical Vendo... | "COVID-19, COVI... |
5961 | "Glutamine" | "L-glutamine|gl... | 146.14 | "C5H10N2O3" | 106.0 | 146.0 | -3.1 | 10 | 3 | 4 | 4 | "InChI=1S/C5H10... | "C(CC(=O)N)[C@@... | "C(CC(=O)N)C(C(... | "ZDXPYRJPNDTMRX... | "(2S)-2,5-diami... | 146.069 | 146.069 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 88218 | 399 | "Glutamine" | "Biological Tes... | 12 | "422|429|436|54... | 20040916 | "001Chemical|3B... | "Chemical Vendo... | "COVID-19, COVI... |
2244 | "Aspirin" | "aspirin|ACETYL... | 180.16 | "C9H8O4" | 63.6 | 212.0 | 1.2 | 13 | 1 | 4 | 3 | "InChI=1S/C9H8O... | "CC(=O)OC1=CC=C... | "CC(=O)OC1=CC=C... | "BSYNRYMUTXBXSQ... | "2-acetyloxyben... | 180.042 | 180.042 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 127012 | 364455 | "Aspirin" | "Biological Tes... | 12 | "1|3|9|15|19|21... | 20040916 | "001Chemical|3B... | "Chemical Vendo... | "COVID-19, COVI... |
457 | "1-Methylnicoti... | "1-methylnicoti... | 137.16 | "C7H9N2O+" | 47.0 | 136.0 | -0.1 | 10 | 1 | 1 | 1 | "InChI=1S/C7H8N... | "C[N+]1=CC=CC(=... | "C[N+]1=CC=CC(=... | "LDHMAVIPBRSVRG... | "1-methylpyridi... | 137.071 | 137.071 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 310 | 674 | "NULL" | "Biological Tes... | 8 | "61001|61002|14... | 20040916 | "001Chemical|3B... | "Chemical Vendo... | "COVID-19, COVI... |
Select columns for data visualisations
The idea was to keep mainly the numerical columns for some data visualisations later, plus the compound names for labelling. So I dropped the identifier and annotation columns, most of which were text (string) columns.
# Drop unused columns in preparation for data visualisations
pc_cov = pc_cov.drop([
    "cid",
"Synonyms",
"Molecular formula",
"inchi",
"isosmiles",
"canonicalsmiles",
"inchikey",
"iupacname",
"pclidcnt",
"gpidcnt",
"MeSH headings",
"annothits",
"annothitcnt",
"aids",
"cidcdate",
"sidsrcname",
"depcatg",
"annotation"
])
pc_cov.head()
Compound name | Molecular weight | Polar surface area | Complexity | Partition coefficients | Heavy atom count | Hydrogen bond donor count | Hydrogen bond acceptor count | Rotatable bond count | Exact mass | Monoisotopic mass | Formal charge | Covalently-bonded unit count | Isotope atom count | Total atom stereocenter count | Defined atom stereocenter count | Undefined atoms stereocenter count | Total bond stereocenter count | Defined bond stereocenter count | Undefined bond stereocenter count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | f64 | f64 | f64 | f64 | i64 | i64 | i64 | i64 | f64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
"Calcitriol" | 416.6 | 60.7 | 688.0 | 5.1 | 30 | 3 | 3 | 6 | 416.329 | 416.329 | 0 | 1 | 0 | 6 | 6 | 0 | 2 | 2 | 0 |
"Ubiquinol" | 865.4 | 58.9 | 1600.0 | 20.2 | 63 | 2 | 4 | 31 | 864.7 | 864.7 | 0 | 1 | 0 | 0 | 0 | 0 | 9 | 9 | 0 |
"Glutamine" | 146.14 | 106.0 | 146.0 | -3.1 | 10 | 3 | 4 | 4 | 146.069 | 146.069 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
"Aspirin" | 180.16 | 63.6 | 212.0 | 1.2 | 13 | 1 | 4 | 3 | 180.042 | 180.042 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
"1-Methylnicoti... | 137.16 | 47.0 | 136.0 | -0.1 | 10 | 1 | 1 | 1 | 137.071 | 137.071 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Quick summary statistics of columns
# Overall descriptive statistics of kept columns
pc_cov.describe()
describe | Compound name | Molecular weight | Polar surface area | Complexity | Partition coefficients | Heavy atom count | Hydrogen bond donor count | Hydrogen bond acceptor count | Rotatable bond count | Exact mass | Monoisotopic mass | Formal charge | Covalently-bonded unit count | Isotope atom count | Total atom stereocenter count | Defined atom stereocenter count | Undefined atoms stereocenter count | Total bond stereocenter count | Defined bond stereocenter count | Undefined bond stereocenter count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
"count" | "631" | 631.0 | 631.0 | 631.0 | 631.0 | 631.0 | 631.0 | 631.0 | 631.0 | 631.0 | 631.0 | 631.0 | 631.0 | 631.0 | 631.0 | 631.0 | 631.0 | 631.0 | 631.0 | 631.0 |
"null_count" | "0" | 0.0 | 0.0 | 0.0 | 173.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
"mean" | null | 549.539675 | 163.915368 | 864.755626 | 2.25917 | 37.770206 | 4.066561 | 9.210777 | 9.518225 | 549.095022 | 549.06013 | -0.004754 | 1.578447 | 0.006339 | 4.017433 | 3.551506 | 0.465927 | 0.381933 | 0.343899 | 0.038035 |
"std" | null | 455.236826 | 192.256415 | 1000.220379 | 3.926459 | 31.821967 | 6.348004 | 8.694184 | 15.393131 | 455.064211 | 454.958033 | 0.358537 | 1.610416 | 0.079429 | 6.128363 | 5.787792 | 2.364089 | 1.181171 | 1.107245 | 0.363159 |
"min" | "(+)-Mefloquine... | 103.1 | 0.0 | 0.0 | -24.0 | 1.0 | 0.0 | 0.0 | 0.0 | 103.04 | 103.04 | -6.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
"max" | "sodium;8-amino... | 4114.0 | 1650.0 | 9590.0 | 20.2 | 291.0 | 57.0 | 65.0 | 151.0 | 4112.12 | 4111.12 | 2.0 | 21.0 | 1.0 | 39.0 | 39.0 | 31.0 | 11.0 | 11.0 | 7.0 |
"median" | null | 435.9 | 110.0 | 635.0 | 2.5 | 30.0 | 3.0 | 7.0 | 6.0 | 435.227 | 435.227 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Conditional assignments in Polars
The longer I’ve used Polars, the more I like its coding style of chaining several expressions together to manipulate dataframes in one go. This usually means we can avoid writing repeated loops to achieve the same results. In the example below, I’d like to show how to chain “when-then-otherwise” expressions in Polars.
Chaining when-then-otherwise expressions - creating groups in data
I had the idea of separating the data into 3 different ranges of partition coefficients, so that this could be shown visually in plots. One possible way (other than writing a loop), or really the long way, to do this might look like the code below:
```{python}
# Filter the dataframe three times, once per partition coefficient range
part_coef_1 = pc_cov.filter(pl.col("Partition coefficients") <= -10)
part_coef_2 = pc_cov.filter((pl.col("Partition coefficients") >= -11) & (pl.col("Partition coefficients") <= 5))
part_coef_3 = pc_cov.filter(pl.col("Partition coefficients") >= 6)
```
A shorter and probably more elegant way was to use the “when-then-otherwise” expression in Polars for conditional assignments (the following code snippet was adapted with thanks to the author of Polars, Ritchie Vink and also the good old Stack Overflow):
# with_columns() replaces with_column(), which was removed in later Polars releases
# pl.lit() marks the labels as literal strings rather than column names
pc_cov = pc_cov.with_columns(
    pl.when((pl.col("Partition coefficients") <= -10))
    .then(pl.lit("Smaller than -10"))
    .when((pl.col("Partition coefficients") >= -11) & (pl.col("Partition coefficients") <= 5))
    .then(pl.lit("Between -11 and 5"))
    .otherwise(pl.lit("Larger than 6"))
    .alias("Part_coef_group")
)
pc_cov.head(10)
# a new column would be added to the end of the dataframe
# with a new column name, "Part_coef_group"
# (scroll to the very right to see the added column)
Compound name | Molecular weight | Polar surface area | Complexity | Partition coefficients | Heavy atom count | Hydrogen bond donor count | Hydrogen bond acceptor count | Rotatable bond count | Exact mass | Monoisotopic mass | Formal charge | Covalently-bonded unit count | Isotope atom count | Total atom stereocenter count | Defined atom stereocenter count | Undefined atoms stereocenter count | Total bond stereocenter count | Defined bond stereocenter count | Undefined bond stereocenter count | Part_coef_group |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | f64 | f64 | f64 | f64 | i64 | i64 | i64 | i64 | f64 | f64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | str |
"Calcitriol" | 416.6 | 60.7 | 688.0 | 5.1 | 30 | 3 | 3 | 6 | 416.329 | 416.329 | 0 | 1 | 0 | 6 | 6 | 0 | 2 | 2 | 0 | "Larger than 6" |
"Ubiquinol" | 865.4 | 58.9 | 1600.0 | 20.2 | 63 | 2 | 4 | 31 | 864.7 | 864.7 | 0 | 1 | 0 | 0 | 0 | 0 | 9 | 9 | 0 | "Larger than 6" |
"Glutamine" | 146.14 | 106.0 | 146.0 | -3.1 | 10 | 3 | 4 | 4 | 146.069 | 146.069 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | "Between -11 an... |
"Aspirin" | 180.16 | 63.6 | 212.0 | 1.2 | 13 | 1 | 4 | 3 | 180.042 | 180.042 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | "Between -11 an... |
"1-Methylnicoti... | 137.16 | 47.0 | 136.0 | -0.1 | 10 | 1 | 1 | 1 | 137.071 | 137.071 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | "Between -11 an... |
"Losartan" | 422.9 | 92.5 | 520.0 | 4.3 | 30 | 2 | 5 | 8 | 422.162 | 422.162 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | "Between -11 an... |
"Vitamin E" | 430.7 | 29.5 | 503.0 | 10.7 | 31 | 1 | 2 | 12 | 430.381 | 430.381 | 0 | 1 | 0 | 3 | 3 | 0 | 0 | 0 | 0 | "Larger than 6" |
"Nicotinamide" | 122.12 | 56.0 | 114.0 | -0.4 | 9 | 1 | 2 | 1 | 122.048 | 122.048 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | "Between -11 an... |
"Adenosine" | 267.24 | 140.0 | 335.0 | -1.1 | 19 | 4 | 8 | 2 | 267.097 | 267.097 | 0 | 1 | 0 | 4 | 4 | 0 | 0 | 0 | 0 | "Between -11 an... |
"Inosine" | 268.23 | 129.0 | 405.0 | -1.3 | 19 | 4 | 7 | 2 | 268.081 | 268.081 | 0 | 1 | 0 | 4 | 4 | 0 | 0 | 0 | 0 | "Between -11 an... |
Import Plotly
Time for some data vizzes - importing Plotly first.
import plotly.express as px
Some examples of data visualisations
Below were some of the examples of building plots by using Plotly.
Partition coefficients vs. Molecular weights
fig = px.scatter(x = pc_cov["Partition coefficients"],
                 y = pc_cov["Molecular weight"],
                 hover_name = pc_cov["Compound name"],
                 color = pc_cov["Part_coef_group"],
                 width = 800,
                 height = 400,
                 title = "Partition coefficients vs. molecular weights for compounds used in COVID-19 clinical trials")

fig.update_layout(
    title = dict(
        font = dict(
            size = 15)),
    title_x = 0.5,
    margin = dict(
        l = 20, r = 20, t = 40, b = 3),
    xaxis = dict(
        tickfont = dict(size = 9),
        title = "Partition coefficients"
    ),
    yaxis = dict(
        tickfont = dict(size = 9),
        title = "Molecular weights"
    ),
    legend = dict(
        font = dict(
            size = 9)))

fig.show()
Molecular weights vs. Complexity
fig = px.scatter(x = pc_cov["Molecular weight"],
                 y = pc_cov["Complexity"],
                 hover_name = pc_cov["Compound name"],
                 #color = pc_cov["Part_coef_group"],
                 width = 800,
                 height = 400,
                 title = "Molecular weights vs. complexity for compounds used in COVID-19 clinical trials")

fig.update_layout(
    title = dict(
        font = dict(
            size = 15)),
    title_x = 0.5,
    margin = dict(
        l = 20, r = 20, t = 40, b = 3),
    xaxis = dict(
        tickfont = dict(size = 9),
        title = "Molecular weights"
    ),
    yaxis = dict(
        tickfont = dict(size = 9),
        title = "Complexity"
    ),
    legend = dict(
        font = dict(
            size = 9)))

fig.show()
Export prepared dataset
Two possible options to export the dataset for use in a Shiny app could be:
Convert the Polars dataframe into a Pandas dataframe, so that it could be imported into the app for use (Polars was not directly supported in Shiny for Python at the time of writing, but its to_pandas() function could convert a Polars dataframe into a Pandas dataframe).
Another option was to save the Polars dataframe as a .csv file, then read this file into the app.py script by using Pandas (which was the method I used for this particular app).
```{python}
# --If preferring to use Pandas--
# Convert Polars df into a Pandas df if needed
df_name = df_name.to_pandas()

# Convert the Pandas df into a csv file using Pandas
df_name.to_csv("csv_file_name.csv", sep = ",")

# --If preferring to use Polars--
# Simply write a Polars dataframe into a .csv file
df_name.write_csv("csv_file_name.csv", separator = ",")
```