Pills dataset - Part 2

Text cleaning using Polars & visualising pills with Plotly

Data analytics projects
Pills dataset series
Polars
Python
Plotly
Jupyter
Author

Jennifer HY Lin

Published

January 31, 2023

Quick overview

Part 2 of this project aimed to look at the pills data up close, particularly into the types of dosage forms, colours, shapes and inactive excipients used in oral medications. Plotly was used as the main data visualisation library, which was followed by some text cleaning for a particularly busy column in the dataset. This was then completed with a section in the end to generate a small dataframe, preparing for a simple data visualisation in Rust-Evcxr for the final part of this project (part 3).


Photo by Myriam Zilles on Unsplash


Import libraries and pills dataset
import polars as pl
import plotly.express as px

The pills.csv file saved from part 1 was imported as shown below.

df = pl.read_csv("pills.csv")
df
shape: (83925, 5)
splshape_text splcolor_text spl_strength spl_inactive_ing DosageForm
str str str str str
"CAPSULE" "PINK" "TEMAZEPAM 15 m... "SILICON DIOXID... "CAPSULE"
"ROUND" "ORANGE" "IBUPROFEN 200 ... "SILICON DIOXID... "TABLET, FILM C...
"PENTAGON (5 SI... "GREEN" "DEXAMETHASONE ... "ANHYDROUS LACT... "TABLET"
"ROUND" "WHITE" "Nickel Sulfate... null "TABLET"
"ROUND" "WHITE" "CLONAZEPAM 0.2... "SORBITOL;ASPAR... "TABLET, ORALLY...
"ROUND" "WHITE" "SILDENAFIL CIT... "ANHYDROUS DIBA... "TABLET, FILM C...
"OVAL" "YELLOW" "RISPERIDONE 3 ... "LACTOSE MONOHY... "TABLET, FILM C...
"CAPSULE" "BLUE" "IBUPROFEN 200 ... "FD&C BLUE NO. ... "CAPSULE, LIQUI...
"ROUND" "WHITE" "Iloperidone 12... "silicon dioxid... "TABLET"
"CAPSULE" "YELLOW;WHITE" "FENOPROFEN CAL... "CROSPOVIDONE;M... "CAPSULE"
"ROUND" "YELLOW" "BUTALBITAL 50 ... "STARCH, CORN;C... "TABLET"
"ROUND" "WHITE" "ESTRADIOL 0.5 ... "COPOVIDONE K25... "TABLET"
... ... ... ... ...
"ROUND" "WHITE" "MEMANTINE HYDR... "SILICON DIOXID... "TABLET"
"CAPSULE" "ORANGE" "ACETAMINOPHEN ... "BUTYLATED HYDR... "CAPSULE, LIQUI...
"ROUND" "WHITE" "LAMOTRIGINE 25... "MAGNESIUM CARB... "TABLET, CHEWAB...
"OVAL" "BLUE" "ACETAMINOPHEN ... "ACESULFAME POT... "TABLET, COATED...
"OVAL" "WHITE" "AZITHROMYCIN D... "CROSCARMELLOSE... "TABLET, FILM C...
"OVAL" "BLUE" "IBUPROFEN 200 ... "FD&C BLUE NO. ... "CAPSULE, LIQUI...
"OVAL" "WHITE" "CETIRIZINE HYD... "STARCH, CORN;H... "TABLET"
"OVAL" "BROWN" "OMEPRAZOLE 20 ... "CARNAUBA WAX;F... "TABLET, DELAYE...
"ROUND" "PINK;ORANGE;YE... "CALCIUM CARBON... "CITRIC ACID MO... "TABLET, CHEWAB...
"OVAL" "GREEN" "ACETAMINOPHEN ... "STARCH, CORN;D... "TABLET, FILM C...
"CAPSULE" "BLUE" "Amlodipine bes... "Cellulose, mic... "CAPSULE"
"ROUND" "ORANGE" "DARIFENACIN 15... "ANHYDROUS DIBA... "TABLET, EXTEND...


Change column names

Again, column names were changed to something easier to read.

# Rename all column names
df_new = df.rename({"splcolor_text": "Colour", 
                    "splshape_text": "Shape", 
                    "spl_strength": "Drug_strength", 
                    "spl_inactive_ing": "Inactive_excipients", 
                    "DosageForm": "Dosage_form"}
                  )
df_new
shape: (83925, 5)
Shape Colour Drug_strength Inactive_excipients Dosage_form
str str str str str
"CAPSULE" "PINK" "TEMAZEPAM 15 m... "SILICON DIOXID... "CAPSULE"
"ROUND" "ORANGE" "IBUPROFEN 200 ... "SILICON DIOXID... "TABLET, FILM C...
"PENTAGON (5 SI... "GREEN" "DEXAMETHASONE ... "ANHYDROUS LACT... "TABLET"
"ROUND" "WHITE" "Nickel Sulfate... null "TABLET"
"ROUND" "WHITE" "CLONAZEPAM 0.2... "SORBITOL;ASPAR... "TABLET, ORALLY...
"ROUND" "WHITE" "SILDENAFIL CIT... "ANHYDROUS DIBA... "TABLET, FILM C...
"OVAL" "YELLOW" "RISPERIDONE 3 ... "LACTOSE MONOHY... "TABLET, FILM C...
"CAPSULE" "BLUE" "IBUPROFEN 200 ... "FD&C BLUE NO. ... "CAPSULE, LIQUI...
"ROUND" "WHITE" "Iloperidone 12... "silicon dioxid... "TABLET"
"CAPSULE" "YELLOW;WHITE" "FENOPROFEN CAL... "CROSPOVIDONE;M... "CAPSULE"
"ROUND" "YELLOW" "BUTALBITAL 50 ... "STARCH, CORN;C... "TABLET"
"ROUND" "WHITE" "ESTRADIOL 0.5 ... "COPOVIDONE K25... "TABLET"
... ... ... ... ...
"ROUND" "WHITE" "MEMANTINE HYDR... "SILICON DIOXID... "TABLET"
"CAPSULE" "ORANGE" "ACETAMINOPHEN ... "BUTYLATED HYDR... "CAPSULE, LIQUI...
"ROUND" "WHITE" "LAMOTRIGINE 25... "MAGNESIUM CARB... "TABLET, CHEWAB...
"OVAL" "BLUE" "ACETAMINOPHEN ... "ACESULFAME POT... "TABLET, COATED...
"OVAL" "WHITE" "AZITHROMYCIN D... "CROSCARMELLOSE... "TABLET, FILM C...
"OVAL" "BLUE" "IBUPROFEN 200 ... "FD&C BLUE NO. ... "CAPSULE, LIQUI...
"OVAL" "WHITE" "CETIRIZINE HYD... "STARCH, CORN;H... "TABLET"
"OVAL" "BROWN" "OMEPRAZOLE 20 ... "CARNAUBA WAX;F... "TABLET, DELAYE...
"ROUND" "PINK;ORANGE;YE... "CALCIUM CARBON... "CITRIC ACID MO... "TABLET, CHEWAB...
"OVAL" "GREEN" "ACETAMINOPHEN ... "STARCH, CORN;D... "TABLET, FILM C...
"CAPSULE" "BLUE" "Amlodipine bes... "Cellulose, mic... "CAPSULE"
"ROUND" "ORANGE" "DARIFENACIN 15... "ANHYDROUS DIBA... "TABLET, EXTEND...


Visualising oral dosage forms & colours in pills

Grabbing only unique drugs in the dataset to minimise duplications.

df_viz = df_new.unique(subset = "Drug_strength")
df_viz
shape: (9287, 5)
Shape Colour Drug_strength Inactive_excipients Dosage_form
str str str str str
"CAPSULE" "PINK" "TEMAZEPAM 15 m... "SILICON DIOXID... "CAPSULE"
"ROUND" "ORANGE" "IBUPROFEN 200 ... "SILICON DIOXID... "TABLET, FILM C...
"PENTAGON (5 SI... "GREEN" "DEXAMETHASONE ... "ANHYDROUS LACT... "TABLET"
"ROUND" "WHITE" "Nickel Sulfate... null "TABLET"
"ROUND" "WHITE" "CLONAZEPAM 0.2... "SORBITOL;ASPAR... "TABLET, ORALLY...
"ROUND" "WHITE" "SILDENAFIL CIT... "ANHYDROUS DIBA... "TABLET, FILM C...
"OVAL" "YELLOW" "RISPERIDONE 3 ... "LACTOSE MONOHY... "TABLET, FILM C...
"ROUND" "WHITE" "Iloperidone 12... "silicon dioxid... "TABLET"
"CAPSULE" "YELLOW;WHITE" "FENOPROFEN CAL... "CROSPOVIDONE;M... "CAPSULE"
"ROUND" "YELLOW" "BUTALBITAL 50 ... "STARCH, CORN;C... "TABLET"
"ROUND" "WHITE" "ESTRADIOL 0.5 ... "COPOVIDONE K25... "TABLET"
"CAPSULE" "YELLOW" "BENZONATATE 20... "D&C YELLOW NO.... "CAPSULE"
... ... ... ... ...
"ROUND" "WHITE" "BUSULFAN 2 mg" "HYPROMELLOSES ... "TABLET, FILM C...
"OVAL" "ORANGE" "AMPHETAMINE SU... "CELLULOSE, MIC... "TABLET"
"ROUND" "GREEN" "FOLIC ACID 800... "MICROCRYSTALLI... "TABLET"
"ROUND" "WHITE" "ACONITUM NAPEL... "LACTOSE / MAGN... "TABLET"
"ROUND" "WHITE" "MATRICARIA CHA... "ACACIA;LACTOSE... "TABLET"
"OVAL" "RED" "FOLIC ACID 1 m... "CELLULOSE, MIC... "TABLET"
"ROUND" "BLUE" "Labetalol 300 ... "ANHYDROUS LACT... "TABLET, FILM C...
"CAPSULE" "YELLOW;BLUE" "POMALIDOMIDE 1... "MANNITOL;STARC... "CAPSULE"
"CAPSULE" "PURPLE" "CALCIUM CITRAT... "GELATIN;" "CAPSULE, GELAT...
"CAPSULE" "GREEN;YELLOW" "LENALIDOMIDE 1... "ANHYDROUS LACT... "CAPSULE"
"OVAL" "WHITE" "folic acid 1 m... "cellulose, mic... "TABLET"
"CAPSULE" "BLUE" "Amlodipine bes... "Cellulose, mic... "CAPSULE"

One way to avoid switching Polars dataframe to a Pandas one, which could be one of the options to plot data from Polars dataframes in Plotly, was to call the x-axis and y-axis data directly from the dataframe as shown in the code below.

# scatter plot for colours, dosage forms & drug strengths 
fig = px.scatter(x = df_viz["Colour"], 
                 y = df_viz["Dosage_form"], 
                 color = df_viz["Colour"],
                 hover_name = df_viz["Drug_strength"],
                 width = 900, 
                 height = 400,
                 title = "Oral dosage forms and colours of pills")

# Update layout of the plot
fig.update_layout(
    # Change title font size
    title = dict(
        font = dict(
            size = 15)),
    # Centre the title
    title_x = 0.5,
    # Edit margins
    margin = dict(
        l = 20, r = 20, t = 40, b = 3),
    # Change x-axis
    xaxis = dict(
        tickfont = dict(size = 9), 
        title = "Colours"
    ),
    # Change y-axis
    yaxis = dict(
        tickfont = dict(size = 9), 
        title = "Dosage forms"
    ),
    # Edit lengend font size
    legend = dict(
        font = dict(
            size = 9)))

fig.show()


White was the most common colour, especially after zooming in the plot. Capsule was very commonly used as the oral dosage form of choice in this dataset.


Visualising shapes & colours in pills
fig = px.scatter(x = df_viz["Colour"], 
                 y = df_viz["Shape"], 
                 color = df_viz["Colour"],
                 hover_name = df_viz["Drug_strength"],
                 width = 900, 
                 height = 400,
                 title = "Shapes and colours of pills")

# Update layout of the plot
fig.update_layout(
    # Change title font size
    title = dict(
        font = dict(
            size = 15)),
    # Centre the title
    title_x = 0.5,
    # Edit margins
    margin = dict(
        l = 20, r = 20, t = 40, b = 3),
    # Change x-axis
    xaxis = dict(
        tickfont = dict(size = 9), 
        title = "Colours"
    ),
    # Change y-axis
    yaxis = dict(
        tickfont = dict(size = 9), 
        title = "Shapes"
    ),
    # Edit lengend font size
    legend = dict(
        font = dict(
            size = 9)))

fig.show()


Capsule was again the most common oral dosage shape used for pills in the dataset. Common colours included red, brown, blue, purple, pink, orange, green, white and yellow. Combination colours followed these common ones, which had a mixture of a variety of colours used simultaneously, likely to avoid confusions and errors in dispensings or administrations.


Visualising inactive excipients in pills

The messiest part of the data actually lied in the column of “Inactive_excipients”, with numerous different punctuations used inconsistently, such as forward slashes, commas and semi-colons. There were vast quantities of different inactive components used for oral dosage forms. Because of this, I had to spend a bit more time cleaning up the texts in order to find out what were the commonly used inactive ingredients in the end.

# Formulated a separate dataframe with just "Inactive_excipients"
df_ie = df_new.select([pl.col("Inactive_excipients")])
df_ie
shape: (83925, 1)
Inactive_excipients
str
"SILICON DIOXID...
"SILICON DIOXID...
"ANHYDROUS LACT...
null
"SORBITOL;ASPAR...
"ANHYDROUS DIBA...
"LACTOSE MONOHY...
"FD&C BLUE NO. ...
"silicon dioxid...
"CROSPOVIDONE;M...
"STARCH, CORN;C...
"COPOVIDONE K25...
...
"SILICON DIOXID...
"BUTYLATED HYDR...
"MAGNESIUM CARB...
"ACESULFAME POT...
"CROSCARMELLOSE...
"FD&C BLUE NO. ...
"STARCH, CORN;H...
"CARNAUBA WAX;F...
"CITRIC ACID MO...
"STARCH, CORN;D...
"Cellulose, mic...
"ANHYDROUS DIBA...
Text cleaning for inactive excipients column

To prepare this column for data visualisations, I used Polars’ string expressions (or more commonly known as regex - regular expressions) to try and tidy up the raw texts. When I did the text cleaning in Jupyter Lab initially, the line of code for .str.strip(” ,“) worked, but when I converted the .ipynb file into a .qmd (Quarto markdown) one, and used the same line, it failed to work due to the extra space in front of the comma. However, I got around the error by splitting it into two separate units as space and comma, and it worked without problem. One possible reason would be due to the reticulate package needed to run Python in RStudio IDE, and how Polars dataframe library was relatively newer than Pandas dataframe library, which meant certain features in Polars might not have been taken on board in the reticulate package (only my guess).

# Clean string texts 
# Convert uppercase letters into lowercase ones in the excipients column
df_de = (df_ie.with_column(pl.col("Inactive_excipients").str.to_lowercase(
    # replace old punctuations (1st position) with new one (2nd position)
    ).str.replace_all(
        ";", ", "
    ).str.replace_all(
        " /", ", "
    ).str.replace_all(
        "/", ", "
    # Remove extra space & comma by stripping
    # In Jupyter notebook/lab - can combine space & comma: .str.strip(" ,")
    # For RStudio IDE - separate into two for this to work
    ).str.strip(
        " "
    ).str.strip(
        ","
    # Split the texts by the specified punctuation e.g. comma with space
    ).str.split(
        by = ", "
    # Create a new column with a new name
    ).alias(
        "Inactive"
    )
# Explode the splitted texts into separate rows within the new column
).explode(
    "Inactive"
)
)

df_de
shape: (840029, 2)
Inactive_excipients Inactive
str str
"SILICON DIOXID... "silicon dioxid...
"SILICON DIOXID... "edetate disodi...
"SILICON DIOXID... "lactose monohy...
"SILICON DIOXID... "magnesium stea...
"SILICON DIOXID... "cellulose"
"SILICON DIOXID... "microcrystalli...
"SILICON DIOXID... "starch"
"SILICON DIOXID... "corn"
"SILICON DIOXID... "sodium lauryl ...
"SILICON DIOXID... "fd&c blue no. ...
"SILICON DIOXID... "fd&c red no. 4...
"SILICON DIOXID... "gelatin"
... ...
"Cellulose, mic... "shellac"
"Cellulose, mic... "propylene glyc...
"Cellulose, mic... "ammonia"
"Cellulose, mic... "fd&c blue no. ...
"ANHYDROUS DIBA... "anhydrous diba...
"ANHYDROUS DIBA... "ferric oxide r...
"ANHYDROUS DIBA... "hypromelloses"
"ANHYDROUS DIBA... "polyethylene g...
"ANHYDROUS DIBA... "magnesium stea...
"ANHYDROUS DIBA... "titanium dioxi...
"ANHYDROUS DIBA... "talc"
"ANHYDROUS DIBA... "ferric oxide y...
# Quick look at the dataframe to see before and after text cleaning
print(df_de.glimpse())
Rows: 840029
Columns: 2
$ Inactive_excipients <Utf8> SILICON DIOXIDE;EDETATE DISODIUM;LACTOSE MONOHYDRATE;MAGNESIUM STEARATE;CELLULOSE, MICROCRYSTALLINE;STARCH, CORN;SODIUM LAURYL SULFATE;FD&C BLUE NO. 1;FD&C RED NO. 40;GELATIN;TITANIUM DIOXIDE;BUTYL ALCOHOL;, SILICON DIOXIDE;EDETATE DISODIUM;LACTOSE MONOHYDRATE;MAGNESIUM STEARATE;CELLULOSE, MICROCRYSTALLINE;STARCH, CORN;SODIUM LAURYL SULFATE;FD&C BLUE NO. 1;FD&C RED NO. 40;GELATIN;TITANIUM DIOXIDE;BUTYL ALCOHOL;, SILICON DIOXIDE;EDETATE DISODIUM;LACTOSE MONOHYDRATE;MAGNESIUM STEARATE;CELLULOSE, MICROCRYSTALLINE;STARCH, CORN;SODIUM LAURYL SULFATE;FD&C BLUE NO. 1;FD&C RED NO. 40;GELATIN;TITANIUM DIOXIDE;BUTYL ALCOHOL;, SILICON DIOXIDE;EDETATE DISODIUM;LACTOSE MONOHYDRATE;MAGNESIUM STEARATE;CELLULOSE, MICROCRYSTALLINE;STARCH, CORN;SODIUM LAURYL SULFATE;FD&C BLUE NO. 1;FD&C RED NO. 40;GELATIN;TITANIUM DIOXIDE;BUTYL ALCOHOL;, SILICON DIOXIDE;EDETATE DISODIUM;LACTOSE MONOHYDRATE;MAGNESIUM STEARATE;CELLULOSE, MICROCRYSTALLINE;STARCH, CORN;SODIUM LAURYL SULFATE;FD&C BLUE NO. 1;FD&C RED NO. 40;GELATIN;TITANIUM DIOXIDE;BUTYL ALCOHOL;, SILICON DIOXIDE;EDETATE DISODIUM;LACTOSE MONOHYDRATE;MAGNESIUM STEARATE;CELLULOSE, MICROCRYSTALLINE;STARCH, CORN;SODIUM LAURYL SULFATE;FD&C BLUE NO. 1;FD&C RED NO. 40;GELATIN;TITANIUM DIOXIDE;BUTYL ALCOHOL;, SILICON DIOXIDE;EDETATE DISODIUM;LACTOSE MONOHYDRATE;MAGNESIUM STEARATE;CELLULOSE, MICROCRYSTALLINE;STARCH, CORN;SODIUM LAURYL SULFATE;FD&C BLUE NO. 1;FD&C RED NO. 40;GELATIN;TITANIUM DIOXIDE;BUTYL ALCOHOL;, SILICON DIOXIDE;EDETATE DISODIUM;LACTOSE MONOHYDRATE;MAGNESIUM STEARATE;CELLULOSE, MICROCRYSTALLINE;STARCH, CORN;SODIUM LAURYL SULFATE;FD&C BLUE NO. 1;FD&C RED NO. 40;GELATIN;TITANIUM DIOXIDE;BUTYL ALCOHOL;, SILICON DIOXIDE;EDETATE DISODIUM;LACTOSE MONOHYDRATE;MAGNESIUM STEARATE;CELLULOSE, MICROCRYSTALLINE;STARCH, CORN;SODIUM LAURYL SULFATE;FD&C BLUE NO. 1;FD&C RED NO. 40;GELATIN;TITANIUM DIOXIDE;BUTYL ALCOHOL;, SILICON DIOXIDE;EDETATE DISODIUM;LACTOSE MONOHYDRATE;MAGNESIUM STEARATE;CELLULOSE, MICROCRYSTALLINE;STARCH, CORN;SODIUM LAURYL SULFATE;FD&C BLUE NO. 1;FD&C RED NO. 40;GELATIN;TITANIUM DIOXIDE;BUTYL ALCOHOL;
$ Inactive            <Utf8> silicon dioxide, edetate disodium, lactose monohydrate, magnesium stearate, cellulose, microcrystalline, starch, corn, sodium lauryl sulfate, fd&c blue no. 1

As shown above, the “Inactive_excipients” column was the original column for excipients, where the second column named, “Inactive” was the new column shown after the punctuation tidy-ups, string strip and row text explosion. The excipients were broken down into individual terms, rather than in massively long strings which might not make sense to some readers.

# Re-organise the dataframe to choose the cleaned "Inactive" column
df_final = df_de.select(["Inactive"])
df_final
shape: (840029, 1)
Inactive
str
"silicon dioxid...
"edetate disodi...
"lactose monohy...
"magnesium stea...
"cellulose"
"microcrystalli...
"starch"
"corn"
"sodium lauryl ...
"fd&c blue no. ...
"fd&c red no. 4...
"gelatin"
...
"shellac"
"propylene glyc...
"ammonia"
"fd&c blue no. ...
"anhydrous diba...
"ferric oxide r...
"hypromelloses"
"polyethylene g...
"magnesium stea...
"titanium dioxi...
"talc"
"ferric oxide y...
# Remove all cells with null values
df_final = df_final.drop_nulls()
# Group the data by different inactive excipients with counts shown
df_final = df_final.groupby("Inactive").agg(pl.count())
df_final.head()
shape: (5, 2)
Inactive count
str u32
"low-substitute... 4
"sodium starch ... 118
" glyceryl dibe... 3
"aluminum chlor... 27
"mentha piperit... 7
Inactive excipient counts
# Count each excipient and cast the whole column into integers
df_final = df_final.with_column((pl.col("count")).cast(pl.Int64, strict = False))
df_final
shape: (1674, 2)
Inactive count
str i64
"low-substitute... 4
"sodium starch ... 118
" glyceryl dibe... 3
"aluminum chlor... 27
"mentha piperit... 7
"epimedium gran... 1
" ethyl acetate... 2
"rutin" 1
"methacrylic ac... 2106
" calcium phosp... 12
"carbomer homop... 28
" tocopherol" 2
... ...
"methylcellulos... 62
"carbomer homop... 27
" red ferric ox... 3
"anhydrous lact... 4
"sorbic acid" 195
"ilex pedunculo... 2
" aminobenzoic ... 1
"polyvinyl alco... 55
"3-hexenyl acet... 4
"methacrylic ac... 2
"dihydroxyalumi... 2
"hydroxypropyl ... 46
Overview of inactive excipients used in oral dosage forms
fig = px.scatter(x = df_final["Inactive"], 
                 y = df_final["count"], 
                 hover_name = df_final["Inactive"],
                 title = "Inactive excipients and their respective counts in pills")

fig.update_layout(
    title = dict(
        font = dict(
            size = 15)),
    title_x = 0.5,
    margin = dict(
        l = 20, r = 20, t = 40, b = 10),
    xaxis = dict(
        tickfont = dict(size = 9), 
        title = "Inactive excipients"
    ),
    yaxis = dict(
        tickfont = dict(size = 9), 
        title = "Counts"
    ),
    legend = dict(
        font = dict(
            size = 9)))


fig.show()
Frequently used inactive excipients
# Re-order the excipients with counts in descending order
# Filter out only the ones with counts over 10,000
df_ex = df_final.sort("count", reverse = True).filter((pl.col("count")) >= 10000)
df_ex.head()
shape: (5, 2)
Inactive count
str i64
"magnesium stea... 58908
"titanium dioxi... 43241
"unspecified" 35744
"silicon dioxid... 34037
"starch" 32501
fig = px.bar(x = df_ex["Inactive"], 
             y = df_ex["count"], 
             color = df_ex["Inactive"],
             title = "Commonly used inactive excipients in pills")

fig.update_layout(
    title = dict(
        font = dict(
            size = 15)),
    title_x = 0.5,
    margin = dict(
        l = 10, r = 10, t = 40, b = 5),
    xaxis = dict(
        tickfont = dict(size = 9), 
        title = "Inactive excipients"
    ),
    yaxis = dict(
        tickfont = dict(size = 9), 
        title = "Counts"
    ),
    legend = dict(
        font = dict(
            size = 9)))

fig.show()


The text cleaning might not be perfect at this stage, but I think I’ve managed to get most of the core texts cleaned into a more sensible and readable formats. From what I’ve worked out here, the most frequently used inactive ingredient was magnesium stearate, which was followed by titanium dioxide, and then interestingly “unspecified”, which was exactly how it was documented in the original pillbox dataset at the beginning. I didn’t go further digging into what this “unspecified” inactive excipients might be, as in whether it meant it in a singular or plural forms. So this still remained a mystery at this stage, but if all these oral medications were FDA-approved, we would’ve hoped each and everyone of these pills would be verified in safety, quality and effectiveness before they entered into the market for wide prescriptions. In the worst case, each therapeutic drug should also have post-marketing surveillance, for long-term safety monitoring.


Create a small dataframe for data visualisation in Rust-Evcxr

All acetaminophens were filtered out in the “Drug_strength” column and all duplicates were removed in the dataset.

df_ac = df_new.filter(
    pl.col("Drug_strength")
    .str.starts_with("acetam")).unique(subset = ["Drug_strength"])

df_ac
shape: (13, 5)
Shape Colour Drug_strength Inactive_excipients Dosage_form
str str str str str
"CAPSULE" "RED" "acetaminophen ... "starch, corn /... "CAPSULE"
"OVAL" "WHITE" "acetaminophen ... "anhydrous citr... "TABLET, FILM C...
"ROUND" "WHITE" "acetaminophen ... "powdered cellu... "TABLET"
"ROUND" "WHITE" "acetaminophen ... "hydroxypropyl ... "TABLET"
"OVAL" "ORANGE" "acetaminophen ... null "TABLET, FILM C...
"ROUND" "BLUE" "acetaminophen ... "FD&C BLUE NO. ... "TABLET"
"ROUND" "YELLOW;WHITE" "acetaminophen ... "calcium steara... "TABLET"
"OVAL" "WHITE" "acetaminophen ... "carnauba wax;s... "TABLET, FILM C...
"ROUND" "WHITE" "acetaminophen ... "powdered cellu... "TABLET"
"CAPSULE" "ORANGE;BROWN" "acetaminophen ... "CALCIUM PHOSPH... "CAPSULE"
"OVAL" "WHITE" "acetaminophen ... "carnauba wax;H... "TABLET, FILM C...
"ROUND" "WHITE" "acetaminophen ... "calcium steara... "TABLET"
"OVAL" "BLUE" "acetaminophen ... "carnauba wax;C... "TABLET, FILM C...

I’ve opted for finding out the different types of colours with their respective counts in oral acetaminophen, or also known as paracetamol in some other countries.

df_ac = df_ac.groupby("Colour").agg(pl.count())
df_ac
shape: (6, 2)
Colour count
str u32
"ORANGE;BROWN" 1
"YELLOW;WHITE" 1
"WHITE" 7
"ORANGE" 1
"RED" 1
"BLUE" 2
fig = px.scatter(x = df_ac["Colour"], 
                 y = df_ac["count"], 
                 size = df_ac["count"], 
                 color = df_ac["Colour"],
                 title = "Frequency of colours in acetaminophen (paracetamol) oral dosage forms"
                )

fig.update_layout(
    xaxis = dict(
        title = "Colours"
    ), 
    yaxis = dict(
        title = "Counts"
    )
)

fig.show()


I’ve decided to keep the dataframe very simple for part 3 as my original intention was to trial plotting a graph in Evcxr only (nothing fancy at this stage), and also to gain some familiarities with Rust as another new programming language for me. Readers might notice that I’ve opted for a scatter plot in Plotly (in Python3 kernel) here for this last dataframe, and when we finally got to part 3 (hopefully coming soon as I needed to figure how to publish Rust code in Quarto…), I might very likely revert this to a bar graph (in Rust kernel), due to some technical issues (slow dependency loading, and somehow with Plotly.rs in Evcxr, the scatter graph looked more like scatter line graph instead… more stories to follow) and being a new Rust-Evcxr user. At the very least, I’ve kind of tried something I’ve planned for, although not looking very elegant yet, with rooms for improvements in the future.