Pills dataset - Part 1

Web scraping, Polars & Pandas dataframe libraries

Data analytics projects
Polars
Pandas
Python
Jupyter
Web-scraping
Author

Jennifer HY Lin

Published

January 21, 2023

Introduction

As mentioned in my last project, I’ve tried using Evcxr, which provided a way to use Rust interactively in a Jupyter environment. The name, “Evcxr”, was quite hard to remember at first. It was pronounced as “e-vic-ser” according to the author, which I’ve randomly come across in an online tech interview when I was looking into it. I’ve also sort of worked out a way to memorise its spelling by taking specific letters out of “evaluation context for rust” (which was what it was called in its GitHub repository).

For users of Jupyter Notebook/Lab and Python, they might be quite used to the working speed of the cell outputs. However, one thing I’ve noticed when I was using Evcxr or Rust kernel in Jupyter Lab was that the speed of cell outputs was noticeably slower (especially at the beginning while loading all the dependencies required). The speed improved when loading external crates and modules, and generally it was faster afterward.

Due to this reason (note: I did not look into any other optimising strategies for this and this could be restricted to my computer hardware specs, so this might differ for other users), I think Evcxr was not ideal for a very large and complex data science project yet (however if its ecosystem kept developing, it might be improved in the future). One thing of note was that when I was combing through issues in Evcxr’s GitHub repository, someone mentioned the slow compile time of the Rust compiler, which would have likely caused the snail speed, but knowing that the actual program running speed was blazingly fast, some sacrifice at the beginning made sense to me. Overall, Rust was really a systems programming language with memory efficiency (with no garbage collector), type safety and concurrency as some of its notable advantages.

Because of the dependency loading issue in the Jupyter environment, and also knowing there was already a dataframe library built from Rust, I’ve opted to use Polars-Python again for the data wrangling part of this project. This was also accompanied by the good old Pandas library as well (under the section of “Transform web-scraped data into dataframe” if anyone wants to jump to that part to see the codes). I then went on to trial using Rust via Evcxr for data visualisation based on a small dataframe by using Plotly.rs. This project would be separated into 3 parts:

  • Part 1: Initial pillbox dataset loading and web-scraping
  • Part 2: Data wrangling and mining for data visualisations
  • Part 3: Using Rust for data visualisation

The main reason I wanted to try Evcxr was that I could see the potential of using Rust interactively to showcase the results in a relatively fast and efficient manner. This meant specific data exploratory results could reach wider audience, leading to more impacts in different fields, in a very broad term. Oppositely, for more specific users such as scientists or engineers, this meant experiments could be carried out in a safe and efficient manner, with test results readily available for future work planning.


Download dataset

This time the dataset was spotted from Data Is Plural, specifically the 2022.11.30 edition. The section I was interested in was the first paragraph at the top, about “Pills”. By going into one of the links provided in the paragraph, this brought me to the Pillbox dataset from the US National Library of Medicine (NLM). The .csv file was downloaded via the “Export” button at the top right of the webpage.

This pillbox dataset was actually retired since 28th January 2021, but was still available for educational or research purposes only. Therefore, it was not recommended for pill identifications as the dataset was not up-to-date. Alternative resources such as DailyMed would be more appropriate for readers in the US (as one of the examples). For readers in other countries, local health professionals and resources would be recommended for up-to-date information.


Importing library & dataset

# Install/upgrade polars if needed (uncomment the line below)
#pip install --upgrade polars
import polars as pl
# Check version of polars (uncomment line below)
#pl.show_versions()
df = pl.read_csv("pillbox.csv", ignore_errors = True)
df
shape: (83925, 55)
ID Enabled? created at updated at spp setid splsize pillbox_size splshape splshape_text pillbox_shape_text splscore pillbox_score splimprint pillbox_imprint splcolor splcolor_text pillbox_color_text spl_strength spl_ingredients spl_inactive_ing source rxtty rxstring rxcui RxNorm Update time product_code part_num part_medicine_name ndc9 ndc_labeler_code ndc_product_code medicine_name marketing_act_code effective_time file_name equal_product_code dosage_form document_type dea_schedule_code dea_schedule_name author_type author approval_code image_source splimage has_image epc_match version_number laberer_code application_number updated stale new Pillbox Value
i64 bool str str str str i64 str str str str i64 i64 str str str str str str str str str str str i64 str str i64 str i64 i64 i64 str str i64 str str str str str str str str str str str bool i64 i64 str str bool bool bool bool
41846 true "10/17/2017 05:... "10/02/2020 05:... "471fa2f1-73a0-... "471fa2f1-73a0-... 16 null "C48336" "CAPSULE" null 1 null "5892;V" null "C48328" "PINK" null "TEMAZEPAM 15 m... "TEMAZEPAM[TEMA... "SILICON DIOXID... "HRX" null null null "10/02/2020 04:... "0603-5892" 0 null 6035892 603 5892 "Temazepam" "completed" 20160406 "d912ca54-6569-... null "C25158" null "C48677" "CIV" "LABELER" "Qualitest Phar... "C73584" null null false null 5 null null true false false false
8100 true "10/17/2017 05:... "10/02/2020 04:... "116e13c1-ac50-... "116e13c1-ac50-... 10 null "C48348" "ROUND" null 1 null "I2" null "C48331" "ORANGE" null "IBUPROFEN 200 ... "IBUPROFEN[IBUP... "SILICON DIOXID... "HOTC" "SCD" "ibuprofen 200 ... 310965 "10/02/2020 03:... "59779-074" 0 null 597790074 59779 74 "ibuprofen" "active" 20191120 "55de9f94-89b2-... null "C42931" "34390-5" null null "LABELER" "CVS Pharmacy" "C73584" null null false null 4 null null true false false false
5258 true "10/17/2017 05:... "10/17/2017 05:... "827ce261-307b-... "827ce261-307b-... 7 null "C48346" "PENTAGON (5 SI... null 1 2 "par;129" null "C48329" "GREEN" null "DEXAMETHASONE ... "DEXAMETHASONE[... "ANHYDROUS LACT... "HRX" null "Dexamethasone ... 197583 null "49884-129" 0 null 498840129 49884 129 "Dexamethasone" "active" 20120516 "85a9eebb-be74-... null "C42998" "34391-3" null null "LABELER" "Par Pharmaceut... "C73584" "NLM" "498840129" true null 4 null null false true false true
21271 true "10/17/2017 05:... "10/02/2020 05:... "f7f1c99e-1a67-... "f7f1c99e-1a67-... 11 null "C48348" "ROUND" null 2 null "LL" null "C48325" "WHITE" null "Nickel Sulfate... "Nickel Sulfate... null "HOMEO" null null null "10/02/2020 04:... "61480-137" 0 null 614800137 61480 137 "Acunol" "active" 20190909 "029eaf64-e66f-... null "C42998" "34391-3" null null "LABELER" "PLYMOUTH HEALT... "C73614" null null false null 8 null null true false false false
77050 true "09/20/2019 09:... "10/02/2020 05:... "ecb28fcb-f0d1-... "ecb28fcb-f0d1-... 6 null "C48348" "ROUND" null 1 null "L;524" null "C48325" "WHITE" null "CLONAZEPAM 0.2... "CLONAZEPAM[CLO... "SORBITOL;ASPAR... "HRX" "SCD" "clonazepam 0.2... 349195 "10/02/2020 04:... "62332-365" 0 null 623320365 62332 365 "CLONAZEPAM" "active" 20190701 "ba5120ce-ed74-... null "C42999" null "C48677" "CIV" "LABELER" "Alembic Pharma... "C73584" null null false null 3 null null true false false false
76916 true "09/20/2019 09:... "10/02/2020 05:... "442e41da-24c2-... "442e41da-24c2-... 9 null "C48348" "ROUND" null 1 null "LU;V06" null "C48325" "WHITE" null "SILDENAFIL CIT... "SILDENAFIL CIT... "ANHYDROUS DIBA... "HRX" "SCD" "sildenafil 50 ... 312950 "10/02/2020 04:... "70748-132" 0 null 707480132 70748 132 "SILDENAFIL" "active" 20191001 "17c537d9-b2e6-... null "C42931" null null null "LABELER" "Lupin Pharmace... "C73584" null null false null 4 null null true false false false
20016 true "10/17/2017 05:... "10/02/2020 05:... "5f0bdf9d-fa78-... "5f0bdf9d-fa78-... 14 null "C48345" "OVAL" null 1 null "A;73" null "C48330" "YELLOW" null "RISPERIDONE 3 ... "RISPERIDONE[RI... "LACTOSE MONOHY... "HRX" "SCD" "risperidone 3 ... 312832 "10/02/2020 04:... "65862-123" 0 null 658620123 65862 123 "Risperidone" "active" 20180924 "5d2750a4-025a-... null "C42931" "34391-3" null null "LABELER" "Aurobindo Phar... "C73584" null null false null 21 null null true false false false
67902 true "06/27/2019 10:... "10/02/2020 05:... "572e672e-b759-... "572e672e-b759-... 19 null "C48336" "CAPSULE" null 1 null "AT146" null "C48333" "BLUE" null "IBUPROFEN 200 ... "IBUPROFEN[IBUP... "FD&C BLUE NO. ... "HOTC" "SCD" "ibuprofen 200 ... 310964 "10/02/2020 03:... "50804-750" 0 null 508040750 50804 750 "Ibuprofen" "active" 20191101 "148c7665-22d5-... null "C42954" null null null "LABELER" "Good Sense (Ge... "C73584" null null false null 2 null null true false false false
75997 true "09/20/2019 09:... "10/02/2020 05:... "6f17cc91-86b3-... "6f17cc91-86b3-... 12 null "C48348" "ROUND" null 1 null "T;12" null "C48325" "WHITE" null "Iloperidone 12... "Iloperidone[Il... "silicon dioxid... "HRX" "SCD" "iloperidone 12... 848732 "10/02/2020 04:... "51672-4184" 0 null 516724184 51672 4184 "Iloperidone" "active" 20190802 "6787555e-8a11-... null "C42998" null null null "LABELER" "Taro Pharmaceu... "C73584" null null false null 1 null null true false false false
1288 true "10/17/2017 05:... "10/17/2017 05:... "02a23e48-f371-... "02a23e48-f371-... 23 null "C48336" "CAPSULE" null 1 null "RX681" null "C48330;C48325" "YELLOW;WHITE" null "FENOPROFEN CAL... "FENOPROFEN CAL... "CROSPOVIDONE;M... "HRX" null "Fenoprofen 200... 1799325 null "54288-129" 0 null 542880129 54288 129 "FENORTHO" "active" 20160614 "91b0ac5b-994c-... null "C25158" null null null "LABELER" "BPI Labs LLC" "C73594" null null false null 2 null null false true false false
53601 true "10/17/2017 09:... "10/02/2020 05:... "d2213ffd-18f6-... "d2213ffd-18f6-... 12 null "C48348" "ROUND" null 1 null "BA;300" null "C48330" "YELLOW" null "BUTALBITAL 50 ... "BUTALBITAL[BUT... "STARCH, CORN;C... "HRX" "SCD" "acetaminophen ... 1249617 "10/02/2020 03:... "68682-306" 0 null 686820306 68682 306 "Butalbital and... "active" 20200402 "18ee9ab7-9e5d-... null "C42998" null "C48676" "CIII" "LABELER" "Oceanside Phar... "C73584" null null false null 3 null null true false false false
1528 true "10/17/2017 05:... "10/02/2020 05:... "3b8a7426-6f1c-... "3b8a7426-6f1c-... 6 null "C48348" "ROUND" null 1 null "M53;LU" null "C48325" "WHITE" null "ESTRADIOL 0.5 ... "ESTRADIOL[ESTR... "COPOVIDONE K25... "HRX" "BPCK" "{28 (estradiol... 1806683 "10/02/2020 03:... "68180-829" 0 null 681800829 68180 829 "AMABELZ" "completed" 20200928 "34d4f5d6-6526-... null "C42998" null null null "LABELER" "Lupin Pharmace... "C73584" null null false null 7 null null true false false false
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19475 true "10/17/2017 05:... "10/02/2020 05:... "be032f1e-c123-... "be032f1e-c123-... 8 null "C48348" "ROUND" null 1 null "M;104" null "C48325" "WHITE" null "MEMANTINE HYDR... "MEMANTINE HYDR... "SILICON DIOXID... "HRX" "SCD" "memantine hydr... 996561 "10/02/2020 04:... "0378-1104" 0 null 3781104 378 1104 "Memantine Hydr... "completed" 20140924 "4b207674-ac13-... null "C42998" null null null "LABELER" "Mylan Pharmace... "C73584" null null false null 5 null null true false false false
78691 true "12/06/2019 06:... "10/02/2020 05:... "3c7ef3cf-f7f9-... "3c7ef3cf-f7f9-... 20 null "C48336" "CAPSULE" null 1 null "P19" null "C48331" "ORANGE" null "ACETAMINOPHEN ... "ACETAMINOPHEN[... "BUTYLATED HYDR... "HOTC" "SCD" "acetaminophen ... 1086997 "10/02/2020 04:... "72476-848" 0 null 724760848 72476 848 "Multi-Symptom ... "active" 20191029 "c4f51f20-dd42-... null "C42954" null null null "LABELER" "Care One (Reta... "C73603" null null false null 1 null null true false false true
38030 true "10/17/2017 05:... "10/02/2020 05:... "e08920c2-04a3-... "e08920c2-04a3-... 9 null "C48348" "ROUND" null 1 null "D99" null "C48325" "WHITE" null "LAMOTRIGINE 25... "LAMOTRIGINE[LA... "MAGNESIUM CARB... "HRX" "SCD" "lamotrigine 25... 311264 "10/02/2020 04:... "65862-362" 0 null 658620362 65862 362 "Lamotrigine" "active" 20191029 "59344318-d7b4-... null "C42893" "34391-3" null null "LABELER" "Aurobindo Phar... "C73584" null null false null 20 null null true false false false
78672 true "12/06/2019 06:... "10/02/2020 05:... "ee1477ed-00c4-... "ee1477ed-00c4-... 17 null "C48345" "OVAL" null 1 null "AAA;1139" null "C48333" "BLUE" null "ACETAMINOPHEN ... "ACETAMINOPHEN[... "ACESULFAME POT... "HOTC" "GPCK" "{8 (acetaminop... 1801964 "10/02/2020 04:... "37808-286" 2 "Acetaminophen,... 378080286 37808 286 "Cold Flu Sever... null 20191004 "941e4166-0964-... null "C42897" null null null "LABELER" "HEB" "C73603" null null false null 2 null null true false false true
71471 true "06/29/2019 04:... "10/02/2020 05:... "a2754618-3df1-... "a2754618-3df1-... 19 null "C48345" "OVAL" null 1 null "600" null "C48325" "WHITE" null "AZITHROMYCIN D... "AZITHROMYCIN D... "CROSCARMELLOSE... "HRX" "SCD" "azithromycin 6... 204844 "10/02/2020 03:... "69452-173" 0 null 694520173 69452 173 "Azithromycin" "active" 20200624 "a8d7dd04-c391-... null "C42931" null null null "LABELER" "Bionpharma Inc... "C73584" null null false null 6 null null true false false false
68841 true "06/27/2019 10:... "10/02/2020 05:... "7af82ca5-ea36-... "7af82ca5-ea36-... 16 null "C48345" "OVAL" null 1 null "1007" null "C48333" "BLUE" null "IBUPROFEN 200 ... "IBUPROFEN[IBUP... "FD&C BLUE NO. ... "HOTC" "SCD" "diphenhydramin... 901814 "10/02/2020 03:... "36800-756" 0 null 368000756 36800 756 "Ibuprofen PM" "active" 20191014 "b181db0f-8b6a-... null "C42954" null null null "LABELER" "TOP CARE (Topc... "C73584" null null false null 2 null null true false false false
7862 true "10/17/2017 05:... "10/02/2020 05:... "edc05451-822e-... "edc05451-822e-... 10 null "C48345" "OVAL" null 1 null "4H2" null "C48325" "WHITE" null "CETIRIZINE HYD... "CETIRIZINE HYD... "STARCH, CORN;H... "HOTC" null null null "10/02/2020 03:... "49738-600" 0 null 497380600 49738 600 "smart sense al... "active" 20160721 "9ceb8c88-c221-... null "C42998" null null null "LABELER" "Kmart Corporat... "C73584" null null false null 3 null null true false false false
69440 true "06/27/2019 10:... "10/02/2020 05:... "facd5359-fc48-... "facd5359-fc48-... 12 null "C48345" "OVAL" null 1 null "20" null "C48332" "BROWN" null "OMEPRAZOLE 20 ... "OMEPRAZOLE[OME... "CARNAUBA WAX;F... "HOTC" "SCD" "omeprazole 20 ... 402014 "10/02/2020 03:... "70000-0356" 0 null 700000356 70000 356 "leader omepraz... "active" 20180316 "facd5359-fc48-... null "C42905" null null null "LABELER" "Cardinal Healt... "C73594" null null false null 1 null null true false false false
53092 true "10/17/2017 09:... "10/02/2020 05:... "b46f11ca-bd09-... "b46f11ca-bd09-... 19 null "C48348" "ROUND" null 1 null "L9Y7" null "C48328;C48331;... "PINK;ORANGE;YE... null "CALCIUM CARBON... "CALCIUM CARBON... "CITRIC ACID MO... "HOTC" "SCD" "calcium carbon... 308915 "10/02/2020 04:... "41163-508" 0 null 411630508 41163 508 "equaline antac... "active" 20181210 "10041dc3-e9bf-... null "C42893" null null null "LABELER" "Supervalu Inc" "C73603" null null false null 2 null null true false false false
4956 true "10/17/2017 05:... "10/02/2020 05:... "f8f84be1-e3b1-... "f8f84be1-e3b1-... 19 null "C48345" "OVAL" null 1 null "44;677" null "C48329" "GREEN" null "ACETAMINOPHEN ... "ACETAMINOPHEN[... "STARCH, CORN;D... "HOTC" "SCD" "acetaminophen ... 1546881 "10/02/2020 04:... "41250-877" 0 null 412500877 41250 877 "Nite time Seve... "active" 20200513 "a4d9c4a9-8e8f-... null "C42931" null null null "LABELER" "Meijer Distrib... "C73603" null null false null 8 null null true false false false
19029 true "10/17/2017 05:... "10/02/2020 05:... "254b2202-b14d-... "254b2202-b14d-... 19 null "C48336" "CAPSULE" null 1 null "APO;10;40" null "C48333" "BLUE" null "Amlodipine bes... "Amlodipine bes... "Cellulose, mic... "HRX" null null null "10/02/2020 03:... "60505-3226" 0 null 605053226 60505 3226 "Amlodipine and... "active" 20170818 "b332e90b-8d4a-... null "C25158" null null null "LABELER" "Apotex Corp." "C73584" null null false null 6 null null true false false false
13396 true "10/17/2017 05:... "10/17/2017 05:... "cec47488-ebad-... "cec47488-ebad-... 8 null "C48348" "ROUND" null 1 null "DF;15" null "C48331" "ORANGE" null "DARIFENACIN 15... "DARIFENACIN[DA... "ANHYDROUS DIBA... "HRX" null "24 HR darifena... 543021 null "35356-272" 0 null 353560272 35356 272 "Enablex" "active" 20120305 "85782ed3-ab22-... "0078-0420" "C42927" "34391-3" null null "LABELER" "Lake Erie Medi... "C73594" null null false 1 3613 null null false true false false

When importing pillbox.csv file initially, an error message actually came up that showed, “…Could not parse ‘10.16’ as dtype Int64 at column 7…”. One way to get around this was to add “ignore_errors” to bypass this error first in order to load the dataset first. This error could be fixed when checking and converting data types for columns.


Initial data wrangling

The Pillbox dataset link from NLM provided a list of column information for users. To quickly see what were the columns in the dataset, we could use “df.glimpse()” to read column names, data types and the first 10 items in each column.

print(df.glimpse())
Rows: 83925
Columns: 55
$ ID                   <Int64> 41846, 8100, 5258, 21271, 77050, 76916, 20016, 67902, 75997, 1288        
$ Enabled?           <Boolean> True, True, True, True, True, True, True, True, True, True               
$ created at            <Utf8> 10/17/2017 05:32:23 PM, 10/17/2017 05:29:56 PM, 10/17/2017 05:29:44 PM, 10/17/2017 05:30:52 PM, 09/20/2019 09:10:47 PM, 09/20/2019 09:10:41 PM, 10/17/2017 05:30:47 PM, 06/27/2019 10:45:17 PM, 09/20/2019 09:09:57 PM, 10/17/2017 05:29:25 PM
$ updated at            <Utf8> 10/02/2020 05:14:07 PM, 10/02/2020 04:59:28 PM, 10/17/2017 05:29:44 PM, 10/02/2020 05:10:28 PM, 10/02/2020 05:14:51 PM, 10/02/2020 05:15:50 PM, 10/02/2020 05:12:38 PM, 10/02/2020 05:04:25 PM, 10/02/2020 05:12:06 PM, 10/17/2017 05:29:25 PM
$ spp                   <Utf8> 471fa2f1-73a0-49be-89f3-d3e2cfdaeca0-0603-5892-0, 116e13c1-ac50-400f-880f-5779f0155b96-59779-074-0, 827ce261-307b-4398-8993-333c08e601fe-49884-129-0, f7f1c99e-1a67-4b34-b1f4-0ac38b9d8006-61480-137-0, ecb28fcb-f0d1-4558-b460-ecabd0f6009e-62332-365-0, 442e41da-24c2-412f-be6b-d549692943fd-70748-132-0, 5f0bdf9d-fa78-45e8-913a-81beff57cf34-65862-123-0, 572e672e-b759-4db1-9e8f-279b1f6f3c51-50804-750-0, 6f17cc91-86b3-42e3-9bf2-935dd360c3eb-51672-4184-0, 02a23e48-f371-448b-92b2-e2d010be1886-54288-129-0
$ setid                 <Utf8> 471fa2f1-73a0-49be-89f3-d3e2cfdaeca0, 116e13c1-ac50-400f-880f-5779f0155b96, 827ce261-307b-4398-8993-333c08e601fe, f7f1c99e-1a67-4b34-b1f4-0ac38b9d8006, ecb28fcb-f0d1-4558-b460-ecabd0f6009e, 442e41da-24c2-412f-be6b-d549692943fd, 5f0bdf9d-fa78-45e8-913a-81beff57cf34, 572e672e-b759-4db1-9e8f-279b1f6f3c51, 6f17cc91-86b3-42e3-9bf2-935dd360c3eb, 02a23e48-f371-448b-92b2-e2d010be1886
$ splsize              <Int64> 16, 10, 7, 11, 6, 9, 14, 19, 12, 23                                      
$ pillbox_size          <Utf8> None, None, None, None, None, None, None, None, None, None               
$ splshape              <Utf8> C48336, C48348, C48346, C48348, C48348, C48348, C48345, C48336, C48348, C48336
$ splshape_text         <Utf8> CAPSULE, ROUND, PENTAGON (5 SIDED), ROUND, ROUND, ROUND, OVAL, CAPSULE, ROUND, CAPSULE
$ pillbox_shape_text    <Utf8> None, None, None, None, None, None, None, None, None, None               
$ splscore             <Int64> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1                                             
$ pillbox_score        <Int64> None, None, 2, None, None, None, None, None, None, None                  
$ splimprint            <Utf8> 5892;V, I2, par;129, LL, L;524, LU;V06, A;73, AT146, T;12, RX681         
$ pillbox_imprint       <Utf8> None, None, None, None, None, None, None, None, None, None               
$ splcolor              <Utf8> C48328, C48331, C48329, C48325, C48325, C48325, C48330, C48333, C48325, C48330;C48325
$ splcolor_text         <Utf8> PINK, ORANGE, GREEN, WHITE, WHITE, WHITE, YELLOW, BLUE, WHITE, YELLOW;WHITE
$ pillbox_color_text    <Utf8> None, None, None, None, None, None, None, None, None, None               
$ spl_strength          <Utf8> TEMAZEPAM 15 mg;, IBUPROFEN 200 mg;, DEXAMETHASONE 6 mg;, Nickel Sulfate 1 [hp_X];Potassium Bromide 1 [hp_X];Sodium Bromide 1 [hp_X];Zinc Sulfate Anhydrous 1 [hp_X];Sulfur 1 [hp_X];, CLONAZEPAM 0.25 mg;, SILDENAFIL CITRATE 50 mg;, RISPERIDONE 3 mg;, IBUPROFEN 200 mg;, Iloperidone 12 mg;, FENOPROFEN CALCIUM 200 mg;
$ spl_ingredients       <Utf8> TEMAZEPAM[TEMAZEPAM];, IBUPROFEN[IBUPROFEN];, DEXAMETHASONE[DEXAMETHASONE];, Nickel Sulfate[NICKEL CATION];Potassium Bromide[BROMIDE ION];Sodium Bromide[BROMIDE ION];Zinc Sulfate Anhydrous[ZINC CATION];Sulfur[Sulfur];, CLONAZEPAM[CLONAZEPAM];, SILDENAFIL CITRATE[SILDENAFIL];, RISPERIDONE[RISPERIDONE];, IBUPROFEN[IBUPROFEN];, Iloperidone[Iloperidone];, FENOPROFEN CALCIUM[FENOPROFEN];
$ spl_inactive_ing      <Utf8> SILICON DIOXIDE;EDETATE DISODIUM;LACTOSE MONOHYDRATE;MAGNESIUM STEARATE;CELLULOSE, MICROCRYSTALLINE;STARCH, CORN;SODIUM LAURYL SULFATE;FD&C BLUE NO. 1;FD&C RED NO. 40;GELATIN;TITANIUM DIOXIDE;BUTYL ALCOHOL;, SILICON DIOXIDE;STARCH, CORN;CROSCARMELLOSE SODIUM;FD&C RED NO. 40;FD&C YELLOW NO. 6;FERRIC OXIDE RED;MICROCRYSTALLINE CELLULOSE;POLYETHYLENE GLYCOL, UNSPECIFIED;POLYVINYL ALCOHOL, UNSPECIFIED;STEARIC ACID;TALC;TITANIUM DIOXIDE;, ANHYDROUS LACTOSE;CELLULOSE, MICROCRYSTALLINE;CROSCARMELLOSE SODIUM;STEARIC ACID;MAGNESIUM STEARATE;FD&C BLUE NO. 1;D&C YELLOW NO. 10;FD&C YELLOW NO. 6;, None, SORBITOL;ASPARTAME;SODIUM LAURYL SULFATE;CROSPOVIDONE;MANNITOL;SILICON DIOXIDE;TALC;MAGNESIUM STEARATE;, ANHYDROUS DIBASIC CALCIUM PHOSPHATE;CELLULOSE, MICROCRYSTALLINE;CROSCARMELLOSE SODIUM;HYPROMELLOSE 2910 (6 MPA.S);MAGNESIUM STEARATE;POLYETHYLENE GLYCOL 400;SILICON DIOXIDE;TITANIUM DIOXIDE;, LACTOSE MONOHYDRATE;MICROCRYSTALLINE CELLULOSE;SILICON DIOXIDE;MAGNESIUM STEARATE;HYPROMELLOSE 2910 (6 MPA.S);TITANIUM DIOXIDE;POLYETHYLENE GLYCOL 400;D&C YELLOW NO. 10;, FD&C BLUE NO. 1;GELATIN;POLYETHYLENE GLYCOL, UNSPECIFIED;POTASSIUM HYDROXIDE;WATER;SORBITOL;SORBITAN;MEDIUM-CHAIN TRIGLYCERIDES;FD&C YELLOW NO. 6;LECITHIN, SOYBEAN;, silicon dioxide;crospovidone (15 MPA.S AT 5%);hypromellose, unspecified;lactose monohydrate;magnesium stearate;microcrystalline cellulose;water;, CROSPOVIDONE;MAGNESIUM STEARATE;SODIUM LAURYL SULFATE;TALC;GELATIN;TITANIUM DIOXIDE;BROWN IRON OXIDE;
$ source                <Utf8> HRX, HOTC, HRX, HOMEO, HRX, HRX, HRX, HOTC, HRX, HRX                     
$ rxtty                 <Utf8> None, SCD, None, None, SCD, SCD, SCD, SCD, SCD, None                     
$ rxstring              <Utf8> None, ibuprofen 200 MG Oral Tablet, Dexamethasone 6 MG Oral Tablet, None, clonazepam 0.25 MG Disintegrating Oral Tablet, sildenafil 50 MG Oral Tablet, risperidone 3 MG Oral Tablet, ibuprofen 200 MG Oral Capsule, iloperidone 12 MG Oral Tablet, Fenoprofen 200 MG Oral Capsule [Fenortho]
$ rxcui                <Int64> None, 310965, 197583, None, 349195, 312950, 312832, 310964, 848732, 1799325
$ RxNorm Update time    <Utf8> 10/02/2020 04:21:55 PM, 10/02/2020 03:07:40 PM, None, 10/02/2020 04:01:35 PM, 10/02/2020 04:25:40 PM, 10/02/2020 04:30:41 PM, 10/02/2020 04:13:39 PM, 10/02/2020 03:30:07 PM, 10/02/2020 04:10:33 PM, None
$ product_code          <Utf8> 0603-5892, 59779-074, 49884-129, 61480-137, 62332-365, 70748-132, 65862-123, 50804-750, 51672-4184, 54288-129
$ part_num             <Int64> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0                                             
$ part_medicine_name    <Utf8> None, None, None, None, None, None, None, None, None, None               
$ ndc9                 <Int64> 6035892, 597790074, 498840129, 614800137, 623320365, 707480132, 658620123, 508040750, 516724184, 542880129
$ ndc_labeler_code     <Int64> 603, 59779, 49884, 61480, 62332, 70748, 65862, 50804, 51672, 54288       
$ ndc_product_code     <Int64> 5892, 74, 129, 137, 365, 132, 123, 750, 4184, 129                        
$ medicine_name         <Utf8> Temazepam, ibuprofen, Dexamethasone, Acunol, CLONAZEPAM, SILDENAFIL, Risperidone, Ibuprofen, Iloperidone, FENORTHO
$ marketing_act_code    <Utf8> completed, active, active, active, active, active, active, active, active, active
$ effective_time       <Int64> 20160406, 20191120, 20120516, 20190909, 20190701, 20191001, 20180924, 20191101, 20190802, 20160614
$ file_name             <Utf8> d912ca54-6569-4e58-a8ef-620eddd03163.xml, 55de9f94-89b2-4bfb-a41d-6660ba6e7a6d.xml, 85a9eebb-be74-43a1-a36f-26ae4c4131aa.xml, 029eaf64-e66f-447e-9ac3-037620370f85.xml, ba5120ce-ed74-40be-936d-c172805d88d1.xml, 17c537d9-b2e6-4d71-a481-c7c7cafdb3a2.xml, 5d2750a4-025a-40a1-97d2-d9447a37afbb.xml, 148c7665-22d5-494c-9add-98428435f392.xml, 6787555e-8a11-481c-b05c-179b0aedcf5c.xml, 91b0ac5b-994c-468f-9331-542b8f92f9a8.xml
$ equal_product_code    <Utf8> None, None, None, None, None, None, None, None, None, None               
$ dosage_form           <Utf8> C25158, C42931, C42998, C42998, C42999, C42931, C42931, C42954, C42998, C25158
$ document_type         <Utf8> None, 34390-5, 34391-3, 34391-3, None, None, 34391-3, None, None, None   
$ dea_schedule_code     <Utf8> C48677, None, None, None, C48677, None, None, None, None, None           
$ dea_schedule_name     <Utf8> CIV, None, None, None, CIV, None, None, None, None, None                 
$ author_type           <Utf8> LABELER, LABELER, LABELER, LABELER, LABELER, LABELER, LABELER, LABELER, LABELER, LABELER
$ author                <Utf8> Qualitest Pharmaceuticals, CVS Pharmacy, Par Pharmaceutical Inc., PLYMOUTH HEALTHCARE PRODUCTS LLC, Alembic Pharmaceuticals Inc., Lupin Pharmaceuticals, Inc., Aurobindo Pharma Limited, Good Sense (Geiss, Destin & Dunn, Inc.), Taro Pharmaceuticals U.S.A., Inc., BPI Labs LLC
$ approval_code         <Utf8> C73584, C73584, C73584, C73614, C73584, C73584, C73584, C73584, C73584, C73594
$ image_source          <Utf8> None, None, NLM, None, None, None, None, None, None, None                
$ splimage              <Utf8> None, None, 498840129, None, None, None, None, None, None, None          
$ has_image          <Boolean> False, False, True, False, False, False, False, False, False, False      
$ epc_match            <Int64> None, None, None, None, None, None, None, None, None, None               
$ version_number       <Int64> 5, 4, 4, 8, 3, 4, 21, 2, 1, 2                                            
$ laberer_code          <Utf8> None, None, None, None, None, None, None, None, None, None               
$ application_number    <Utf8> None, None, None, None, None, None, None, None, None, None               
$ updated            <Boolean> True, True, False, True, True, True, True, True, True, False             
$ stale              <Boolean> False, False, True, False, False, False, False, False, False, True       
$ new                <Boolean> False, False, False, False, False, False, False, False, False, False     
$ Pillbox Value      <Boolean> False, False, True, False, False, False, False, False, False, False      

A relatively simple dataset would be extracted first for these pills data since I was an inexperienced user of Rust. Therefore, I’ve selected only certain columns for this purpose.

df_med = df.select([# shapes of medicines
                    "splshape_text", 
                    # colours of medicines
                    "splcolor_text",
                    # strengths of medicines
                    "spl_strength", 
                    # inactive ingredients/excipients in medicines  
                    "spl_inactive_ing",
                    # dosage forms of medicines e.g. capsules or tablets etc.
                    "dosage_form"]
                  )
df_med
shape: (83925, 5)
splshape_text splcolor_text spl_strength spl_inactive_ing dosage_form
str str str str str
"CAPSULE" "PINK" "TEMAZEPAM 15 m... "SILICON DIOXID... "C25158"
"ROUND" "ORANGE" "IBUPROFEN 200 ... "SILICON DIOXID... "C42931"
"PENTAGON (5 SI... "GREEN" "DEXAMETHASONE ... "ANHYDROUS LACT... "C42998"
"ROUND" "WHITE" "Nickel Sulfate... null "C42998"
"ROUND" "WHITE" "CLONAZEPAM 0.2... "SORBITOL;ASPAR... "C42999"
"ROUND" "WHITE" "SILDENAFIL CIT... "ANHYDROUS DIBA... "C42931"
"OVAL" "YELLOW" "RISPERIDONE 3 ... "LACTOSE MONOHY... "C42931"
"CAPSULE" "BLUE" "IBUPROFEN 200 ... "FD&C BLUE NO. ... "C42954"
"ROUND" "WHITE" "Iloperidone 12... "silicon dioxid... "C42998"
"CAPSULE" "YELLOW;WHITE" "FENOPROFEN CAL... "CROSPOVIDONE;M... "C25158"
"ROUND" "YELLOW" "BUTALBITAL 50 ... "STARCH, CORN;C... "C42998"
"ROUND" "WHITE" "ESTRADIOL 0.5 ... "COPOVIDONE K25... "C42998"
... ... ... ... ...
"ROUND" "WHITE" "MEMANTINE HYDR... "SILICON DIOXID... "C42998"
"CAPSULE" "ORANGE" "ACETAMINOPHEN ... "BUTYLATED HYDR... "C42954"
"ROUND" "WHITE" "LAMOTRIGINE 25... "MAGNESIUM CARB... "C42893"
"OVAL" "BLUE" "ACETAMINOPHEN ... "ACESULFAME POT... "C42897"
"OVAL" "WHITE" "AZITHROMYCIN D... "CROSCARMELLOSE... "C42931"
"OVAL" "BLUE" "IBUPROFEN 200 ... "FD&C BLUE NO. ... "C42954"
"OVAL" "WHITE" "CETIRIZINE HYD... "STARCH, CORN;H... "C42998"
"OVAL" "BROWN" "OMEPRAZOLE 20 ... "CARNAUBA WAX;F... "C42905"
"ROUND" "PINK;ORANGE;YE... "CALCIUM CARBON... "CITRIC ACID MO... "C42893"
"OVAL" "GREEN" "ACETAMINOPHEN ... "STARCH, CORN;D... "C42931"
"CAPSULE" "BLUE" "Amlodipine bes... "Cellulose, mic... "C25158"
"ROUND" "ORANGE" "DARIFENACIN 15... "ANHYDROUS DIBA... "C42927"


Web scraping

This was not planned initially but this might make my life a lot easier if I could scrape the dosage form table found through the Pillbox link, since the dosage form column was full of C-letter codes. These dosage form codes were hard to understand, so once I’ve got the codes along with corresponding dosage forms in texts, the web-scraped information would be converted into a dataframe for further data manipulations.

# Uncomment lines below to install libraries needed for web-scraping
#!pip install requests
#!pip install beautifulsoup4
Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

I’ve opted for using Beautiful Soup as the web-scraping library in Python, along with the requests library to be able to make a URL request call to retrieve web information. There were of course many other tools available as well. A caveat to be taken into consideration was that when web-scraping, it was always recommended to check whether the information being scraped were under a specific copyright license and so on. In this case, I’ve checked that the dosage form table link - https://www.fda.gov/industry/structured-product-labeling-resources/dosage-forms was from US FDA and it was stated that the information (both texts and graphs) were not copyrighted (unless otherwise stated, for this particular web page, there was nothing stated along those lines), but a link to this webpage should be provided so that readers could access most current information in the future.

Send web requests
# Specify URL address with information intended for web-scraping
url = "https://www.fda.gov/industry/structured-product-labeling-resources/dosage-forms"
# Request the web information via requests library & save under a data object
data = requests.get(url)
Parse web content
# Parse the web content from the URL link by using Beautiful Soup
soup = BeautifulSoup(data.content, "html.parser")

Transform web-scraped data into dataframe


Using Pandas dataframe library
Pandas.append()

The original pandas.append() method was going to be deprecated in future versions of Pandas. This old method was shown as below:

```{python}
# Create an empty dataframe with columns named "Dosage_form" & "Code"
dosage_form = pd.DataFrame(columns = ["Dosage_form", "Code"])

# Create a loop to find all <tr> tags in the soup object (scraped html content)
for row in soup.find_all("tr"): 
  # Set the columns to contain contents under <td> tags by searching all rows
  col = row.find_all("td") 
    # if columns are not an empty list, 
    # add the texts under columns in specified orders
    if (col != []): 
    dosage = col[0].text 
    code = col[1].text 

# Append each text item into the dosage_form dataframe
dosage_form = dosage_form.append({"Dosage_form":dosage, "Code":code}, ignore_index = True)

# Show dataframe
dosage_form
```

This method might still work currently, however, the newer and recommended methods would be to use the pandas.concat() method as shown below.

Pandas.concat()

First example:

# Create an empty dictionary
dict = []

# Create a loop to iterate through html tags from the soup (scraped html content)
# find all html tags that began with <tr>
for row in soup.find_all("tr"):
    # each column would hold the items under <td> tags
    col = row.find_all("td")
    if (col != []): 
        # dosage form in column 1
        dosage = col[0].text
        # code in column 2
        code = col[1].text
        # Append each dosage form & code into the dictionary
        dict.append({"DosageForm": dosage, "dosage_form": code})

# Check if the loop was iterating through the html tags
# and that it was appending each dosage form & code into the dictionary 
# Uncomment line below
#print(dict)

# Create an empty dataframe with the column names wanted
dosage_form = pd.DataFrame(columns = ["DosageForm", "dosage_form"])

# Concatenate the dosage_form dataframe with the dataframe converted from dict
df_new = pd.concat([dosage_form, pd.DataFrame.from_dict(dict)])

# Print the combined dataframe df_new
df_new
DosageForm dosage_form
0 AEROSOL C42887
1 AEROSOL, FOAM C42888
2 AEROSOL, METERED C42960
3 AEROSOL, POWDER C42971
4 AEROSOL, SPRAY C42889
... ... ...
153 TAMPON C47892
154 TAPE C47897
155 TINCTURE C43000
156 TROCHE C43001
157 WAFER C43003

158 rows × 2 columns

Pandas.from_dict()

Second example by using pd.from_dict() method, which might have less lines of codes:

# Create an empty dictionary
dict = []

# Create a loop to iterate through html tags from the soup (scraped html content)
# find all html tags that began with <tr>
for row in soup.find_all("tr"):
    # each column would hold the items under <td> tags
    col = row.find_all("td")
    if (col != []): 
        # dosage form in column 1
        dosage = col[0].text
        # code in column 2
        code = col[1].text
        # Append each dosage form & code into the dict
        dict.append({"DosageForm": dosage, "dosage_form": code})

# Check if the loop was working to iterate through the html tags
# and that it was appending each dosage form & code into the dictionary 
# Uncomment line below
#print(dict)

# Convert the dictionary into a dataframe
df_new = pd.DataFrame.from_dict(dict)

# Print the dataframe df_new
df_new
DosageForm dosage_form
0 AEROSOL C42887
1 AEROSOL, FOAM C42888
2 AEROSOL, METERED C42960
3 AEROSOL, POWDER C42971
4 AEROSOL, SPRAY C42889
... ... ...
153 TAMPON C47892
154 TAPE C47897
155 TINCTURE C43000
156 TROCHE C43001
157 WAFER C43003

158 rows × 2 columns

Using Polars dataframe library

Polars dataframe library also had a from_dict() method that could convert dictionary into a dataframe as shown below:

# Create an empty dictionary
dict = []

# Create a loop to iterate through html tags from the soup (scraped html content)
# find all html tags that began with <tr>
for row in soup.find_all("tr"):
    # each column would hold the items under <td> tags
    col = row.find_all("td")
    if (col != []): 
        # dosage form in column 1
        dosage = col[0].text
        # code in column 2
        code = col[1].text
        # Append each dosage form & code into the dict
        dict.append({"DosageForm": dosage, "dosage_form": code})

# Check if the loop was iterating through the html tags
# and that it was also appending each dosage form & code into the dictionary 
# Uncomment line below
#print(dict)

# Convert dictionary to dataframe
new_df = pl.from_dicts(dict)
new_df
shape: (158, 2)
DosageForm dosage_form
str str
"AEROSOL" "C42887"
"AEROSOL, FOAM" "C42888"
"AEROSOL, METER... "C42960"
"AEROSOL, POWDE... "C42971"
"AEROSOL, SPRAY... "C42889"
"BAR, CHEWABLE" "C42892"
"BEAD" "C42890"
"CAPSULE" "C25158"
"CAPSULE, COATE... "C42895"
"CAPSULE, COATE... "C42896"
"CAPSULE, COATE... "C42917"
"CAPSULE, DELAY... "C42902"
... ...
"TABLET, MULTIL... "C42964"
"TABLET, MULTIL... "C42963"
"TABLET, ORALLY... "C42999"
"TABLET, ORALLY... "C61006"
"TABLET, SOLUBL... "C42985"
"TABLET, SUGAR ... "C42992"
"TABLET WITH SE... "C147579"
"TAMPON" "C47892"
"TAPE" "C47897"
"TINCTURE" "C43000"
"TROCHE" "C43001"
"WAFER" "C43003"

Preparation of dataframe for data visualisation

Once we have the scraped dataframe ready, we could combine it with our original dataframe from the .csv file (the idea was basically doing dataframe join). Then the dosage form code column could be removed to make it easier to read.

# Join the two dataframes together
df_final = df_med.join(new_df, on = "dosage_form")
# Drop the column dosage_form which had codes of each dosage form 
df_final = df_final.drop("dosage_form")
df_final
shape: (83925, 5)
splshape_text splcolor_text spl_strength spl_inactive_ing DosageForm
str str str str str
"CAPSULE" "PINK" "TEMAZEPAM 15 m... "SILICON DIOXID... "CAPSULE"
"ROUND" "ORANGE" "IBUPROFEN 200 ... "SILICON DIOXID... "TABLET, FILM C...
"PENTAGON (5 SI... "GREEN" "DEXAMETHASONE ... "ANHYDROUS LACT... "TABLET"
"ROUND" "WHITE" "Nickel Sulfate... null "TABLET"
"ROUND" "WHITE" "CLONAZEPAM 0.2... "SORBITOL;ASPAR... "TABLET, ORALLY...
"ROUND" "WHITE" "SILDENAFIL CIT... "ANHYDROUS DIBA... "TABLET, FILM C...
"OVAL" "YELLOW" "RISPERIDONE 3 ... "LACTOSE MONOHY... "TABLET, FILM C...
"CAPSULE" "BLUE" "IBUPROFEN 200 ... "FD&C BLUE NO. ... "CAPSULE, LIQUI...
"ROUND" "WHITE" "Iloperidone 12... "silicon dioxid... "TABLET"
"CAPSULE" "YELLOW;WHITE" "FENOPROFEN CAL... "CROSPOVIDONE;M... "CAPSULE"
"ROUND" "YELLOW" "BUTALBITAL 50 ... "STARCH, CORN;C... "TABLET"
"ROUND" "WHITE" "ESTRADIOL 0.5 ... "COPOVIDONE K25... "TABLET"
... ... ... ... ...
"ROUND" "WHITE" "MEMANTINE HYDR... "SILICON DIOXID... "TABLET"
"CAPSULE" "ORANGE" "ACETAMINOPHEN ... "BUTYLATED HYDR... "CAPSULE, LIQUI...
"ROUND" "WHITE" "LAMOTRIGINE 25... "MAGNESIUM CARB... "TABLET, CHEWAB...
"OVAL" "BLUE" "ACETAMINOPHEN ... "ACESULFAME POT... "TABLET, COATED...
"OVAL" "WHITE" "AZITHROMYCIN D... "CROSCARMELLOSE... "TABLET, FILM C...
"OVAL" "BLUE" "IBUPROFEN 200 ... "FD&C BLUE NO. ... "CAPSULE, LIQUI...
"OVAL" "WHITE" "CETIRIZINE HYD... "STARCH, CORN;H... "TABLET"
"OVAL" "BROWN" "OMEPRAZOLE 20 ... "CARNAUBA WAX;F... "TABLET, DELAYE...
"ROUND" "PINK;ORANGE;YE... "CALCIUM CARBON... "CITRIC ACID MO... "TABLET, CHEWAB...
"OVAL" "GREEN" "ACETAMINOPHEN ... "STARCH, CORN;D... "TABLET, FILM C...
"CAPSULE" "BLUE" "Amlodipine bes... "Cellulose, mic... "CAPSULE"
"ROUND" "ORANGE" "DARIFENACIN 15... "ANHYDROUS DIBA... "TABLET, EXTEND...


Here, we could save the intended dataframe for data visualisation as a .csv file, so that further data wrangling and mining could be done later for part 2. This also avoided making request calls to the website again and again by extracting the scraped web information as a stand-alone file which could be imported when needed later on.

# Save the inital cleaned dataframe as .csv file
# for use in a new .ipynb file with Rust kernel
df_final.write_csv("pills.csv", sep = ",")