Table scraping from PDF

Using tabula-py in Python

Data analytics projects
Python
Long COVID
Author

Jennifer HY Lin

Published

September 15, 2022

Quick introduction

Recently I wanted to continue my long COVID exploration, and I realised I had never tried scraping a PDF before. Combining the two ideas led to this little piece of work as another post.

A quick heads-up: Java must be installed for tabula-py to work, since tabula-py is a Python wrapper for tabula-java. I used Homebrew to install Java, but there are several other options available online, which I’ll leave for interested readers to explore. Once it’s installed, we can check the Java version to confirm the installation worked.

# Check the version of Java
!java -version
openjdk version "17.0.4.1" 2022-08-12
OpenJDK Runtime Environment Temurin-17.0.4.1+1 (build 17.0.4.1+1)
OpenJDK 64-Bit Server VM Temurin-17.0.4.1+1 (build 17.0.4.1+1, mixed mode, sharing)
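For anyone who would rather run this check from plain Python than from a notebook shell command, the following is a minimal sketch using only the standard library. The helper name `java_version_banner` is made up for illustration and is not part of tabula-py.

```python
import shutil
import subprocess


def java_version_banner(java_cmd="java"):
    """Return the text printed by `java -version`, or None if Java is unavailable."""
    java_path = shutil.which(java_cmd)
    if java_path is None:
        return None  # no such executable on PATH
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return None  # a launcher exists, but no working Java runtime behind it
    # `java -version` writes its banner to stderr rather than stdout
    return (result.stderr or result.stdout).strip()


banner = java_version_banner()
print(banner if banner else "Java not found; tabula-py will not be able to run.")
```

Returning `None` instead of raising keeps the check easy to use as a guard at the top of a notebook.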
Installing and importing libraries

Next, we install the libraries needed for scraping table data from a PDF; in this case, I ended up using only one.

!pip install -q tabula-py
# import read_pdf from the tabula library
from tabula import read_pdf
Data source

The table comes from this journal paper: Healey Q, Sheikh A, Daines L, Vasileiou E. Symptoms and signs of long COVID: A rapid review and meta-analysis. J Glob Health 2022;12:05014 (Creative Commons Attribution 4.0 International Public License).


Table scraping

First, I tried scraping the table from page 4 of the journal paper, which only captured about half of the table. I then added a line of code to specify the scraping area¹ on the PDF page, using measurements in inches (these can be read off with the in-built PDF tool). Note that tabula-py’s documentation gives the area coordinates in points (1/72 of an inch), so inch measurements would normally need converting first.

One thing I wasn’t sure about: the tabula-py documentation states that the default area is the full page, but that did not appear to be the case here (only half of the table showed up). Also, the journal paper had its tables printed in landscape layout rather than the more common portrait style, so it wasn’t clear whether the landscape layout was making things harder.

# Specify the scraping area (top, left, bottom, right)
test_area = "10.05,6.60,10.05,6.60"
df = read_pdf("Journal.pdf", pages="4", area=test_area, guess=False, stream=True, pandas_options={'header': None})
df
[                                                    0
 0   VREIESWEAPROCINHT TSHEME 1:  Healey et al. COV...
 1    Table 1. Characteristics of the included studies
 2                             Author Hospital (%) Age
 3   (country) {ICU (%)} (years) Comorbiditiestime ...
 4   41% hypertension, 15% diabetes, Generalised/MS...
 5   11% obesity, 11% endocrine disease, Respirator...
 6   10% malignancy, 9% IHD, 8% Neuropsychiatric 43...
 7    Bellan (Italy) dyslipidaemia, 7% AF, 6% COPD, 6%
 8   100 {12} 61 107 ENT 5% gustatory dysfunction, ...
 9             [19] CKD, 6% haematological disease, 5%
 10             anxiety/depression, 4% cerebrovascular
 11  disease, 3% liver disease, 3% VTE, 2% Gastroin...
 12                         IBD, 2% autoimmune disease
 13  Generalised/MSK Fatigue, myalgia, arthralgia, ...
 14  Respiratory Dyspnoea, cough, chest pain, sputu...
 15       Bliddal 28% allergy, 17% osteoarthritis, 15%
 16  Neuropsychiatric Memory issues, concentration ...
 17  (Denmark) 0 50 hypertension, 9% thyroid diseas...
 18  ENT Olfactory dysfunction, gustatory dysfuncti...
 19                                        [20] asthma
 20  Gastrointestinal Diarrhoea, anorexia, abdomina...
 21                              Others Red runny eyes
 22        Chiesa- 6% hypertension, 6% hypothyroidism,
 23  Estomba Not stated 41 6% asthma, 4% autoimmune...
 24          (Italy) [21] 3% diabetes, 2% IHD, 1% COPD
 25                                             Cousyn
 26  0 35 Not stated 60 ENT 16.8% olfactory dysfunc...
 27                                      (France) [22]
 28  Generalised/MSK 45% fatigue, 15% myalgia, 3% f...
 29  33% dyspnoea, 33% cough. Normal spirometry, no...
 30                                        Respiratory
 31                                   distance on 6MWT
 32  Neuropsychiatric 18% cognitive issues, 15% hea...
 33          Daher 59% hypertension, 25% diabetes, 22%
 34  ENT 12% olfactory dysfunction, 12% rhinorrhoea...
 35   (Germany) 100 64 CKD, 19% IHD, 13% asthma, 9% 56
 36  9% diarrhoea, 6% nausea, 3% abdominal pain, no...
 37  18% angina, normal left ventricular function, ...
 38                                     Cardiovascular
 39                                         biomarkers
 40  Normal FBC, normal coagulation screen, raised ...
 41                                   Other biomarkers
 42  U&Es, normal CRP, normal procalcitonin, normal...
 43  26% hypertension, 12% diabetes, Generalised/MS...
 44                                         Fernandez-
 45                 12% IHD, 7% asthma, 5% obesity, 4%
 46                        de-Las-Penas 100 {7} 61 340
 47  COPD, 2% cerebrovascular disease, 2% Respirato...
 48                                       (Spain) [23]
 49                            rheumatological disease
 50  47% hypertension, 42% dyslipidaemia, Generalis...
 51                                           Froidure
 52  28% obesity, 22% diabetes, 9% Abnormal chest C...
 53                           (Belgium) 100 {22} 60 98
 54  asthma, 4% COPD, 2% lung cancer, Respiratory t...
 55                                               [24]
 56  1% ILD cough, 4% chest tightness, normal spiro...
 57  2022  •  Vol. 12  •  05014 4 www.jogh.org •  d...]
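
A note on units: tabula documents the `area` values as points (1/72 of an inch), measured from the top-left corner of the page, and a usable area needs bottom greater than top and right greater than left. The sketch below converts a box measured in inches into points. The helper name `area_in_points` and the example measurements are hypothetical and were not taken from the journal PDF.

```python
POINTS_PER_INCH = 72  # PDF coordinates use points: 1 point = 1/72 inch


def area_in_points(top_in, left_in, bottom_in, right_in):
    """Convert a (top, left, bottom, right) box from inches to points."""
    if bottom_in <= top_in or right_in <= left_in:
        raise ValueError("bottom must exceed top, and right must exceed left")
    return [
        round(value * POINTS_PER_INCH, 2)
        for value in (top_in, left_in, bottom_in, right_in)
    ]


# Hypothetical box: half an inch in from the top-left corner,
# extending to 6.0 inches down and 9.5 inches across (a landscape page).
area = area_in_points(0.5, 0.5, 6.0, 9.5)
print(area)

# The result could then be passed on, for example:
# df_list = read_pdf("Journal.pdf", pages="4", area=area, stream=True, guess=False)
```

tabula-py also has a `relative_area` flag that treats the values as percentages of the page instead, which may be easier when exact measurements are hard to get.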

Once the above worked, I moved on to scraping the whole table across pages 4 to 6 of the PDF and saved the result as a .csv file, which appeared automatically in the working directory.

import tabula
test_area = "10.05,6.60,10.05,6.60"
# Convert and save scraped data into specified file format
tabula.convert_into("Journal.pdf", "Full_table_scraped.csv", output_format="csv", pages="4-6", area=test_area, guess=False, stream=True)
!cat Full_table_scraped.csv
VREIESWEAPROCINHT TSHEME 1:  Healey et al. COVID-19 PANDEMIC
Table 1. Characteristics of the included studies
Author Hospital (%) Age
(country) {ICU (%)} (years) Comorbiditiestime (days) Follow-up Body system Results
"41% hypertension, 15% diabetes, Generalised/MSK 5.9% myalgia, 5.9% arthralgia"
"11% obesity, 11% endocrine disease, Respiratory 5.5% dyspnoea, 2.5% cough, 0.4% chest pain, 51.6% reduced DLCO, normal spirometry"
"10% malignancy, 9% IHD, 8% Neuropsychiatric 43% PTSD symptoms"
"Bellan (Italy) dyslipidaemia, 7% AF, 6% COPD, 6%"
"100 {12} 61 107 ENT 5% gustatory dysfunction, 4.6% olfactory dysfunction"
"[19] CKD, 6% haematological disease, 5%"
"anxiety/depression, 4% cerebrovascular"
"disease, 3% liver disease, 3% VTE, 2% Gastrointestinal 1.3% diarrhoea"
"IBD, 2% autoimmune disease"
"Generalised/MSK Fatigue, myalgia, arthralgia, chills, fever"
"Respiratory Dyspnoea, cough, chest pain, sputum production"
"Bliddal 28% allergy, 17% osteoarthritis, 15%"
"Neuropsychiatric Memory issues, concentration issues, headache"
"(Denmark) 0 50 hypertension, 9% thyroid disease, 8% 84"
"ENT Olfactory dysfunction, gustatory dysfunction, sore throat, rhinorrhoea, sneezing"
[20] asthma
"Gastrointestinal Diarrhoea, anorexia, abdominal pain, nausea"
Others Red runny eyes
"Chiesa- 6% hypertension, 6% hypothyroidism,"
"Estomba Not stated 41 6% asthma, 4% autoimmune disease, 47 ENT 51% olfactory dysfunction"
"(Italy) [21] 3% diabetes, 2% IHD, 1% COPD"
Cousyn
"0 35 Not stated 60 ENT 16.8% olfactory dysfunction, 9.6% gustatory dysfunction"
(France) [22]
"Generalised/MSK 45% fatigue, 15% myalgia, 3% fever, slight pain/discomfort"
"33% dyspnoea, 33% cough. Normal spirometry, normal ABG, reduced DLCO, reduced"
Respiratory
distance on 6MWT
"Neuropsychiatric 18% cognitive issues, 15% headache, mild depression, subthreshold anxiety"
"Daher 59% hypertension, 25% diabetes, 22%"
"ENT 12% olfactory dysfunction, 12% rhinorrhoea, 9% gustatory dysfunction, 9% sore throat"
"(Germany) 100 64 CKD, 19% IHD, 13% asthma, 9% 56"
"9% diarrhoea, 6% nausea, 3% abdominal pain, normal LFTs[17] COPD, 9% AF, 9% heart failureGastrointestinal"
"18% angina, normal left ventricular function, normal right ventricular function, normal cardiac"
Cardiovascular
biomarkers
"Normal FBC, normal coagulation screen, raised ferritin, potentially raised D-dimer, normal"
Other biomarkers
"U&Es, normal CRP, normal procalcitonin, normal TFTs, normal IL-6"
"26% hypertension, 12% diabetes, Generalised/MSK 61.2% fatigue"
Fernandez-
"12% IHD, 7% asthma, 5% obesity, 4%"
de-Las-Penas 100 {7} 61 340
"COPD, 2% cerebrovascular disease, 2% Respiratory 23.3% dyspnoea, 6.5% chest pain, 2.5% cough"
(Spain) [23]
rheumatological disease
"47% hypertension, 42% dyslipidaemia, Generalised/MSK 25% fatigue"
Froidure
"28% obesity, 22% diabetes, 9% Abnormal chest CT: 67% ground glass opacities, 44% reticulations, 20% fibrotic lesions/"
(Belgium) 100 {22} 60 98
"asthma, 4% COPD, 2% lung cancer, Respiratory traction bronchiectasis, 7% consolidations. 46% reduced DLCO, 35% dyspnoea, 10% dry"
[24]
"1% ILD cough, 4% chest tightness, normal spirometry"
2022  •  Vol. 12  •  05014 4 www.jogh.org •  doi: 10.7189/jogh.12.05014
"",,,,,,,Symptoms and signs of long COVID: A rapid review
Table 1. continued,,,,,,,
Author (country) Hospital (%) {ICU (%)},Age (years),,,Comorbidities,Follow-up time (days),,Body system Results
"",,,,,,,Generalised/MSK 17% fatigue
Gerhards,,,,,,,
"",,,,,,,"Neuropsychiatric Depression, concentration issues"
(Germany) 10,46,,,Not stated,183,,
"",,,,,,,ENT 27% olfactory/gustatory dysfunction
[25],,,,,,,
"",,,,,,,Others Alopecia
"",,,,,,,"Generalised/MSK Fatigue, arthralgia, myalgia"
"",,,,"38% hypertension, 22% obesity, 19%",,,
Ghosn,,,,,,,"Respiratory Dyspnoea, cough"
100 {29},61,,,"diabetes, 18% IHD, 10% COPD, 7%",194,,
(France) [26],,,,,,,Neuropsychiatric Headache
"",,,,"CKD, 7% malignancy, 1% liver disease",,,
"",,,,,,,"ENT Rhinorrhoea, olfactory dysfunction, gustatory dysfunction, sore throat"
"",,,,,,,"62% abnormal chest CT: 35% fibrotic-like changes, 27% ground glass opacities/interstitial"
Han (China),,,,"28% hypertension, 14% respiratory",,,"thickening, nodules/masses, interlobar pleural traction, pulmonary atelectasis and"
100,54,,,,175,,Respiratory
[27],,,,"disease, 11% diabetes",,,"bronchiectasis. 26% reduced DLCO, 14% mild dyspnoea, 10% sputum production, 6.1% dry"
"",,,,,,,cough
"",,,,,,,"Generalised/MSK 50% fatigue, 35.7% arthralgia, 21.4% myalgia"
Holmes,,,,,,,"Respiratory 28.6% cough, 25% dyspnoea, 3.6% chest pain"
(Australia) 0,57,,,Not stated,183,,Neuropsychiatric 10.7% headache
[28],,,,,,,"ENT 28.6% olfactory dysfunction, 14.3% rhinorrhoea"
"",,,,,,,Gastrointestinal No abdominal pain
"",,,,"49% obesity, 48% hypertension,",,,"Generalised/MSK 44.8% fatigue, 21.3% myalgia, 15.8% arthralgia, 1.1% fever, 1.1% ulcer"
"",,,,"28% diabetes, 12% IHD, 11%",,,"Respiratory 31.7% dyspnoea, 25.1% cough, 14.8% sputum production"
"",,,,"dyslipidaemia, 10% asthma, 10%",,,"Neuropsychiatric 12.6% headache, 8.7% cognitive issues"
Jacobs (USA),,,,"malignancy, 5% arrhythmia, 4%",,,
100,57,,,,35,,"ENT 9.8% gustatory dysfunction, 9.3% olfactory dysfunction"
[29],,,,"COPD, 4% hypothyroidism, 4%",,,
"",,,,,,,Gastrointestinal 3.8% diarrhoea
"",,,,"depression, anxiety or schizophrenia,",,,
"",,,,"3% heart failure, 3% sleep apnoea, 2%",,,"Others 8.2% eye irritation, 1.1% ulcer"
"",,,,VTE,,,
"",,,,"36% obesity, 29% hypertension,",,,"Generalised/MSK 63% fatigue, 35% myalgia"
Leth,,,,"12% malignancy, 10% IHD, 8%",,,"Respiratory 53% dyspnoea, 24% cough, 20% chest pain, 12% sputum production"
(Denmark) 100 {12},58,,,"asthma, 8% COPD, 4% diabetes, 4%",128,,"Neuropsychiatric 45% concentration issues, 27% headache, 27% paraesthesia"
[30],,,,"hyperthyroidism, 2% cerebrovascular",,,"ENT 31% gustatory dysfunction, 27% olfactory dysfunction, 10% sore throat"
"",,,,disease,,,"Gastrointestinal 10% abdominal pain, 8% diarrhoea, 8% nausea, 4% anorexia"
"",,,,,,,"Generalised/MSK 33% fatigue, 1.4% arthralgia, 0.6% myalgia"
"",,,,,,,"Respiratory 8.5% cough, 7% dyspnoea, 0.8% chest pain"
Mahmud,,,,,,,
"",,,,,,,"3.9% circadian rhythm disorders, 3.4% headache, 2.3% sleep disturbance, 1.4% adjustment"
(Bangladesh) Not stated,40,,,"15% hypertension, 14% diabetes",30,,Neuropsychiatric
"",,,,,,,disorder
[18],,,,,,,
"",,,,,,,"ENT 2.3% vertigo, 2% olfactory dysfunction"
"",,,,,,,Cardiovascular 1.4% palpitation
"",,,,,,,RESEARCH THEME 1:
"",,,,,,,VCOIEVWIDP-O1I9N PTASNDEMIC
www.jogh.org • doi: 10.7189/jogh.12.05014,,,,5,,,2022  •  Vol. 12  •  05014
RVEIESWEAPROCINHT TSHEME 1:  Healey et al. COVID-19 PANDEMIC
Table 1. continued
Author Hospital (%) Age Follow-up
(country) {ICU (%)} (years) Comorbidities time (days) Body system Results
Otte
"42.3% subjective olfactory dysfunction, 26.9% objective olfactory dysfunction (discrimination"
(Germany) 0 45 Not stated 201 ENT
and identification issues)
[31]
"Generalised/MSK 13.1% fatigue, 8.2% rheumatological issues"
"23% hypertension, 16% obesity, 6% Respiratory 6% dyspnoea, 2% cough, 0.8% chest pain"
"Peghin (Italy) diabetes, 4% respiratory disease, 1% Neuropsychiatric 9.6% neurological disorders, 4.9% psychiatric disorders, 2.7% headache"
26 53 191
"[32] IHD, 2% liver disease, 1% depression/ ENT 10.4% olfactory/gustatory dysfunction,"
"anxiety, 0% CKD Gastrointestinal 1.5% gastrointestinal disorders"
"Others 3.7% alopecia, 3.4% cutaneous manifestations, 0.3% ocular symptoms"
"Generalised/MSK 24% night sweats, 0% fever"
"63% abnormal chest CT: ground-glass opacities, reticular lesions, consolidations, bronchial"
"Respiratory dilation. 36% dyspnoea, abnormal spirometry: 22% reduced FVC, 22% reduced FEV1, normal"
"40% cardiovascular disease, 30% FEV1/FVC. 21% reduced DLCO, 17% cough"
"hypertension, 19% dyslipidaemia,"
Sonnweber Neuropsychiatric 22% sleep disorders
"75 57 17% diabetes, 7% asthma, 7% CKD, 103"
(Austria) [16] ENT 19% olfactory dysfunction
"6% COPD, 6% liver disease, 6%"
"malignancy, 1% ILD Gastrointestinal 9% diarrhoea/vomiting"
"97% normal LVEF, 55% diastolic dysfunction on echo, 23% raised NT-proBNP, 10%"
Cardiovascular
"pulmonary hypertension, 1% pericardial effusion"
"Other biomarkers Raised D-dimer, potentially raised ferritin, normal CRP, normal procalcitonin, normal IL-6"
"Generalised/MSK Fatigue, myalgia, fever"
"Respiratory Dyspnoea, cough, chest pain"
"Sudre (UK, 26% obesity, 14% respiratory disease, Neuropsychiatric Headache, paraesthesia, numbness, concentration/ memory issues"
"USA, Sweden) 14 42 10% asthma, 3% diabetes, 2% IHD, 84"
"[33] 1% CKD ENT Olfactory dysfunction, sore throat, hoarse voice, tinnitus, earache"
"Gastrointestinal Diarrhoea, abdominal pain"
Cardiovascular Palpitations/tachycardia
"Vaira (Italy) 29% obesity, 27% IHD, 15%"
"23 51 60 ENT 21% olfactory dysfunction, 7.9% gustatory dysfunction"
"[34] respiratory disease, 11% diabetes"
"ICU – intensive care unit, IHD – ischaemic heart disease, AF – atrial fibrillation, COPD – chronic obstructive pulmonary disease, CKD – chronic kidney disease, VTE – venous thromboembolism, IBD – inflammatory bowel"
"disease, NS – not stated, ILD – interstitial lung disease, MSK – musculoskeletal, ENT – ear, nose, and throat, OGD – olfactory-gustatory dysfunction, DLCO – diffusing capacity for carbon monoxide, PTSD – posttraumatic"
"stress disorder, ABG – arterial blood gas, 6MWT – 6-min walk test, LFT – liver function test, FBC – full blood count, U&E – urea and electrolyte, CRP – c-reactive protein, TFT – thyroid function test, IL-6 – interleukin-6,"
"FVC – forced vital capacity, FEV1 – forced expiratory volume in one second, NT-proBNP – N-terminal pro B-type natriuretic peptide"
2022  •  Vol. 12  •  05014 6 www.jogh.org •  doi: 10.7189/jogh.12.05014
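
The output above mixes two shapes: pages 4 and 6 came out as one field per line, while page 5 was split into eight columns. As a first tidying step, the sketch below reads a few sample lines (copied from the output above) with the standard `csv` module, drops the running headers and footers, and counts how many fields each remaining row has. The marker strings in `NOISE_MARKERS` were picked by eye from this particular output and are an assumption, not a general rule.

```python
import csv
import io
from collections import Counter

# A few lines copied from the scraped CSV, used here as sample input.
SAMPLE = '''Table 1. Characteristics of the included studies
"Bellan (Italy) dyslipidaemia, 7% AF, 6% COPD, 6%"
2022  •  Vol. 12  •  05014 4 www.jogh.org •  doi: 10.7189/jogh.12.05014
Table 1. continued,,,,,,,
(Germany) 10,46,,,Not stated,183,,
"",,,,,,,Others Alopecia
'''

# Substrings that identify running headers and footers (chosen by eye).
NOISE_MARKERS = ("www.jogh.org", "Table 1.", "COVID-19 PANDEMIC")


def load_rows(text):
    """Parse CSV text and drop blank rows and header/footer rows."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        joined = " ".join(row)
        if not joined.strip():
            continue  # nothing but empty fields
        if any(marker in joined for marker in NOISE_MARKERS):
            continue  # running header, footer or table caption
        rows.append(row)
    return rows


rows = load_rows(SAMPLE)
width_counts = Counter(len(row) for row in rows)
print(width_counts)
```

To run this on the real file, the same function could be given the contents of `Full_table_scraped.csv` instead of `SAMPLE`. The field counts then show which rows came from the page that was split into columns and which still need splitting by hand.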
Short summary

The PDF scraping exercise only worked to a certain degree², as the data did not arrive in a proper tabular format. I also read several online resources and looked into tabula-py and tabula-java. Their GitHub repositories list existing issues for tables that have merged cells, empty cells or no column lines (which was what I had in this case), all of which tend to result in jumbled or merged rows or columns. Scraping tends to work better when the tables in the PDF are already in a proper table format, i.e. with columns and rows marked by lines.

Nevertheless, the purpose of scraping the table data was achieved: after checking, the full data were there, just not in a clean and tidy state. The next post, “Long COVID - an update”, takes us to the next stage to see what this tabular data tells us about long COVID (all done in R).
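
One option that might help with tables lacking column lines is the `columns` parameter of tabula-py’s `read_pdf`, which takes the x-coordinates of the column boundaries for stream mode (for tables with ruled lines, `lattice=True` is the counterpart). The sketch below only assembles the keyword arguments; it was not run against the journal PDF, the helper name `stream_options` is made up, and every coordinate in it is a hypothetical placeholder in points.

```python
def stream_options(area_pts, column_edges_pts):
    """Build keyword arguments for read_pdf in stream mode with explicit columns."""
    top, left, bottom, right = area_pts
    edges = sorted(column_edges_pts)
    if any(not (left < x < right) for x in edges):
        raise ValueError("column boundaries must fall inside the area")
    return {
        "area": list(area_pts),              # top, left, bottom, right in points
        "columns": edges,                    # x-coordinates separating the columns
        "stream": True,                      # whitespace-based table detection
        "guess": False,                      # use the given area, do not auto-detect
        "pandas_options": {"header": None},  # keep the first row as data
    }


# Hypothetical coordinates; the real ones would be measured on the PDF page.
options = stream_options([36, 36, 432, 684], [120, 170, 210, 330, 390, 470])
print(options)

# With Java and tabula-py installed, the call would look like:
# from tabula import read_pdf
# tables = read_pdf("Journal.pdf", pages="4-6", **options)
```

Whether fixed boundaries would work across all three pages is an open question, since the column positions may differ from page to page.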

Footnotes

  1. Thanks to Stack Overflow, where I found this solution pieced together from several different scenarios and comments.

  2. Or it could be my ignorance of other, better methods - please leave a comment as I’d like to learn!