library(tidyverse)
library(lubridate)
library(ggplot2)
library(knitr)
Initial data wrangling
This dataset was also derived and downloaded from Orphanet, as another part in the “rare diseases” series. It contained 37 columns with 112,243 rows originally, which took quite a long time to load on RStudio (or could be due to my laptop capacity…). It loaded relatively faster on Jupyter notebook from Anaconda, so I then decided to clean it up first using Python1 there. Some columns were removed which reduced the total number of columns from 37 to 13, while not changing any of the rows at all. The columns were also renamed to make it easier to read.
Source of dataset
Orphadata: Free access data from Orphanet. © INSERM 1999. Available on http://www.orphadata.org. Data version (XML data version). Dataset (.xml file) from http://www.orphadata.org/cgi-bin/epidemio.html. Latest date of update for the dataset: 14/6/2022 (last accessed 24/7/2022). Creative Commons Attribution 4.0 International.
Photo by Sangharsh Lohakare on Unsplash
The following libraries were used for the exploratory data analysis:
Read imported .csv file after data cleaning in Python.
<- read_csv("rare_disease_phenotypes.csv") df
Rows: 112243 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Disorder group, Disorder type, Diagnostic criteria, HPO frequency...
dbl (2): HPO disorder & clinical entity association count, Disorder Orphacode
dttm (1): Validation date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note: HPO = human phenotype ontology
spec(df)
cols(
`Disorder group` = col_character(),
`Disorder type` = col_character(),
`HPO disorder & clinical entity association count` = col_double(),
`Diagnostic criteria` = col_character(),
`HPO frequency` = col_character(),
`HPO ID` = col_character(),
`Preferred HPO term` = col_character(),
`Disorder name` = col_character(),
`Disorder Orphacode` = col_double(),
Online = col_character(),
Source = col_character(),
`Validation date` = col_datetime(format = ""),
`Validation status` = col_character()
)
Exploratory data analysis
Since I wasn’t intending for this project2 to be extremely long (as most people would likely lose interests by then), I’d like to first ask a question about the dataset, in order to keep it at a reasonably short but informative length. So, here’s the question: what are the most common rare disorders and their associated phenotypic features?
To answer it, let’s observe the spread of the disorder groups and types first by formulating a contingency table.
<- df %>%
df_type group_by(`Disorder group`,`Disorder type`) %>%
summarise(Number = n())
df_type
# A tibble: 11 × 3
# Groups: Disorder group [3]
`Disorder group` `Disorder type` Number
<chr> <chr> <int>
1 Disorder Biological anomaly 41
2 Disorder Clinical syndrome 661
3 Disorder Disease 57920
4 Disorder Malformation syndrome 37634
5 Disorder Morphological anomaly 2644
6 Disorder Particular clinical situation in a disease or syn… 418
7 Group of disorders Category 479
8 Group of disorders Clinical group 952
9 Subtype of disorder Clinical subtype 7394
10 Subtype of disorder Etiological subtype 4060
11 Subtype of disorder Histopathological subtype 40
After a quick view on the column of “Disorder group”, it mainly provided different disorder types a group label for each, which to a certain extent, was not necessary at this early stage. So this column was removed for now from the contingency table, in order to focus solely on, “Disorder type” with the number of counts (or times it appeared in the dataset).
<- df %>%
df_type group_by(`Disorder type`) %>%
summarise(Number = n())
df_type
# A tibble: 11 × 2
`Disorder type` Number
<chr> <int>
1 Biological anomaly 41
2 Category 479
3 Clinical group 952
4 Clinical subtype 7394
5 Clinical syndrome 661
6 Disease 57920
7 Etiological subtype 4060
8 Histopathological subtype 40
9 Malformation syndrome 37634
10 Morphological anomaly 2644
11 Particular clinical situation in a disease or syndrome 418
Then to visualise this in a graphic way, a lollypop chart was built horizontally, with different types of rare disorders on the y-axis and the number of each type on the x-axis.
ggplot(data = df_type, aes(x = `Disorder type`, y = `Number`)) +
geom_segment(aes(x = `Disorder type`, xend = `Disorder type`, y = 0, yend = `Number`), colour = "dark blue") +
geom_point(colour = "dark green", size = 2, alpha = 0.6) +
theme_light() +
coord_flip()
Two disorder types stood out the most, with “Disease” type appeared 57,920 times and “Malformation syndrome” at 37,634 times. To understand further what each of these two disorder types were, a direct reference3 was used. According to the source of the dataset:
The definition of “Disease” in the rare disorder context was “a disorder with homogeneous therapeutic possibilities and an identified physiopathological mechanism…”, one thing also worth noting was that this type did not include any developmental anomalies.
For “Malformation syndrome”, this was defined as, “A disorder resulting from a developmental anomaly involving more than one morphogenetic field. Malformative sequences and associations are included.”
To demonstrate this in a tabular form, with corresponding proportions of each disorder type in the dataset, the following code were used:
<- df %>%
df1 group_by(`Disorder type`) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n))
df1
# A tibble: 11 × 3
`Disorder type` n prop
<chr> <int> <dbl>
1 Biological anomaly 41 0.000365
2 Category 479 0.00427
3 Clinical group 952 0.00848
4 Clinical subtype 7394 0.0659
5 Clinical syndrome 661 0.00589
6 Disease 57920 0.516
7 Etiological subtype 4060 0.0362
8 Histopathological subtype 40 0.000356
9 Malformation syndrome 37634 0.335
10 Morphological anomaly 2644 0.0236
11 Particular clinical situation in a disease or syndrome 418 0.00372
The table was then rearranged with proportions in descending order (from highest to lowest). It also showed the top two were “Disease” (51.6%) and “Malformation syndrome” (33.5%).
%>% arrange(desc(prop)) df1
# A tibble: 11 × 3
`Disorder type` n prop
<chr> <int> <dbl>
1 Disease 57920 0.516
2 Malformation syndrome 37634 0.335
3 Clinical subtype 7394 0.0659
4 Etiological subtype 4060 0.0362
5 Morphological anomaly 2644 0.0236
6 Clinical group 952 0.00848
7 Clinical syndrome 661 0.00589
8 Category 479 0.00427
9 Particular clinical situation in a disease or syndrome 418 0.00372
10 Biological anomaly 41 0.000365
11 Histopathological subtype 40 0.000356
Distributions of HPO frequency
This was followed by checking out the distributions of HPO frequency to see which categories had the most and least number of counts.
<- df %>%
df_freq count(`HPO frequency`) %>%
arrange(desc(n))
df_freq
# A tibble: 7 × 2
`HPO frequency` n
<chr> <int>
1 Occasional (29-5%) 41140
2 Frequent (79-30%) 37480
3 Very frequent (99-80%) 25892
4 Very rare (<4-1%) 6414
5 Excluded (0%) 705
6 Obligate (100%) 610
7 <NA> 2
Results for rare disorders with obligate or 100% frequency in patient’s populations were then filtered, showing disorder type, HPO frequency and disorder name. Specifically, I wanted to find out the disorder names associated with the “Disease” disorder type with HPO frequency of “Obligate (100%)”.
<- df %>%
df_freq_ob filter(`Disorder type` == "Disease", `HPO frequency` == "Obligate (100%)") %>%
select(`Disorder type`, `HPO frequency`, `Disorder name`)
df_freq_ob
# A tibble: 404 × 3
`Disorder type` `HPO frequency` `Disorder name`
<chr> <chr> <chr>
1 Disease Obligate (100%) Retinoblastoma
2 Disease Obligate (100%) Parathyroid carcinoma
3 Disease Obligate (100%) Pituitary carcinoma
4 Disease Obligate (100%) Familial hypocalciuric hypercalcemia
5 Disease Obligate (100%) Familial hypocalciuric hypercalcemia
6 Disease Obligate (100%) Ravine syndrome
7 Disease Obligate (100%) Ravine syndrome
8 Disease Obligate (100%) Interstitial granulomatous dermatitis with a…
9 Disease Obligate (100%) Interstitial granulomatous dermatitis with a…
10 Disease Obligate (100%) PLIN1-related familial partial lipodystrophy
# … with 394 more rows
# ℹ Use `print(n = ...)` to see more rows
I’d then like to look into associated counts of appearance of each disorder name. When I cross-checked with the full dataset in table view, I’ve noted that the number of appearance of each disorder name is linked to the number of preferred HPO phenotype terms for each of these disorder types.
<- df_freq_ob %>%
df2 count(`Disorder name`)
%>% arrange(desc(n)) df2
# A tibble: 239 × 2
`Disorder name` n
<chr> <int>
1 Autosomal recessive complex spastic paraplegia due to Kennedy pathway … 10
2 STT3A-CDG 9
3 STT3B-CDG 9
4 Spastic paraplegia-Paget disease of bone syndrome 8
5 Oculocutaneous albinism type 5 7
6 PLIN1-related familial partial lipodystrophy 7
7 Plummer-Vinson syndrome 5
8 SSR4-CDG 5
9 Cholesterol-ester transfer protein deficiency 4
10 Isolated follicle stimulating hormone deficiency 4
# … with 229 more rows
# ℹ Use `print(n = ...)` to see more rows
To show this, let’s link preferred HPO terms to a disorder name such as this one, “Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction”, which had the “Disease” disorder type with obligate or 100% HPO frequency.
<- df %>%
df_disease filter(`Disorder type` == "Disease", `HPO frequency` == "Obligate (100%)", `Disorder name` == "Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction") %>%
select(`Disorder type`, `HPO frequency`, `Disorder name`, `Preferred HPO term`)
kable(df_disease)
Disorder type | HPO frequency | Disorder name | Preferred HPO term |
---|---|---|---|
Disease | Obligate (100%) | Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction | Progressive spastic paraplegia |
Disease | Obligate (100%) | Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction | Microcephaly |
Disease | Obligate (100%) | Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction | Moderately short stature |
Disease | Obligate (100%) | Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction | Nasal, dysarthic speech |
Disease | Obligate (100%) | Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction | Delayed gross motor development |
Disease | Obligate (100%) | Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction | Progressive spasticity |
Disease | Obligate (100%) | Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction | Lower limb hyperreflexia |
Disease | Obligate (100%) | Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction | Ankle clonus |
Disease | Obligate (100%) | Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction | Retinal pigment epithelial mottling |
Disease | Obligate (100%) | Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction | Progressive spastic paraparesis |
As shown in the dataframe above, under the column name, “Preferred HPO term”, there were a total of ten different HPO phenotype terms associated with this particular rare disease with 100% HPO frequency within the patient population for this specific type of spastic paraplegia.
By using similar filtering method, we could quickly narrow down any particular rare disease of interest to find out specific phenotype or clinical features, along with associated HPO phenotype frequency, for further investigations.
For “Malformation syndrome”, a similar search process was used to find out what was the most common phenotypes associated with it.
<- df %>%
df_freq_ma filter(`Disorder type` == "Malformation syndrome", `HPO frequency` == "Obligate (100%)") %>%
select(`Disorder type`, `HPO frequency`, `Disorder name`)
df_freq_ma
# A tibble: 125 × 3
`Disorder type` `HPO frequency` `Disorder name`
<chr> <chr> <chr>
1 Malformation syndrome Obligate (100%) CLAPO syndrome
2 Malformation syndrome Obligate (100%) CLAPO syndrome
3 Malformation syndrome Obligate (100%) Weaver-Williams syndrome
4 Malformation syndrome Obligate (100%) Weaver-Williams syndrome
5 Malformation syndrome Obligate (100%) Weaver-Williams syndrome
6 Malformation syndrome Obligate (100%) Weaver-Williams syndrome
7 Malformation syndrome Obligate (100%) Weaver-Williams syndrome
8 Malformation syndrome Obligate (100%) Weaver-Williams syndrome
9 Malformation syndrome Obligate (100%) Lethal recessive chondrodysplasia
10 Malformation syndrome Obligate (100%) Lethal recessive chondrodysplasia
# … with 115 more rows
# ℹ Use `print(n = ...)` to see more rows
Count() was used to find out the number of appearance of each disorder name in descending order.
<- df_freq_ma %>%
df3 count(`Disorder name`)
%>% arrange(desc(n)) df3
# A tibble: 40 × 2
`Disorder name` n
<chr> <int>
1 Hydrocephalus-obesity-hypogonadism syndrome 12
2 Pelviscapular dysplasia 11
3 46,XX disorder of sex development-skeletal anomalies syndrome 9
4 X-linked microcephaly-growth retardation-prognathism-cryptorchidism sy… 9
5 Severe intellectual disability-hypotonia-strabismus-coarse face-planov… 7
6 Lethal recessive chondrodysplasia 6
7 Weaver-Williams syndrome 6
8 SERKAL syndrome 5
9 Patent ductus arteriosus-bicuspid aortic valve-hand anomalies syndrome 4
10 46,XX gonadal dysgenesis 3
# … with 30 more rows
# ℹ Use `print(n = ...)` to see more rows
To show one of the examples of the most common malformation syndrome with the most associated phenotypic features (with a total of 12 different phenotypic descriptions):
<- df %>%
df_mal_syn filter(`Disorder type` == "Malformation syndrome", `HPO frequency` == "Obligate (100%)", `Disorder name` == "Hydrocephalus-obesity-hypogonadism syndrome") %>%
select(`Disorder type`, `HPO frequency`, `Disorder name`, `Preferred HPO term`)
kable(df_mal_syn)
Disorder type | HPO frequency | Disorder name | Preferred HPO term |
---|---|---|---|
Malformation syndrome | Obligate (100%) | Hydrocephalus-obesity-hypogonadism syndrome | Hydrocephalus |
Malformation syndrome | Obligate (100%) | Hydrocephalus-obesity-hypogonadism syndrome | Short neck |
Malformation syndrome | Obligate (100%) | Hydrocephalus-obesity-hypogonadism syndrome | Gynecomastia |
Malformation syndrome | Obligate (100%) | Hydrocephalus-obesity-hypogonadism syndrome | Hypergonadotropic hypogonadism |
Malformation syndrome | Obligate (100%) | Hydrocephalus-obesity-hypogonadism syndrome | Intellectual disability, mild |
Malformation syndrome | Obligate (100%) | Hydrocephalus-obesity-hypogonadism syndrome | Obesity |
Malformation syndrome | Obligate (100%) | Hydrocephalus-obesity-hypogonadism syndrome | Mitral valve prolapse |
Malformation syndrome | Obligate (100%) | Hydrocephalus-obesity-hypogonadism syndrome | Low posterior hairline |
Malformation syndrome | Obligate (100%) | Hydrocephalus-obesity-hypogonadism syndrome | High, narrow palate |
Malformation syndrome | Obligate (100%) | Hydrocephalus-obesity-hypogonadism syndrome | Cubitus valgus |
Malformation syndrome | Obligate (100%) | Hydrocephalus-obesity-hypogonadism syndrome | Short stature |
Malformation syndrome | Obligate (100%) | Hydrocephalus-obesity-hypogonadism syndrome | Short 4th metacarpal |
Explore rare disease validation date
Now, to add one more piece of work towards this exploratory data analysis, I thought to check out the Validation date column. “Validation date” in this context meant the dates when the annotations of HPO terms were made for each rare disorder, which were based on the source articles listed (as shown in the Source column).
Firstly, I started with the “Disease” disorder type and singled out the year component from the Validation date column.
<- df %>%
df_val_date mutate(year = year(`Validation date`), label = TRUE, abbr = FALSE)
df_val_date
# A tibble: 112,243 × 16
Disorder gr…¹ Disor…² HPO d…³ Diagn…⁴ HPO f…⁵ HPO I…⁶ Prefe…⁷ Disor…⁸ Disor…⁹
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
1 Disorder Disease 59 Diagno… Very f… HP:000… Pectus… Marfan… 558
2 Disorder Disease 59 Diagno… Very f… HP:000… Striae… Marfan… 558
3 Disorder Disease 59 Diagno… Very f… HP:000… Arachn… Marfan… 558
4 Disorder Disease 59 Diagno… Very f… HP:000… Dispro… Marfan… 558
5 Disorder Disease 59 Diagno… Very f… HP:000… Pes pl… Marfan… 558
6 Disorder Disease 59 Diagno… Very f… HP:000… Sponta… Marfan… 558
7 Disorder Disease 59 Diagno… Very f… HP:000… Dilata… Marfan… 558
8 Disorder Disease 59 Diagno… Freque… HP:000… Myopia Marfan… 558
9 Disorder Disease 59 Diagno… Freque… HP:000… Dental… Marfan… 558
10 Disorder Disease 59 Diagno… Freque… HP:000… Pectus… Marfan… 558
# … with 112,233 more rows, 7 more variables: Online <chr>, Source <chr>,
# `Validation date` <dttm>, `Validation status` <chr>, year <dbl>,
# label <lgl>, abbr <lgl>, and abbreviated variable names ¹`Disorder group`,
# ²`Disorder type`, ³`HPO disorder & clinical entity association count`,
# ⁴`Diagnostic criteria`, ⁵`HPO frequency`, ⁶`HPO ID`, ⁷`Preferred HPO term`,
# ⁸`Disorder name`, ⁹`Disorder Orphacode`
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
To show this in a dataframe, observations for “Disease” disorder type were shown by using a filter:
<- df_val_date %>%
df_val_date_d select(`Disorder type`, year) %>%
filter(`Disorder type` == "Disease")
df_val_date_d
# A tibble: 57,920 × 2
`Disorder type` year
<chr> <dbl>
1 Disease 2016
2 Disease 2016
3 Disease 2016
4 Disease 2016
5 Disease 2016
6 Disease 2016
7 Disease 2016
8 Disease 2016
9 Disease 2016
10 Disease 2016
# … with 57,910 more rows
# ℹ Use `print(n = ...)` to see more rows
Then to make it easier to visualise, the year counts were plotted in a bar graph. Interestingly, 2016 seemed to be the year for rare disorders to be annotated with the most phenotypic features (if referring back to the original dataset, each observation or row was present for a unique “Preferred HPO term” or phenotypic abnormality).
%>%
df_val_date_d ggplot(aes(x = year)) +
geom_bar()
Warning: Removed 49 rows containing non-finite values (stat_count).
It was also worth noting that there were 49 rows of non-finite values excluded from the bar graph above. To look into this, a count on the year column of the dataframe df_val_date_d was done, which confirmed that these were the “NA” or missing values in the validation date column.
%>%
df_val_date_d count(year)
# A tibble: 8 × 2
year n
<dbl> <int>
1 2015 567
2 2016 14193
3 2017 5419
4 2018 6297
5 2019 10525
6 2020 9402
7 2021 11468
8 NA 49
Summary
To quickly summarise key findings from this work4 regarding phenotypes associated with rare diseases:
- Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction was one of the most common rare diseases under the Disease disorder type with the most phenotypic abnormalities recorded, which were:
- progressive spastic paraplegia
- microcephaly
- moderately short stature
- nasal, dysarthic speech
- delayed gross motor development
- progressive spasticity
- lower limb hyperreflexia
- ankle clonus
- retinal pigment epithelial mottling
- progressive spastic paraparesis
- For malformation syndrome of the rare disorder type, Hydrocephalus-obesity-hypogonadism syndrome was found to be one of the most common rare diseases with the most phenotypic abnormalities recorded, which were:
- hydrocephalus
- short neck
- gynecomastia
- hypergonadotropic hypogonadism
- intellectual disability, mild
- obesity
- mitral valve prolapse
- low posterior hairline
- high, narrow palate
- cubitus valgus
- short stature
- short 4th metacarpal
- The year of 2016 had the highest number of HPO terms or phenotypic abnormalities annotated to rare disorders from specific named source articles, and on the contrary, 2015 had the lowest counts from the dataset
Footnotes
Used only for initial data cleaning stage - please see this GitHub link for details. R was used for the rest of the analysis↩︎
This work is under CC BY-SA 4.0 International License if anyone is interested in exploring the dataset further↩︎
“Orphadata: Free access products description” - April 2020. http://www.orphadata.org/cgi-bin/img/PDF/OrphadataFreeAccessProductsDescription.pdf. Version 2↩︎
It’s possible to dig further into the dataset e.g. diagnostic criterion and perhaps even bring back some of the columns removed initially, however due to time constraints (due to being a one-person team and also I’d like to start on the COVID-19 antiviral work soon), I’ll leave some room here for the interested to work on the data↩︎