Phenotypes associated with rare diseases

Data analytics projects
R
Python
Rare diseases
Author

Jennifer HY Lin

Published

August 2, 2022

Initial data wrangling

This dataset was also derived and downloaded from Orphanet, as another part in the “rare diseases” series. It contained 37 columns with 112,243 rows originally, which took quite a long time to load on RStudio (or could be due to my laptop capacity…). It loaded relatively faster on Jupyter notebook from Anaconda, so I then decided to clean it up first using Python1 there. Some columns were removed which reduced the total number of columns from 37 to 13, while not changing any of the rows at all. The columns were also renamed to make it easier to read.

Source of dataset

Orphadata: Free access data from Orphanet. © INSERM 1999. Available on http://www.orphadata.org. Data version (XML data version). Dataset (.xml file) from http://www.orphadata.org/cgi-bin/epidemio.html. Latest date of update for the dataset: 14/6/2022 (last accessed 24/7/2022). Creative Commons Attribution 4.0 International.

Photo by Sangharsh Lohakare on Unsplash

The following libraries were used for the exploratory data analysis:

library(tidyverse)
library(lubridate)
library(ggplot2)
library(knitr)

Read imported .csv file after data cleaning in Python.

df <- read_csv("rare_disease_phenotypes.csv")
Rows: 112243 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (10): Disorder group, Disorder type, Diagnostic criteria, HPO frequency...
dbl   (2): HPO disorder & clinical entity association count, Disorder Orphacode
dttm  (1): Validation date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note: HPO = human phenotype ontology

spec(df)
cols(
  `Disorder group` = col_character(),
  `Disorder type` = col_character(),
  `HPO disorder & clinical entity association count` = col_double(),
  `Diagnostic criteria` = col_character(),
  `HPO frequency` = col_character(),
  `HPO ID` = col_character(),
  `Preferred HPO term` = col_character(),
  `Disorder name` = col_character(),
  `Disorder Orphacode` = col_double(),
  Online = col_character(),
  Source = col_character(),
  `Validation date` = col_datetime(format = ""),
  `Validation status` = col_character()
)

Exploratory data analysis

Since I wasn’t intending for this project2 to be extremely long (as most people would likely lose interests by then), I’d like to first ask a question about the dataset, in order to keep it at a reasonably short but informative length. So, here’s the question: what are the most common rare disorders and their associated phenotypic features?

To answer it, let’s observe the spread of the disorder groups and types first by formulating a contingency table.

df_type <- df %>% 
  group_by(`Disorder group`,`Disorder type`) %>% 
  summarise(Number = n())
df_type
# A tibble: 11 × 3
# Groups:   Disorder group [3]
   `Disorder group`    `Disorder type`                                    Number
   <chr>               <chr>                                               <int>
 1 Disorder            Biological anomaly                                     41
 2 Disorder            Clinical syndrome                                     661
 3 Disorder            Disease                                             57920
 4 Disorder            Malformation syndrome                               37634
 5 Disorder            Morphological anomaly                                2644
 6 Disorder            Particular clinical situation in a disease or syn…    418
 7 Group of disorders  Category                                              479
 8 Group of disorders  Clinical group                                        952
 9 Subtype of disorder Clinical subtype                                     7394
10 Subtype of disorder Etiological subtype                                  4060
11 Subtype of disorder Histopathological subtype                              40

After a quick view on the column of “Disorder group”, it mainly provided different disorder types a group label for each, which to a certain extent, was not necessary at this early stage. So this column was removed for now from the contingency table, in order to focus solely on, “Disorder type” with the number of counts (or times it appeared in the dataset).

df_type <- df %>% 
  group_by(`Disorder type`) %>% 
  summarise(Number = n())
df_type
# A tibble: 11 × 2
   `Disorder type`                                        Number
   <chr>                                                   <int>
 1 Biological anomaly                                         41
 2 Category                                                  479
 3 Clinical group                                            952
 4 Clinical subtype                                         7394
 5 Clinical syndrome                                         661
 6 Disease                                                 57920
 7 Etiological subtype                                      4060
 8 Histopathological subtype                                  40
 9 Malformation syndrome                                   37634
10 Morphological anomaly                                    2644
11 Particular clinical situation in a disease or syndrome    418

Then to visualise this in a graphic way, a lollypop chart was built horizontally, with different types of rare disorders on the y-axis and the number of each type on the x-axis.

ggplot(data = df_type, aes(x = `Disorder type`, y = `Number`)) +
  geom_segment(aes(x = `Disorder type`, xend = `Disorder type`, y = 0, yend = `Number`), colour = "dark blue") +
  geom_point(colour = "dark green", size = 2, alpha = 0.6) +
  theme_light() +
  coord_flip() 

Two disorder types stood out the most, with “Disease” type appeared 57,920 times and “Malformation syndrome” at 37,634 times. To understand further what each of these two disorder types were, a direct reference3 was used. According to the source of the dataset:

The definition of “Disease” in the rare disorder context was “a disorder with homogeneous therapeutic possibilities and an identified physiopathological mechanism…”, one thing also worth noting was that this type did not include any developmental anomalies.

For “Malformation syndrome”, this was defined as, “A disorder resulting from a developmental anomaly involving more than one morphogenetic field. Malformative sequences and associations are included.”

To demonstrate this in a tabular form, with corresponding proportions of each disorder type in the dataset, the following code were used:

df1 <- df %>% 
  group_by(`Disorder type`) %>% 
  summarise(n = n()) %>% 
  mutate(prop = n/sum(n))
df1
# A tibble: 11 × 3
   `Disorder type`                                            n     prop
   <chr>                                                  <int>    <dbl>
 1 Biological anomaly                                        41 0.000365
 2 Category                                                 479 0.00427 
 3 Clinical group                                           952 0.00848 
 4 Clinical subtype                                        7394 0.0659  
 5 Clinical syndrome                                        661 0.00589 
 6 Disease                                                57920 0.516   
 7 Etiological subtype                                     4060 0.0362  
 8 Histopathological subtype                                 40 0.000356
 9 Malformation syndrome                                  37634 0.335   
10 Morphological anomaly                                   2644 0.0236  
11 Particular clinical situation in a disease or syndrome   418 0.00372 

The table was then rearranged with proportions in descending order (from highest to lowest). It also showed the top two were “Disease” (51.6%) and “Malformation syndrome” (33.5%).

df1 %>% arrange(desc(prop))
# A tibble: 11 × 3
   `Disorder type`                                            n     prop
   <chr>                                                  <int>    <dbl>
 1 Disease                                                57920 0.516   
 2 Malformation syndrome                                  37634 0.335   
 3 Clinical subtype                                        7394 0.0659  
 4 Etiological subtype                                     4060 0.0362  
 5 Morphological anomaly                                   2644 0.0236  
 6 Clinical group                                           952 0.00848 
 7 Clinical syndrome                                        661 0.00589 
 8 Category                                                 479 0.00427 
 9 Particular clinical situation in a disease or syndrome   418 0.00372 
10 Biological anomaly                                        41 0.000365
11 Histopathological subtype                                 40 0.000356
Distributions of HPO frequency

This was followed by checking out the distributions of HPO frequency to see which categories had the most and least number of counts.

df_freq <- df %>% 
  count(`HPO frequency`) %>% 
  arrange(desc(n))
df_freq
# A tibble: 7 × 2
  `HPO frequency`            n
  <chr>                  <int>
1 Occasional (29-5%)     41140
2 Frequent (79-30%)      37480
3 Very frequent (99-80%) 25892
4 Very rare (<4-1%)       6414
5 Excluded (0%)            705
6 Obligate (100%)          610
7 <NA>                       2

Results for rare disorders with obligate or 100% frequency in patient’s populations were then filtered, showing disorder type, HPO frequency and disorder name. Specifically, I wanted to find out the disorder names associated with the “Disease” disorder type with HPO frequency of “Obligate (100%)”.

df_freq_ob <- df %>% 
  filter(`Disorder type` == "Disease", `HPO frequency` == "Obligate (100%)") %>% 
  select(`Disorder type`, `HPO frequency`, `Disorder name`)
df_freq_ob
# A tibble: 404 × 3
   `Disorder type` `HPO frequency` `Disorder name`                              
   <chr>           <chr>           <chr>                                        
 1 Disease         Obligate (100%) Retinoblastoma                               
 2 Disease         Obligate (100%) Parathyroid carcinoma                        
 3 Disease         Obligate (100%) Pituitary carcinoma                          
 4 Disease         Obligate (100%) Familial hypocalciuric hypercalcemia         
 5 Disease         Obligate (100%) Familial hypocalciuric hypercalcemia         
 6 Disease         Obligate (100%) Ravine syndrome                              
 7 Disease         Obligate (100%) Ravine syndrome                              
 8 Disease         Obligate (100%) Interstitial granulomatous dermatitis with a…
 9 Disease         Obligate (100%) Interstitial granulomatous dermatitis with a…
10 Disease         Obligate (100%) PLIN1-related familial partial lipodystrophy 
# … with 394 more rows
# ℹ Use `print(n = ...)` to see more rows

I’d then like to look into associated counts of appearance of each disorder name. When I cross-checked with the full dataset in table view, I’ve noted that the number of appearance of each disorder name is linked to the number of preferred HPO phenotype terms for each of these disorder types.

df2 <- df_freq_ob %>% 
  count(`Disorder name`) 
df2 %>% arrange(desc(n))
# A tibble: 239 × 2
   `Disorder name`                                                             n
   <chr>                                                                   <int>
 1 Autosomal recessive complex spastic paraplegia due to Kennedy pathway …    10
 2 STT3A-CDG                                                                   9
 3 STT3B-CDG                                                                   9
 4 Spastic paraplegia-Paget disease of bone syndrome                           8
 5 Oculocutaneous albinism type 5                                              7
 6 PLIN1-related familial partial lipodystrophy                                7
 7 Plummer-Vinson syndrome                                                     5
 8 SSR4-CDG                                                                    5
 9 Cholesterol-ester transfer protein deficiency                               4
10 Isolated follicle stimulating hormone deficiency                            4
# … with 229 more rows
# ℹ Use `print(n = ...)` to see more rows

To show this, let’s link preferred HPO terms to a disorder name such as this one, “Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction”, which had the “Disease” disorder type with obligate or 100% HPO frequency.

df_disease <- df %>% 
  filter(`Disorder type` == "Disease", `HPO frequency` == "Obligate (100%)", `Disorder name` == "Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction") %>% 
  select(`Disorder type`, `HPO frequency`, `Disorder name`, `Preferred HPO term`)
kable(df_disease)
Disorder type HPO frequency Disorder name Preferred HPO term
Disease Obligate (100%) Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction Progressive spastic paraplegia
Disease Obligate (100%) Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction Microcephaly
Disease Obligate (100%) Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction Moderately short stature
Disease Obligate (100%) Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction Nasal, dysarthic speech
Disease Obligate (100%) Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction Delayed gross motor development
Disease Obligate (100%) Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction Progressive spasticity
Disease Obligate (100%) Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction Lower limb hyperreflexia
Disease Obligate (100%) Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction Ankle clonus
Disease Obligate (100%) Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction Retinal pigment epithelial mottling
Disease Obligate (100%) Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction Progressive spastic paraparesis

As shown in the dataframe above, under the column name, “Preferred HPO term”, there were a total of ten different HPO phenotype terms associated with this particular rare disease with 100% HPO frequency within the patient population for this specific type of spastic paraplegia.

By using similar filtering method, we could quickly narrow down any particular rare disease of interest to find out specific phenotype or clinical features, along with associated HPO phenotype frequency, for further investigations.

For “Malformation syndrome”, a similar search process was used to find out what was the most common phenotypes associated with it.

df_freq_ma <- df %>% 
  filter(`Disorder type` == "Malformation syndrome", `HPO frequency` == "Obligate (100%)") %>%
  select(`Disorder type`, `HPO frequency`, `Disorder name`)
df_freq_ma
# A tibble: 125 × 3
   `Disorder type`       `HPO frequency` `Disorder name`                  
   <chr>                 <chr>           <chr>                            
 1 Malformation syndrome Obligate (100%) CLAPO syndrome                   
 2 Malformation syndrome Obligate (100%) CLAPO syndrome                   
 3 Malformation syndrome Obligate (100%) Weaver-Williams syndrome         
 4 Malformation syndrome Obligate (100%) Weaver-Williams syndrome         
 5 Malformation syndrome Obligate (100%) Weaver-Williams syndrome         
 6 Malformation syndrome Obligate (100%) Weaver-Williams syndrome         
 7 Malformation syndrome Obligate (100%) Weaver-Williams syndrome         
 8 Malformation syndrome Obligate (100%) Weaver-Williams syndrome         
 9 Malformation syndrome Obligate (100%) Lethal recessive chondrodysplasia
10 Malformation syndrome Obligate (100%) Lethal recessive chondrodysplasia
# … with 115 more rows
# ℹ Use `print(n = ...)` to see more rows

Count() was used to find out the number of appearance of each disorder name in descending order.

df3 <- df_freq_ma %>% 
  count(`Disorder name`)
df3 %>% arrange(desc(n))
# A tibble: 40 × 2
   `Disorder name`                                                             n
   <chr>                                                                   <int>
 1 Hydrocephalus-obesity-hypogonadism syndrome                                12
 2 Pelviscapular dysplasia                                                    11
 3 46,XX disorder of sex development-skeletal anomalies syndrome               9
 4 X-linked microcephaly-growth retardation-prognathism-cryptorchidism sy…     9
 5 Severe intellectual disability-hypotonia-strabismus-coarse face-planov…     7
 6 Lethal recessive chondrodysplasia                                           6
 7 Weaver-Williams syndrome                                                    6
 8 SERKAL syndrome                                                             5
 9 Patent ductus arteriosus-bicuspid aortic valve-hand anomalies syndrome      4
10 46,XX gonadal dysgenesis                                                    3
# … with 30 more rows
# ℹ Use `print(n = ...)` to see more rows

To show one of the examples of the most common malformation syndrome with the most associated phenotypic features (with a total of 12 different phenotypic descriptions):

df_mal_syn <- df %>%
  filter(`Disorder type` == "Malformation syndrome", `HPO frequency` == "Obligate (100%)", `Disorder name` == "Hydrocephalus-obesity-hypogonadism syndrome") %>% 
  select(`Disorder type`, `HPO frequency`, `Disorder name`, `Preferred HPO term`)
kable(df_mal_syn)
Disorder type HPO frequency Disorder name Preferred HPO term
Malformation syndrome Obligate (100%) Hydrocephalus-obesity-hypogonadism syndrome Hydrocephalus
Malformation syndrome Obligate (100%) Hydrocephalus-obesity-hypogonadism syndrome Short neck
Malformation syndrome Obligate (100%) Hydrocephalus-obesity-hypogonadism syndrome Gynecomastia
Malformation syndrome Obligate (100%) Hydrocephalus-obesity-hypogonadism syndrome Hypergonadotropic hypogonadism
Malformation syndrome Obligate (100%) Hydrocephalus-obesity-hypogonadism syndrome Intellectual disability, mild
Malformation syndrome Obligate (100%) Hydrocephalus-obesity-hypogonadism syndrome Obesity
Malformation syndrome Obligate (100%) Hydrocephalus-obesity-hypogonadism syndrome Mitral valve prolapse
Malformation syndrome Obligate (100%) Hydrocephalus-obesity-hypogonadism syndrome Low posterior hairline
Malformation syndrome Obligate (100%) Hydrocephalus-obesity-hypogonadism syndrome High, narrow palate
Malformation syndrome Obligate (100%) Hydrocephalus-obesity-hypogonadism syndrome Cubitus valgus
Malformation syndrome Obligate (100%) Hydrocephalus-obesity-hypogonadism syndrome Short stature
Malformation syndrome Obligate (100%) Hydrocephalus-obesity-hypogonadism syndrome Short 4th metacarpal
Explore rare disease validation date

Now, to add one more piece of work towards this exploratory data analysis, I thought to check out the Validation date column. “Validation date” in this context meant the dates when the annotations of HPO terms were made for each rare disorder, which were based on the source articles listed (as shown in the Source column).

Firstly, I started with the “Disease” disorder type and singled out the year component from the Validation date column.

df_val_date <- df %>% 
  mutate(year = year(`Validation date`), label = TRUE, abbr = FALSE)
df_val_date
# A tibble: 112,243 × 16
   Disorder gr…¹ Disor…² HPO d…³ Diagn…⁴ HPO f…⁵ HPO I…⁶ Prefe…⁷ Disor…⁸ Disor…⁹
   <chr>         <chr>     <dbl> <chr>   <chr>   <chr>   <chr>   <chr>     <dbl>
 1 Disorder      Disease      59 Diagno… Very f… HP:000… Pectus… Marfan…     558
 2 Disorder      Disease      59 Diagno… Very f… HP:000… Striae… Marfan…     558
 3 Disorder      Disease      59 Diagno… Very f… HP:000… Arachn… Marfan…     558
 4 Disorder      Disease      59 Diagno… Very f… HP:000… Dispro… Marfan…     558
 5 Disorder      Disease      59 Diagno… Very f… HP:000… Pes pl… Marfan…     558
 6 Disorder      Disease      59 Diagno… Very f… HP:000… Sponta… Marfan…     558
 7 Disorder      Disease      59 Diagno… Very f… HP:000… Dilata… Marfan…     558
 8 Disorder      Disease      59 Diagno… Freque… HP:000… Myopia  Marfan…     558
 9 Disorder      Disease      59 Diagno… Freque… HP:000… Dental… Marfan…     558
10 Disorder      Disease      59 Diagno… Freque… HP:000… Pectus… Marfan…     558
# … with 112,233 more rows, 7 more variables: Online <chr>, Source <chr>,
#   `Validation date` <dttm>, `Validation status` <chr>, year <dbl>,
#   label <lgl>, abbr <lgl>, and abbreviated variable names ¹​`Disorder group`,
#   ²​`Disorder type`, ³​`HPO disorder & clinical entity association count`,
#   ⁴​`Diagnostic criteria`, ⁵​`HPO frequency`, ⁶​`HPO ID`, ⁷​`Preferred HPO term`,
#   ⁸​`Disorder name`, ⁹​`Disorder Orphacode`
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

To show this in a dataframe, observations for “Disease” disorder type were shown by using a filter:

df_val_date_d <- df_val_date %>% 
  select(`Disorder type`, year) %>% 
  filter(`Disorder type` == "Disease")
df_val_date_d
# A tibble: 57,920 × 2
   `Disorder type`  year
   <chr>           <dbl>
 1 Disease          2016
 2 Disease          2016
 3 Disease          2016
 4 Disease          2016
 5 Disease          2016
 6 Disease          2016
 7 Disease          2016
 8 Disease          2016
 9 Disease          2016
10 Disease          2016
# … with 57,910 more rows
# ℹ Use `print(n = ...)` to see more rows

Then to make it easier to visualise, the year counts were plotted in a bar graph. Interestingly, 2016 seemed to be the year for rare disorders to be annotated with the most phenotypic features (if referring back to the original dataset, each observation or row was present for a unique “Preferred HPO term” or phenotypic abnormality).

df_val_date_d %>% 
  ggplot(aes(x = year)) +
  geom_bar()
Warning: Removed 49 rows containing non-finite values (stat_count).

It was also worth noting that there were 49 rows of non-finite values excluded from the bar graph above. To look into this, a count on the year column of the dataframe df_val_date_d was done, which confirmed that these were the “NA” or missing values in the validation date column.

df_val_date_d %>% 
  count(year)
# A tibble: 8 × 2
   year     n
  <dbl> <int>
1  2015   567
2  2016 14193
3  2017  5419
4  2018  6297
5  2019 10525
6  2020  9402
7  2021 11468
8    NA    49

Summary

To quickly summarise key findings from this work4 regarding phenotypes associated with rare diseases:

  • Autosomal recessive complex spastic paraplegia due to Kennedy pathway dysfunction was one of the most common rare diseases under the Disease disorder type with the most phenotypic abnormalities recorded, which were:
  1. progressive spastic paraplegia
  2. microcephaly
  3. moderately short stature
  4. nasal, dysarthic speech
  5. delayed gross motor development
  6. progressive spasticity
  7. lower limb hyperreflexia
  8. ankle clonus
  9. retinal pigment epithelial mottling
  10. progressive spastic paraparesis
  • For malformation syndrome of the rare disorder type, Hydrocephalus-obesity-hypogonadism syndrome was found to be one of the most common rare diseases with the most phenotypic abnormalities recorded, which were:
  1. hydrocephalus
  2. short neck
  3. gynecomastia
  4. hypergonadotropic hypogonadism
  5. intellectual disability, mild
  6. obesity
  7. mitral valve prolapse
  8. low posterior hairline
  9. high, narrow palate
  10. cubitus valgus
  11. short stature
  12. short 4th metacarpal
  • The year of 2016 had the highest number of HPO terms or phenotypic abnormalities annotated to rare disorders from specific named source articles, and on the contrary, 2015 had the lowest counts from the dataset

Footnotes

  1. Used only for initial data cleaning stage - please see this GitHub link for details. R was used for the rest of the analysis↩︎

  2. This work is under CC BY-SA 4.0 International License if anyone is interested in exploring the dataset further↩︎

  3. “Orphadata: Free access products description” - April 2020. http://www.orphadata.org/cgi-bin/img/PDF/OrphadataFreeAccessProductsDescription.pdf. Version 2↩︎

  4. It’s possible to dig further into the dataset e.g. diagnostic criterion and perhaps even bring back some of the columns removed initially, however due to time constraints (due to being a one-person team and also I’d like to start on the COVID-19 antiviral work soon), I’ll leave some room here for the interested to work on the data↩︎