COVID-19 Transmission Analysis with Topic Modelling

Overview

The COVID-19 pandemic has negatively affected many British Columbians and resulted in business closures and job losses, including my own. In April 2022, although most British Columbians were fully vaccinated and restrictions had been eased, the pandemic was still ongoing, and the latest COVID-19 variant, Omicron, was spreading worldwide. I started this project to learn more about COVID-19 transmission and completed it in April 2022.

The project focused on implementing a machine learning solution, including data preprocessing, data ingestion, data analysis, and modelling, to provide insights into COVID-19 transmission and answer questions such as "How does COVID-19 spread?" and "What measures reduce COVID-19 transmission?"

Summary

  • I configured Apache Airflow and wrote Python scripts to process and ingest the COVID-19 Open Research Dataset (CORD-19) into Elasticsearch.
  • I configured Elasticsearch to store over 300,000 scholarly articles on COVID-19 and other coronaviruses.
  • I developed an LDA topic model to group the data into nine topics of COVID-19 transmission using the Gensim Python library.
  • I developed a Word2Vec model to learn words associated with different topics of COVID-19 transmission using Gensim.
  • I developed a cosine similarity-based recommendation system that recommends scholarly articles on COVID-19 transmission using scikit-learn.

Code

My code can be found here:

Dataset

This project uses the COVID-19 Open Research Dataset (CORD-19) provided by the Allen Institute for AI (AI2). As of April 2022, CORD-19 contained over 360,000 scholarly articles on COVID-19 and other coronaviruses, such as Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS). The dataset contains articles in JSON format.
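
As a rough illustration of the ingestion step, the sketch below flattens one CORD-19-style JSON article into a document and wraps it in the action format accepted by the bulk helper of the Elasticsearch Python client. It is not the project's actual script. The field names (`paper_id`, `metadata.title`, `abstract`, `body_text`) follow the CORD-19 full-text schema as I understand it, and the index name `cord19` and the sample article are hypothetical.

```python
import json


def extract_document(raw_json: str) -> dict:
    """Flatten one CORD-19-style JSON article into a simple dict."""
    paper = json.loads(raw_json)
    # Abstract and body are assumed to be lists of paragraph objects with a "text" key.
    abstract = " ".join(p["text"] for p in paper.get("abstract", []))
    body = " ".join(p["text"] for p in paper.get("body_text", []))
    return {
        "paper_id": paper["paper_id"],
        "title": paper.get("metadata", {}).get("title", ""),
        "abstract": abstract,
        "body": body,
    }


def to_bulk_action(doc: dict, index_name: str = "cord19") -> dict:
    """Wrap a document as a bulk-indexing action (index name is hypothetical)."""
    return {"_index": index_name, "_id": doc["paper_id"], "_source": doc}


# Hypothetical sample article, for illustration only.
sample = json.dumps({
    "paper_id": "abc123",
    "metadata": {"title": "A toy title"},
    "abstract": [{"text": "First paragraph."}, {"text": "Second paragraph."}],
    "body_text": [{"text": "Body text."}],
})

doc = extract_document(sample)
action = to_bulk_action(doc)
print(action["_id"], "->", doc["title"])
```

A list or generator of such actions could then be passed to `elasticsearch.helpers.bulk(client, actions)`, with `client` connected to the cluster.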

Exploratory Data Analysis

I plotted a word cloud to see the most prominent words in the processed data containing COVID-19 transmission-related articles. The following word cloud shows words related to COVID-19 transmission in the data:

Word Cloud for COVID-19 Transmission-Related Articles

I created and visualized n-grams in the data to see words and phrases associated with COVID-19 transmission.

As expected, unigrams such as transmission and COVID-19 appear most frequently in COVID-19 transmission-related articles. Bigrams such as social distancing and airborne transmission are also commonly seen in the data. Trigrams on control measures such as personal protective equipment, public health measures, and nonpharmaceutical interventions npis frequently appear in the data.

I also see terms often used in epidemiology, including index cases, basic reproduction numbers, and secondary attack rates.

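  • An index case is the first documented patient in an outbreak; the term appears frequently in the data.
  • The basic reproduction number is the expected number of secondary cases caused by one infected person in a fully susceptible population; it is a key quantity in epidemic modelling.
  • The secondary attack rate is the proportion of exposed contacts who become infected, typically measured within a family or household.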

Other terms that frequently appear in the data are related to:

  • Transmission Modes/Routes
    • airborne transmission, aerosol transmission, community transmission, household transmission
  • Prevention Measures
    • use face mask, wear face mask
  • Modelling
    • mathematical model, confidence interval ci

Topic Modelling

Topic Model Evaluation

The coherence score in topic modelling measures how interpretable the topics are to humans; the higher the score, the better. I trained 17 topic models, with 2 to 18 topics, and computed the coherence score for each. As shown in the figure below, the model with nine topics has the highest coherence score, 0.5403, so I chose nine as the number of topics.

LDA Coherence Scores

Once I had identified the topic model with the highest coherence score, I printed and visualized each of its topics. Because topic modelling is unsupervised, the model groups the data into unlabelled clusters, so I interpreted each topic myself from its most notable terms.
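
To make the idea of coherence more concrete, here is a small pure-Python sketch of the UMass coherence measure, which scores a topic by how often its top words co-occur in the same documents. This is a stand-in for illustration: the post does not say which coherence measure was used, and the score of 0.5403 above was not computed with this function. The function name and the toy documents are hypothetical.

```python
import math


def umass_coherence(top_words, docs):
    """UMass coherence: sum of log((D(w_m, w_l) + 1) / D(w_l)) over word pairs.

    top_words is assumed to be ordered from most to least probable,
    and docs is a list of tokenized documents.
    """
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # skip words that never occur, to avoid dividing by zero
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / d_l)
    return score


# Toy tokenized documents, for illustration only.
docs = [
    ["mask", "surface", "risk"],
    ["mask", "surface", "health"],
    ["vaccine", "variant", "spike"],
    ["vaccine", "variant", "mutation"],
]

coherent = umass_coherence(["mask", "surface"], docs)    # words that co-occur
incoherent = umass_coherence(["mask", "vaccine"], docs)  # words that never co-occur
print(coherent > incoherent)
```

Averaging such a score over all topics of a model gives one number per model, and the number of topics can then be chosen as the one whose model has the highest average.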

Topic Model Analysis

As discussed in the previous section, the topic model with nine topics has the highest coherence score. The table below shows the ten most notable terms of each of the nine topics output by my model:

| Topic | 10 Most Notable Terms |
|-------|-----------------------|
| 1 | variant, vaccine, vaccination, mutation, viral, sequence, genome, delta, spike, virus |
| 2 | model, disease, epidemic, number, dynamic, outbreak, data, spread, individual, control |
| 3 | virus, viral, human, cell, animal, host, coronavirus, protein, tgev, respiratory |
| 4 | contact, case, household, school, test, asymptomatic, secondary, risk, child, rate |
| 5 | test, positive, sample, vertical, mother, woman, case, vertical_transmission, swab, neonate |
| 6 | droplet, aerosol, air, airborne, risk, ventilation, particle, model, respiratory, rate |
| 7 | virus, mask, risk, health, surface, pandemic, spread, public, evidence, disease |
| 8 | case, number, model, social, data, rate, intervention, measure, country, population |
| 9 | patient, case, hospital, care, healthcare, worker, outbreak, risk, positive, test |

I visualized the topic model using pyLDAvis, which shows the 30 most salient terms for each topic.

In summary, by looking at each topic in the visualization and table above, I infer that the topics are as follows:

| Topic | Topic Description |
|-------|-------------------|
| 1 | Variants and Vaccines |
| 2 | Epidemic Modelling |
| 3 | Virus Hosts (e.g., Animals, Humans) |
| 4 | Household Transmission |
| 5 | Vertical Transmission (Mother-to-Child Transmission) |
| 6 | Modes of Transmission (e.g., Droplets, Aerosols) |
| 7 | Fomite Transmission (Transmission Through Surfaces or Objects) |
| 8 | Prevention Measures (e.g., Lockdowns, Social Distancing) |
| 9 | Nosocomial Transmission (Hospital Transmission) |

My topic model discovered these topics in the COVID-19 transmission-related articles. Topic modelling can reveal topics that you are not aware of. Before working on the project, I had never heard of vertical transmission.

Word2Vec

Word2Vec is a technique to learn word associations from a collection of texts. I developed a Word2Vec model to learn words associated with COVID-19 transmission-related keywords. I tested it with keywords such as airborne transmission, vertical transmission, household transmission, and omicron variant, as shown below:

Word2Vec

The model's results suggest that:

  • Airborne transmission is related to indoor spaces and aerosol droplets.
  • Vertical transmission is connected to pregnant women, cord blood, and breast milk.
  • Household transmission is related to household members and contacts, and to educational settings such as schools.
  • The Omicron variant is a new variant and a variant of concern. It is highly transmissible.

I cannot answer all the questions on COVID-19 transmission through the Word2Vec model, but it helps me learn more about COVID-19 transmission.
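
As background on how Word2Vec learns such associations, the sketch below generates the (centre word, context word) training pairs used by the skip-gram variant of Word2Vec: words that share a context window are paired, and the model is trained to predict one from the other. This only illustrates the training signal and is not the Gensim code used in the project; whether the project used skip-gram or CBOW is not stated, and the function name, window size, and toy sentence are assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for every word within the window."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs


# Toy tokenized sentence, for illustration only.
sentence = ["vertical_transmission", "pregnant", "woman", "breast", "milk"]

pairs = skipgram_pairs(sentence, window=1)
for centre, context in pairs:
    print(centre, "->", context)
```

Words that repeatedly appear in each other's windows end up with similar vectors, which is why a query keyword returns terms from the same contexts.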

Recommendation System

I developed a cosine similarity-based recommendation system to recommend scholarly articles on COVID-19 transmission. It makes recommendations based on the similarity between the COVID-19 transmission keyword that I entered and the COVID-19 article data.
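
The ranking idea can be sketched in pure Python as follows. This is a simplified stand-in, not the project's scikit-learn code: it assumes articles are represented as TF-IDF vectors (the post does not state the vectorization used) and ranks them by cosine similarity to the query keyword. The IDF formula mirrors the smoothed default of scikit-learn's `TfidfVectorizer`; the function names, toy documents, and titles are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors (dicts) for tokenized docs, plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(query_tokens, docs, titles, top_k=3):
    """Rank article titles by cosine similarity between the query and each article."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms unseen in the corpus are ignored.
    query = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    scored = sorted(((cosine(query, v), title) for v, title in zip(vectors, titles)),
                    reverse=True)
    return [title for score, title in scored[:top_k] if score > 0]


# Toy articles and titles, for illustration only.
docs = [
    "vertical transmission mother neonate".split(),
    "fomite transmission surface disinfect".split(),
    "household transmission contact school".split(),
]
titles = ["Paper A", "Paper B", "Paper C"]

print(recommend("vertical transmission".split(), docs, titles, top_k=1))
```

Because "transmission" occurs in every toy article, its IDF weight is low, so the rarer keyword "vertical" dominates the ranking.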

Vertical Transmission

The figure below shows the most recommended article for COVID-19 vertical transmission, which shows that:

  • Vertical transmission is a topic of debate.
  • 23 out of 390 neonates reported in studies potentially acquired the infection through vertical transmission.
  • Vertical transmission was possible in utero or through breast milk.

Recommended Article on COVID-19 Vertical Transmission

Fomite Transmission

The figure below shows the most recommended article for COVID-19 fomite transmission, which indicates that:

  • There is an increase in cases of indirect transmission through shared items such as commonly touched surfaces and door handles.
  • Preventive measures such as cleaning and disinfecting surfaces are essential to lower the spread of COVID-19.

Recommended Article on COVID-19 Fomite Transmission

Household Transmission

The figure below shows the top 4 recommended articles for COVID-19 household transmission, which indicate that:

  • Household transmission rates of COVID-19 are high.
  • Household members who sleep in the same room have an increased risk of contracting COVID-19.
  • Households that can isolate have lower rates of COVID-19 transmission.
  • Monitoring school reopening and isolating cases are essential to lower COVID-19 transmission.

Recommended Articles on COVID-19 Household Transmission

Omicron Variant Transmissibility

The figure below shows the top 3 recommended articles for COVID-19 Omicron variant transmissibility, which show that:

  • Omicron is highly transmissible and more transmissible than Delta.
  • Omicron is spreading faster than any previous variant.
  • Vaccination is critical to suppress Omicron.
  • Existing vaccines are less effective against Omicron.

Recommended Articles on COVID-19 Omicron Variant Transmissibility

As shown in the figures above, reading the articles recommended by the recommendation system helps answer questions on COVID-19 transmission.

This post is licensed under CC BY 4.0 by the author.