Overview
During the height of the COVID-19 pandemic in the summer of 2021, I analyzed sentiment toward COVID-19 vaccines in Canada by applying machine learning and natural language processing (NLP) techniques to COVID-19 tweets. This post documents that project.
Summary
- I developed classification models that classify COVID-19 vaccine-related tweets into three sentiment classes, positive, neutral, and negative, using scikit-learn. The models classified the sentiments of the tweets with 69% accuracy for Canada and 84% accuracy for the world.
- Using NLTK, I applied NLP techniques such as lemmatization to extract vaccine-related tweets.
- I extracted and analyzed n-grams related to vaccines from vaccine-related tweets.
- Both the Canadian and worldwide data show more positive and neutral sentiment than negative sentiment toward COVID-19 vaccines.
Problem Description
More and more people have been vaccinated and protected from COVID-19. Although normalcy is gradually returning, some hesitancy about COVID-19 vaccines remains. Understanding both the reluctance toward and the acceptance of COVID-19 vaccines is key to preventing and mitigating a resurgence of COVID-19 during the post-pandemic recovery.
Project Goal
During the pandemic, people have posted tweets about COVID-19 vaccines, including positive and negative messages.
Through machine learning and NLP techniques, I aim to:
- Predict whether the tweets are positive, neutral, or negative
- Understand the public opinion and sentiment on COVID-19 vaccines
- Provide insights on the hesitancy and acceptance of COVID-19 vaccines
Code
My code can be found here:
Dataset
The following is the dataset that I used:
Dataset | COVID-19 Geo-Tagged Tweets Dataset at IEEE DataPort |
Author | Rabindra Lamsal |
Data | COVID-19-related tweet IDs without tweet messages (I hydrated the IDs to get the complete data.) |
Dataset Overview | 402,970 tweets; 478 CSV files; global coverage; English; March 20, 2020 to July 11, 2021 |
Hydrating tweet IDs is the process of getting complete details of tweets using the Twitter API.
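Hydration is usually done in batches, because tweet lookup endpoints cap the number of IDs accepted per request. The snippet below is a minimal sketch of the batching step only, not the project code. The batch size of 100 is an assumption about the endpoint used, and `fetch_batch` is a hypothetical callable standing in for the real Twitter API request.

```python
def batched(ids, size=100):
    """Yield successive lists of at most `size` tweet IDs."""
    batch = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, partially filled batch
        yield batch


def hydrate(ids, fetch_batch, size=100):
    """Hydrate tweet IDs by calling `fetch_batch` once per batch.

    `fetch_batch` is a hypothetical callable that wraps the real API
    request and returns one dict per tweet that could be retrieved.
    """
    tweets = []
    for batch in batched(ids, size):
        tweets.extend(fetch_batch(batch))
    return tweets


def fake_fetch(batch):
    """Stand-in for the API call (assumption): echoes IDs with dummy text."""
    return [{"id": tweet_id, "text": f"tweet {tweet_id}"} for tweet_id in batch]


hydrated = hydrate([str(n) for n in range(250)], fake_fetch)
```

With a real client, `fake_fetch` would be replaced by a function that sends one lookup request per batch and drops IDs for deleted or protected tweets, which is why a hydrated dataset can end up smaller than the list of IDs.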
Machine Learning Workflow
The following describes the overview of my machine learning workflow for the project:
Machine Learning Workflow
Exploratory Data Analysis
I plotted a word cloud to see the most prominent words in the processed data containing vaccine-related tweets. The following word cloud shows that words such as COVID-19, vaccine, dose, pandemic, and AstraZeneca appear prominently in vaccine-related tweets.
Word Cloud for COVID-19 Vaccine-Related Tweets in Canada
I visualized n-grams to see words and phrases associated with COVID-19 vaccine-related tweets.
Canada
Worldwide
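For readers unfamiliar with n-grams, the counts behind charts like these can be sketched in a few lines of plain Python. This is an illustrative sketch, not the project code: the sample tweets are made up, and the simple regular-expression tokenizer is an assumption standing in for the NLTK-based processing.

```python
import re
from collections import Counter

# Toy examples (made up for illustration, not taken from the dataset).
tweets = [
    "Got my first dose of the vaccine today",
    "Booked my second dose, so happy",
    "First dose done, second dose next month",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts, n, k=5):
    """Count n-grams within each text separately and return the k most common."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(tokenize(text), n))
    return counts.most_common(k)


top_bigrams = top_ngrams(tweets, n=2, k=2)
```

On this toy input, the two most frequent bigrams are `("first", "dose")` and `("second", "dose")`, each appearing twice. Counting within each tweet separately avoids creating n-grams that span two unrelated tweets.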
My summary of the analysis is as follows:
- Positive Sentiments
  - People feel optimistic about their first and second doses of the vaccine.
  - People feel happy and safe about the vaccine.
  - People show excitement about making appointments for vaccines.
  - People are optimistic about the availability of vaccines.
  - People accept COVID-19 vaccines and are not hesitant about getting their doses.
- Negative Sentiments
  - A small number of people are pessimistic about COVID-19 vaccines.
  - A small number of people are concerned about side effects.
  - A small number of people are pessimistic about herd immunity, but this is not a negative sentiment toward vaccines.
  - A small number of people feel COVID-19 mutations could render vaccines ineffective.
Modelling
I divided the processed data into a train set and a test set. I trained and validated my models on the train set using k-fold cross-validation. Then I evaluated the models on the test set.
Model Training
Setting | Canada | Worldwide |
---|---|---|
Processed Data Size | 1,108 | 23,693 |
Train-Test Split | 70%-30% | 70%-30% |
X (Features) | Vectorized Tweets | Vectorized Tweets |
y (Target Values) | Sentiment Labels (Positive, Neutral, Negative) | Sentiment Labels (Positive, Neutral, Negative) |
Models | Logistic Regression, Naive Bayes, Linear SVC (Support Vector Classifier), Decision Tree | Linear SVC |
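The setup in the table can be sketched with scikit-learn as follows. This is a minimal illustration rather than the project code. The post does not say which vectorizer was used, so TF-IDF is an assumption, as are the stratified split and the `random_state` value. The labelled tweets are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labelled tweets (made up for illustration, not from the dataset).
texts = [
    "so happy to get my first dose",
    "feeling safe after my second dose",
    "excited to book my vaccine appointment",
    "grateful the vaccine is available",
    "vaccine clinic opens at nine tomorrow",
    "second doses are offered at the pharmacy",
    "the province received a new vaccine shipment",
    "appointments can be booked online",
    "worried about vaccine side effects",
    "the side effects were awful",
    "i do not trust this vaccine",
    "scared the vaccine will not work",
]
labels = ["positive"] * 4 + ["neutral"] * 4 + ["negative"] * 4

# 70%-30% split; stratifying keeps the class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=42
)

# Vectorize the tweets (X) and fit a Linear SVC on the sentiment labels (y).
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary is learned from the train set only, so no information from the test set leaks into training.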
Model Testing
Cross-Validation
I performed 10-fold cross-validation on the classification models and computed the mean accuracy, precision, and recall scores for Canada and the world, as shown below:
Canada
Model | Mean Accuracy | Mean Precision | Mean Recall |
---|---|---|---|
Logistic Regression | 0.7084 | 0.7084 | 0.6803 |
Naive Bayes | 0.6217 | 0.6217 | 0.6054 |
Linear SVC | 0.7149 | 0.7149 | 0.7129 |
Decision Tree | 0.6968 | 0.7006 | 0.7001 |
Worldwide
Model | Mean Accuracy | Mean Precision | Mean Recall |
---|---|---|---|
Linear SVC | 0.8473 | 0.8473 | 0.8466 |
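Mean scores like those in the tables above can be obtained with scikit-learn's `cross_validate`. The snippet is a hedged sketch, not the project code: the toy data is too small for 10 folds, so it uses 4 (setting `n_splits=10` would mirror the post). Macro averaging for precision and recall is an assumption, since the post does not state which averaging was used, and TF-IDF is likewise assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labelled tweets (made up for illustration, not from the dataset).
texts = [
    "so happy to get my first dose",
    "feeling safe after my second dose",
    "excited to book my vaccine appointment",
    "grateful the vaccine is available",
    "vaccine clinic opens at nine tomorrow",
    "second doses are offered at the pharmacy",
    "the province received a new vaccine shipment",
    "appointments can be booked online",
    "worried about vaccine side effects",
    "the side effects were awful",
    "i do not trust this vaccine",
    "scared the vaccine will not work",
]
labels = ["positive"] * 4 + ["neutral"] * 4 + ["negative"] * 4

model = make_pipeline(TfidfVectorizer(), LinearSVC())
metrics = ["accuracy", "precision_macro", "recall_macro"]

# Each fold is held out once for validation while the rest is used for training.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_validate(model, texts, labels, cv=cv, scoring=metrics)

# Average each metric over the folds.
mean_scores = {name: scores[f"test_{name}"].mean() for name in metrics}
```

On data this small, some folds may trigger an undefined-metric warning when a class is never predicted; scikit-learn then counts that class's precision as 0.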
Testing Model on Test Set
I evaluated my Linear SVC models on the test set. The models predict the sentiments of vaccine-related tweets with 69% accuracy for Canada and 84% accuracy for the world, as shown below:
Canada
Class | Precision | Recall | F1 Score | # of Samples |
---|---|---|---|---|
Negative | 0.23 | 0.10 | 0.14 | 30 |
Neutral | 0.60 | 0.77 | 0.67 | 118 |
Positive | 0.81 | 0.74 | 0.77 | 185 |
Accuracy | | | 0.69 | 333 |
Macro AVG | 0.55 | 0.54 | 0.53 | 333 |
Weighted AVG | 0.68 | 0.69 | 0.68 | 333 |
Confusion Matrix (Canada)
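To make the two average rows in the Canada report concrete, the following sketch recomputes them from the per-class rows. The macro average is the unweighted mean over the three classes, while the weighted average weights each class by its number of samples. The inputs are the rounded values from the table, so the results agree with the report only up to rounding.

```python
# Per-class scores copied from the Canada table: (precision, recall, F1, samples).
report = {
    "Negative": (0.23, 0.10, 0.14, 30),
    "Neutral": (0.60, 0.77, 0.67, 118),
    "Positive": (0.81, 0.74, 0.77, 185),
}

total = sum(row[3] for row in report.values())  # 333 test samples


def macro_avg(index):
    """Unweighted mean of one metric across the classes."""
    return sum(row[index] for row in report.values()) / len(report)


def weighted_avg(index):
    """Mean of one metric weighted by each class's number of samples."""
    return sum(row[index] * row[3] for row in report.values()) / total


macro = [round(macro_avg(i), 2) for i in range(3)]        # [0.55, 0.54, 0.53]
weighted = [round(weighted_avg(i), 2) for i in range(3)]  # [0.68, 0.69, 0.68]
```

The gap between the two rows reflects class imbalance: the Negative class has only 30 samples, so its weak scores pull the macro average down much more than the weighted one. The weighted average of recall is the same quantity as overall accuracy, which is why both read 0.69.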
Worldwide
Class | Precision | Recall | F1 Score | # of Samples |
---|---|---|---|---|
Negative | 0.77 | 0.67 | 0.71 | 995 |
Neutral | 0.81 | 0.88 | 0.84 | 2474 |
Positive | 0.89 | 0.87 | 0.88 | 3639 |
Accuracy | | | 0.84 | 7108 |
Macro AVG | 0.82 | 0.81 | 0.81 | 7108 |
Weighted AVG | 0.84 | 0.84 | 0.84 | 7108 |
Confusion Matrix (Worldwide)
Results and Insights Gained
My Linear SVC models for Canada and the world classify COVID-19 vaccine-related tweets into three sentiment classes with 69% and 84% accuracy, respectively. The model for Canada has low precision (23%) and recall (10%) for the Negative sentiment class. This is likely due to the small volume of data left after data processing (1,108 tweets), which put only 30 negative tweets in the test set.
Based on my analysis, both the Canadian and worldwide data show more positive and neutral sentiment than negative sentiment toward COVID-19 vaccines. Most people in the data are not skeptical about COVID-19 vaccines and are not hesitant about getting their doses. A few people have expressed negativity about COVID-19 vaccines and their side effects.