COVID-19 SENTIMENT ANALYSIS

- Motivation -

Despite the profound challenges posed by the COVID-19 pandemic, marked by widespread job losses, stringent quarantine measures, and heartbreaking losses, humanity has demonstrated remarkable resilience and solidarity. Even amid these unprecedented circumstances, individuals persist in offering support in various forms. Leveraging Natural Language Processing (NLP) techniques and the Naïve Bayes Classifier, I embarked on a project aimed at discerning the sentiment expressed within Twitter/X posts based solely on their textual content. 

How do we classify a post as negative, neutral, or positive? The primary determinant is the sentiment conveyed within the message (hence the term Sentiment Analysis). If the content expresses anger, animosity, or unfortunate news, it is reasonable to categorize it as negative. Conversely, if the message conveys optimism, encouragement, or positive developments, it can be categorized as positive. Posts falling in between these extremes, encompassing neutral or inconsequential information, are typically classified as neutral.

- Application -

The first thing we need to do is find a dataset of tweets discussing the COVID-19 virus. Kaggle has a large assortment of datasets that fit this criterion. The particular dataset I found contains roughly 45,000 records and 6 features. The independent features are “UserName”, “ScreenName”, “Location”, “TweetAt”, and “OriginalTweet”, while the dependent feature is “Sentiment”.

Now that we have our dataset, we can begin cleaning the data so that we get a more accurate final result. First, we drop unnecessary features. Obvious candidates are “UserName” and “ScreenName”, since neither has any meaningful correlation with the sentiment. Because we plan on using Multinomial Naïve Bayes, which assumes the features are conditionally independent of one another, we only need to focus on a single feature, and the most informative one is “OriginalTweet”; therefore we will drop the “Location” and “TweetAt” features as well. Finally, we are left with one independent feature, “OriginalTweet”, and one dependent feature, “Sentiment”.
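With pandas, dropping the unneeded columns is a one-liner. The sketch below builds a couple of hypothetical rows standing in for the Kaggle dataset (the real file name and values are assumptions, not from the original post):

```python
import pandas as pd

# Hypothetical sample rows standing in for the Kaggle dataset; the real
# file would instead be loaded with something like pd.read_csv(...).
df = pd.DataFrame({
    "UserName": [3799, 3800],
    "ScreenName": [48751, 48752],
    "Location": ["London", "UK"],
    "TweetAt": ["16-03-2020", "16-03-2020"],
    "OriginalTweet": ["Stock up on supplies #covid19", "Stay safe everyone!"],
    "Sentiment": ["Neutral", "Positive"],
})

# Drop the features that carry no signal for sentiment, keeping only
# the tweet text and its label.
df = df.drop(columns=["UserName", "ScreenName", "Location", "TweetAt"])
print(df.columns.tolist())  # ['OriginalTweet', 'Sentiment']
```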

The sentiment of each tweet is labeled as Extremely Negative, Negative, Neutral, Positive, or Extremely Positive. Since we mostly care about the general sentiment of each tweet, we do not need that much specificity, so we can combine the Extremely Negative and Negative tweets into one sentiment called Negative. Similarly, we can combine the Extremely Positive and Positive tweets into one sentiment called Positive. At the same time, we can map Negative, Neutral, and Positive to 0, 1, and 2 respectively. Our dataset should now look like this:
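The label consolidation above can be sketched as a single dictionary mapping applied with pandas (the column name matches the dataset; the rest is a minimal stand-in):

```python
import pandas as pd

# The five original labels, in a toy frame standing in for the dataset.
df = pd.DataFrame({
    "Sentiment": ["Extremely Negative", "Negative", "Neutral",
                  "Positive", "Extremely Positive"],
})

# Collapse five labels into three and encode them as integers:
# Negative -> 0, Neutral -> 1, Positive -> 2.
sentiment_map = {
    "Extremely Negative": 0, "Negative": 0,
    "Neutral": 1,
    "Positive": 2, "Extremely Positive": 2,
}
df["Sentiment"] = df["Sentiment"].map(sentiment_map)
print(df["Sentiment"].tolist())  # [0, 0, 1, 2, 2]
```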

Now begins the cleaning process. We need to ensure that each tweet contains only alphanumeric characters (letters and numerals). We also need to strip out callouts/mentions (the ‘@’ handles), URLs, and hashtags (‘#’). Removing hashtags will help later when we apply the bag-of-words model. To do this, we can use a helpful library called re (regular expression operations), which lets us locate specific patterns within strings so they can be altered or removed. We will also use NumPy’s vectorize function, which applies the cleaning across the whole column without an explicit for loop, speeding up data processing. Finally, we will add a new feature called “KeyTweet” to store the cleaned tweets. Our dataset should now look like this:
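A minimal version of this cleaning step might look as follows. The exact regex patterns (and the final lowercasing) are my assumptions, not necessarily the ones the project used:

```python
import re
import numpy as np
import pandas as pd

def clean_tweet(text):
    text = re.sub(r"@\w+", " ", text)                   # remove @mentions
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"#\w+", " ", text)                   # remove hashtags
    text = re.sub(r"[^A-Za-z0-9 ]", " ", text)          # keep alphanumerics only
    # Collapse whitespace and lowercase (lowercasing is an extra assumption).
    return re.sub(r"\s+", " ", text).strip().lower()

df = pd.DataFrame({
    "OriginalTweet": ["Check @WHO advice at https://who.int #covid19 NOW!"],
})
# np.vectorize applies clean_tweet across the whole column at once.
df["KeyTweet"] = np.vectorize(clean_tweet)(df["OriginalTweet"])
print(df["KeyTweet"][0])  # "check advice at now"
```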

We will now make the tweets more impactful by removing all words containing three or fewer letters. The main purpose of this is to remove stop words within each tweet. Stop words are words that are widely used but carry very little useful information; some common ones are “a”, “for”, “at”, “in”, and “to”. Additionally, we will tokenize each tweet and stem the tokens. Stemming is the process of reducing words to their root form; for example, “sleeping” and “sleeps” both stem to “sleep”. This greatly reduces the variance of the vocabulary. We will also drop any duplicate records as well as any empty records. We can now visualize the frequency of each word within the dataset:
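The short-word filter and stemming step can be sketched in plain Python. The toy suffix-stripping rules below are a hypothetical stand-in for a real stemmer such as NLTK's PorterStemmer, which the project could use instead:

```python
def shorten_and_stem(text):
    """Drop words of three or fewer letters, then crudely stem the rest."""
    stemmed = []
    for word in text.split():
        if len(word) <= 3:          # crude stop-word filter
            continue
        # Toy stemmer: strip a few common suffixes, keeping a stem of
        # at least three letters (a real project would use PorterStemmer).
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        stemmed.append(word)
    return " ".join(stemmed)

print(shorten_and_stem("we are all sleeping well at home"))  # "sleep well home"
```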

We can also visualize the frequency of each word within a wordcloud:

Finally, we will use Scikit-Learn’s train_test_split function to split the dataset, 80% for training and 20% for testing. We will then use Scikit-Learn’s MultinomialNB class to apply the Multinomial Naïve Bayes algorithm to the data. After this, we can visualize the model’s errors with a confusion matrix, which compares the predicted labels to the true labels.
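Putting the final step together, a minimal end-to-end sketch looks like this. The toy corpus and labels are invented for illustration; the real project runs this on the cleaned “KeyTweet” column:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the cleaned tweets
# (0 = Negative, 1 = Neutral, 2 = Positive).
tweets = ["lost job terrible news", "awful panic everywhere",
          "stores open today", "normal schedule resumes",
          "great community support", "wonderful help neighbors"] * 5
labels = [0, 0, 1, 1, 2, 2] * 5

# Bag-of-words features: each tweet becomes a vector of word counts.
X = CountVectorizer().fit_transform(tweets)

# 80/20 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

model = MultinomialNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true labels, columns are predicted labels.
print(confusion_matrix(y_test, y_pred))
```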

Based on this data, we can conclude that:

Multinomial Naïve Bayes achieved an accuracy of 66.8%.
