The Sentiment Crystal Ball: Harnessing Customer Tweets for Predictive Data Analytics
A case study project that categorizes ‘future’ tweets into positive and negative sentiments
Customer sentiment feedback is crucial for businesses: it provides insight into customer satisfaction, drives product and service improvements, enhances competitiveness, builds brand reputation, guides innovation, and supports customer-centric decision making. With predictive data analytics applied to customers’ social media posts and product reviews, companies can automatically predict the underlying sentiment customers experience when they engage with a product or service.
Context
This article is written as a log of work for a data analytics project that uses Natural Language Processing (NLP) to build a model that categorizes tweets into positive and negative sentiments. A Naive Bayes classification model was developed, and its performance was verified at the end of the project. The data used for this project is available at: https://www.kaggle.com/datasets/arkhoshghalb/twitter-sentiment-analysis-hatred-speech/download?datasetVersionNumber=1
Project objective: Create a system for categorizing ‘future’ tweets into positive and negative sentiments based on a set of pre-classified tweets.
This seemingly complex project was made achievable by breaking it down into a few simpler tasks, which are elucidated below:
Importing libraries and loading datasets
For this project, seven libraries were imported and employed: pandas for efficient and flexible handling of structured data; numpy for scientific computing and numerical operations; seaborn combined with matplotlib for creating informative, visually appealing statistical graphs; and the string and nltk libraries for a comprehensive set of text-processing resources. Finally, the wordcloud library was used to display words from a text corpus, with the size of each word representing its importance.
The dataset, a CSV file downloaded from Kaggle, was then loaded into the Jupyter notebook.
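The loading step can be sketched as follows. A tiny in-memory CSV stands in for the Kaggle file here so the snippet runs on its own; in the notebook you would pass the path of the downloaded file (the file name below is an assumption, not taken from the article) to pandas.read_csv.

```python
import io

import pandas as pd

# Tiny stand-in for the Kaggle file. In the real project this would be
# something like pd.read_csv("train.csv") -- the file name is an assumption.
csv_text = """id,label,tweet
1,0,what a lovely day
2,1,some hateful remark
3,0,enjoying the sunshine
"""

tweets_df = pd.read_csv(io.StringIO(csv_text))

# Quick sanity checks on the loaded data
print(tweets_df.shape)               # rows x columns
print(tweets_df.columns.tolist())    # ['id', 'label', 'tweet']
```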
Exploring the loaded datasets
The dataset is a sample of tweets that have been labelled and categorized into positive and negative sentiments: tweets with derogatory remarks are labelled ‘1’ and the rest are labelled ‘0’.
Since the ‘id’ column carries little relevance for the task, the first step in exploring the dataset was to drop it.
The ‘label’ column was visualized with the imported seaborn library; the plot showed over 29,000 tweets labelled ‘0’ (positive) and about 3,000 labelled ‘1’ (negative).
The tweets were then separated into positive and negative groups based on the binary ‘0’/‘1’ values of the ‘label’ column, and the word frequencies for each group were visualized with the wordcloud library.
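The split-and-count step can be illustrated on a toy frame. The snippet below uses collections.Counter to compute the word frequencies that a word cloud renders as font sizes (wordcloud itself is omitted here so the sketch stays dependency-light; the column names mirror the dataset).

```python
from collections import Counter

import pandas as pd

# Toy frame mirroring the dataset's 'label' / 'tweet' columns.
df = pd.DataFrame({
    "label": [0, 0, 1],
    "tweet": ["love this product", "love the sunshine", "terrible hateful remark"],
})

# Split on the binary label, as done in the project.
positive = df[df["label"] == 0]
negative = df[df["label"] == 1]

# Word frequencies -- the quantity a word cloud visualizes by word size.
pos_counts = Counter(" ".join(positive["tweet"]).split())
print(pos_counts.most_common(3))
```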
Cleaning the data
Punctuation marks such as commas, periods and exclamation marks typically carry little meaning in the context of natural language processing; removing them reduces noise, avoids inconsistencies and keeps the focus on more meaningful content. Likewise, removing stop words (e.g., “the”, “and”, “is”), which carry little information for many natural language processing tasks, helps reduce the dimensionality of the data and improves the efficiency of subsequent processing steps. Finally, count vectorization, a method of representing text data numerically, ensures the machine learning model can work with the text effectively.
For this project, a function was created to remove punctuation marks and stop words and to perform count vectorization. The ‘tweet’ column of the dataset served as input to this function, and a cleaner version of the tweets was obtained.
Developing, training and testing a Naive Bayes classification model
Naive Bayes is a simple yet powerful probabilistic classification algorithm known for its efficiency and effectiveness with high-dimensional data. It relies on Bayes’ theorem, which states that the probability of an event A given event B can be calculated from the conditional probability of B given A along with the prior probabilities of A and B: P(A|B) = P(B|A)·P(A) / P(B).
To train a Naive Bayes model, you need labeled training data. The algorithm calculates the prior probability of each class label by counting the occurrences of that label in the training set, and it estimates the conditional probability of each feature given each class label. Once the model is trained, it can be used to make predictions on new, unseen instances.
Specifically, scikit-learn was used to train the Naive Bayes classification model for this project.
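A compact sketch of the training step, using scikit-learn's MultinomialNB (the Naive Bayes variant suited to word-count features). The corpus below is a toy stand-in for the cleaned tweets, with labels following the dataset's convention (0 = positive, 1 = negative).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the cleaned tweets (0 = positive, 1 = negative).
tweets = [
    "love this so much", "great day today", "happy and grateful",
    "awful hateful remark", "truly terrible people", "nasty and hateful",
] * 5
labels = [0, 0, 0, 1, 1, 1] * 5

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

# MultinomialNB estimates class priors and per-word conditional
# probabilities from the counts, as described above.
model = MultinomialNB()
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
print(model.predict(vectorizer.transform(["hateful remark"])))
```

Note that the same fitted vectorizer must be reused (via transform, not fit_transform) when classifying new tweets, so that the word-to-column mapping stays consistent.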
Assessing the performance of the trained Naive Bayes classification model
The evaluation metrics provide insight into the performance of the Naive Bayes classifier on this binary classification task. The model has high precision for class 0 but lower precision and recall for class 1, indicating a potential challenge in correctly identifying instances of class 1 (negative tweets).
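The kind of pattern described above can be reproduced on illustrative numbers (these are made-up predictions, not the project's actual results): with an imbalanced label distribution, class 0 scores well while class 1 suffers on both precision and recall.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative labels for an imbalanced task: many class-0 tweets,
# few class-1 tweets, and a model that misses most of class 1.
y_true = [0] * 9 + [1] * 3
y_pred = [0] * 8 + [1] + [1, 0, 0]  # one false positive, two false negatives

# Class 1 here has precision 1/2 and recall 1/3,
# while class 0 stays high on both.
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))
```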
Recommendations
The performance of the model can still be improved dramatically. The model was built on an imbalanced dataset, so upsampling the negative class, downsampling the positive class, or applying additional feature engineering techniques might yield a better Naive Bayes classifier.
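One way to act on the upsampling suggestion is sklearn.utils.resample, sketched below on a toy imbalanced frame (the numbers are illustrative, not the dataset's): the minority class is sampled with replacement until it matches the majority class.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 6 positive (0) vs 2 negative (1) tweets.
df = pd.DataFrame({
    "tweet": ["good"] * 6 + ["bad"] * 2,
    "label": [0] * 6 + [1] * 2,
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Upsample the minority class with replacement to match the majority size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced["label"].value_counts())
```

Rebalancing should be applied to the training split only, so that the test set still reflects the real class distribution.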
Remarks
Thank you for taking the time to read this article. The complete code for this project is available at: https://github.com/thebolujames/Twitter_sentiment_analysis/blob/main/Twitter%20Sentiment%20Analysis.ipynb