Artificial Intelligence

Multi-Class Sentiment Review Classifier

Case Study Details

ID: sentiment-analysis

Sentiment Analysis of Product Reviews

Project Report

Date: May 20, 2025

1. Introduction

This report details the development and evaluation of a machine learning model designed to classify the sentiment of customer reviews for a fictional product. The primary objective was to categorize reviews into positive, negative, or neutral sentiments. This task is crucial for businesses to understand customer feedback, identify areas for product improvement, and gauge overall customer satisfaction. The project encompassed text preprocessing, feature extraction, model training using various classifiers, and a thorough performance evaluation.

2. Data

2.1. Data Source and Generation

The dataset used for this project consisted of 1000 mock customer reviews. These reviews were synthetically generated using a Python script. The script employed predefined templates for positive, negative, and neutral sentiments, into which specific product-related keywords (e.g., "battery life," "design," "customer service") were randomly inserted. Each generated review included a review_id, the review_text, and a rating (an integer from 1 to 5).
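
A minimal sketch of how such a template-based generator might look. The templates, the keyword list beyond the three quoted above, the choice to draw the rating from a range matching the template's sentiment, and the function name generate_reviews are all hypothetical, since the report does not reproduce the actual script. Only the three output fields and the 1-to-5 rating range come from the description above.

```python
import random

# Hypothetical templates; the real script's wording is not given in the report.
TEMPLATES = {
    "positive": ["I love the {kw}.", "The {kw} exceeded my expectations."],
    "neutral": ["The {kw} is acceptable.", "The {kw} is about average."],
    "negative": ["The {kw} is disappointing.", "I returned it because of the {kw}."],
}
KEYWORDS = ["battery life", "design", "customer service"]
# Assumption: the rating is drawn from a range that matches the template's sentiment.
RATINGS = {"positive": [4, 5], "neutral": [3], "negative": [1, 2]}


def generate_reviews(n, seed=42):
    """Return n mock reviews as dicts with review_id, review_text, and rating."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same data
    reviews = []
    for review_id in range(1, n + 1):
        sentiment = rng.choice(list(TEMPLATES))
        template = rng.choice(TEMPLATES[sentiment])
        reviews.append({
            "review_id": review_id,
            "review_text": template.format(kw=rng.choice(KEYWORDS)),
            "rating": rng.choice(RATINGS[sentiment]),
        })
    return reviews


reviews = generate_reviews(1000)
print(reviews[0])
```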

2.2. Sentiment Labeling

A 'sentiment' label was derived from the 'rating' column as follows:

Negative: Ratings of 1 or 2 stars.

Neutral: Ratings of 3 stars.

Positive: Ratings of 4 or 5 stars.
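
The mapping above can be sketched as a small function. The name rating_to_sentiment and the lowercase label strings are assumptions made for illustration; the thresholds are the ones listed above.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a sentiment label."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"


labels = [rating_to_sentiment(r) for r in (1, 2, 3, 4, 5)]
print(labels)
```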

2.3. Initial Data Overview

Upon generation and loading, the dataset consisted of 1000 entries with no missing values. In one sample run, the sentiment distribution was Positive: 390, Negative: 382, Neutral: 228. The train-test split was stratified to preserve this distribution in both subsets.

3. Methodology

The sentiment analysis pipeline involved several key stages:

3.1. Text Preprocessing

To prepare the text data for machine learning, the following preprocessing steps were applied to each review in the review_text column, creating a clean_text column:

Lowercasing: All text was converted to lowercase to ensure uniformity.

Punctuation and Special Character Removal: Non-alphabetic characters (except spaces) were removed using regular expressions (re.sub(r'[^a-z\s]', '', text)).

Tokenization: Reviews were broken down into individual words (tokens) using nltk.word_tokenize.

Stop-Word Removal: Common English stop words (e.g., "the," "is," "and") were removed using the list provided by nltk.corpus.stopwords.

Lemmatization with Part-of-Speech (POS) Tagging: Tokens were reduced to their base or dictionary form (lemma). NLTK's WordNetLemmatizer was used, aided by POS tagging (nltk.pos_tag) to provide context (e.g., noun, verb, adjective) for more accurate lemmatization. A helper function (get_wordnet_pos) was used to map NLTK's POS tags to a format compatible with the lemmatizer.
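
The first steps above, plus the POS-mapping helper, can be sketched with the standard library alone. This is a simplified stand-in, not the project's code: whitespace splitting replaces nltk.word_tokenize, a tiny hand-written set replaces nltk.corpus.stopwords, and lemmatization itself is left out. The body of get_wordnet_pos is an assumption about how such a helper is commonly written; its single-letter return values mirror the ADJ, VERB, ADV, and NOUN constants in nltk.corpus.wordnet.

```python
import re

# Tiny illustrative stop-word set; the report uses nltk.corpus.stopwords instead.
STOP_WORDS = {"the", "is", "and", "a", "an", "it", "was", "to", "of"}


def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (as produced by nltk.pos_tag) to a WordNet POS code."""
    if treebank_tag.startswith("J"):
        return "a"  # adjective
    if treebank_tag.startswith("V"):
        return "v"  # verb
    if treebank_tag.startswith("R"):
        return "r"  # adverb
    return "n"  # default to noun, which is also the lemmatizer's default


def basic_clean(text):
    """Lowercase, strip non-alphabetic characters, split, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # same pattern as in the report
    tokens = text.split()  # stand-in for nltk.word_tokenize
    return [token for token in tokens if token not in STOP_WORDS]


tokens = basic_clean("The battery life is GREAT!!!")
print(tokens)
```

Lowercasing has to come before the regular expression, because the pattern keeps only the characters a to z and would otherwise delete every uppercase letter.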

3.2. Feature Extraction

The preprocessed text data (clean_text) was converted into numerical features using the Term Frequency-Inverse Document Frequency (TF-IDF) method. This was implemented using TfidfVectorizer from sklearn.feature_extraction.text. Key parameters for the vectorizer included:

ngram_range=(1, 2): To consider both individual words (unigrams) and pairs of adjacent words (bigrams) as features.

max_features=5000: To limit the vocabulary size to the 5000 most frequent n-grams.

min_df=2: To ignore terms that appeared in fewer than 2 documents.

max_df=0.95: To ignore terms that appeared in more than 95% of the documents (often too common to be discriminative).

3.3. Data Splitting

The dataset (consisting of clean_text as features and sentiment as the target) was split into training (80%) and testing (20%) sets using train_test_split from sklearn.model_selection. Stratification (stratify=y) was employed to ensure that the proportion of sentiment classes was maintained in both the training and testing sets. The split used random_state=42 for reproducibility.
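
A short check of what stratification does, using labels that reproduce the class counts from the sample run in Section 2.3. The placeholder texts are an assumption; only the labels matter for this check.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Labels matching the sample-run distribution; texts are placeholders.
y = ["positive"] * 390 + ["negative"] * 382 + ["neutral"] * 228
X = [f"review {i}" for i in range(len(y))]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

total_counts = Counter(y)
train_counts = Counter(y_train)
test_counts = Counter(y_test)
for label in sorted(total_counts):
    share = test_counts[label] / len(y_test)
    print(f"{label}: train={train_counts[label]}, test={test_counts[label]} ({share:.1%} of test)")
```

Each class should make up roughly the same share of the test set as of the full dataset, which is the property that keeps per-class metrics comparable between the two subsets.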

3.4. Modeling

Three common classification algorithms were selected for this task:

Naive Bayes: MultinomialNB()

Logistic Regression: LogisticRegression(max_iter=1000, solver='liblinear', random_state=42)

Linear Support Vector Machine (SVM): LinearSVC(random_state=42, dual='auto', max_iter=2000)

To ensure correct data handling and prevent data leakage (where information from the test set inadvertently influences the training process), each classifier was integrated with the TfidfVectorizer into an sklearn.pipeline.Pipeline. This pipeline handles the fitting of the vectorizer and the training of the classifier exclusively on the training data, and then applies the learned transformations and model to the test data.
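
A minimal sketch of this arrangement on invented template-style data. The templates and keywords are hypothetical. The solver and dual arguments listed above are omitted here, an assumption made only to keep the sketch portable across scikit-learn versions; the remaining settings follow the report.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical cleaned reviews built from templates (30 in total, 10 per class).
TEMPLATES = {
    "positive": ["love the {kw} works great", "{kw} excellent and reliable"],
    "negative": ["hate the {kw} truly terrible", "{kw} awful and broke quickly"],
    "neutral": ["{kw} okay nothing special", "{kw} average for the price"],
}
KEYWORDS = ["battery life", "design", "customer service", "screen", "packaging"]

texts, labels = [], []
for sentiment, templates in TEMPLATES.items():
    for template in templates:
        for kw in KEYWORDS:
            texts.append(template.format(kw=kw))
            labels.append(sentiment)

# Split the raw text first, so the vectorizer never sees the test reviews.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Linear SVM": LinearSVC(random_state=42, max_iter=2000),
}

predictions = {}
for name, clf in classifiers.items():
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=5000,
                                  min_df=2, max_df=0.95)),
        ("clf", clf),
    ])
    pipeline.fit(X_train, y_train)  # fits TF-IDF and classifier on training data only
    predictions[name] = pipeline.predict(X_test)  # reuses the fitted vocabulary
    print(name, "test accuracy:", pipeline.score(X_test, y_test))
```

Because fit is called on the pipeline rather than on the vectorizer and classifier separately, the vocabulary and IDF weights are learned from X_train only, and predict applies them unchanged to X_test.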

4. Results and Discussion

4.1. Evaluation Metrics

The performance of each model was evaluated on the test set using the following standard metrics:

Accuracy: The proportion of correctly classified reviews.

Precision: For each class, the proportion of reviews predicted as that class that actually belong to it.

Recall (Sensitivity): For each class, the proportion of reviews that belong to that class and were correctly identified as such.

F1-Score: The harmonic mean of Precision and Recall.

Confusion Matrix: A table showing the counts of true vs. predicted classifications for each class.
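
The metrics above can be computed from a confusion matrix in a few lines. The sketch below is a plain-Python illustration with invented labels and predictions, not the evaluation code used in the project; the function name per_class_metrics is hypothetical.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred, labels):
    """Return accuracy, a confusion matrix (rows = true, columns = predicted),
    and per-class precision, recall, and F1."""
    pair_counts = Counter(zip(y_true, y_pred))
    matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

    report = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted_as_label = sum(row[i] for row in matrix)  # column total
        actually_label = sum(matrix[i])                     # row total
        precision = true_positives / predicted_as_label if predicted_as_label else 0.0
        recall = true_positives / actually_label if actually_label else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)
    return accuracy, matrix, report


# Invented example with two mistakes out of six predictions.
y_true = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
y_pred = ["positive", "negative", "negative", "negative", "neutral", "positive"]
labels = ["negative", "neutral", "positive"]

accuracy, matrix, report = per_class_metrics(y_true, y_pred, labels)
print(accuracy)
print(matrix)
print(report["negative"])
```

In this example the negative class has perfect recall (both negative reviews were found) but a precision of 2/3, because one positive review was also labeled negative. A perfect classifier would instead produce a matrix with non-zero counts only on the diagonal.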

4.2. Model Performance

All three models (Naive Bayes, Logistic Regression, and Linear SVM) achieved perfect scores across all metrics on the test set:

Accuracy: 1.0000

Precision (per class and averaged): 1.00

Recall (per class and averaged): 1.00

F1-Score (per class and averaged): 1.00

The confusion matrices for each model showed perfect diagonal alignment, indicating no misclassifications.

4.3. Discussion of Perfect Scores

Achieving perfect evaluation scores is highly unusual in real-world text classification tasks. Initial investigations considered potential data leakage in the feature engineering pipeline. The methodology was subsequently revised to correctly implement train_test_split before any feature transformation and to use sklearn.pipeline.Pipeline to ensure the TfidfVectorizer was trained only on the training data.

Despite these methodological corrections, the perfect scores persisted. This strongly suggests that the primary reason for this outcome is the inherent nature of the synthetically generated mock data. The template-based approach, even with keyword randomization, likely produces reviews where:

Linguistic patterns, specific keywords, and n-grams for each sentiment category (positive, negative, neutral) are highly distinct.

There is minimal ambiguity, subtlety, or overlapping vocabulary that typically characterizes real-world human language.

As a result, the TF-IDF features generated from this data likely provide very clear and unambiguous signals for each class, allowing the machine learning models to learn a perfect separation boundary. While the implemented pipeline is methodologically sound, the simplicity and high separability of the mock data make the classification task trivial for the chosen algorithms. This outcome underscores the significant impact of data characteristics on model performance and highlights the difference between performance on synthetic datasets versus more complex, noisy real-world data.
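
One way to probe this explanation is to measure how much vocabulary the classes share. The sketch below is a hypothetical diagnostic that was not part of the project; the function names and the three toy reviews are invented for illustration.

```python
from itertools import combinations


def class_vocabularies(texts, labels):
    """Collect the set of whitespace-separated tokens seen in each class."""
    vocab = {}
    for text, label in zip(texts, labels):
        vocab.setdefault(label, set()).update(text.lower().split())
    return vocab


def jaccard(a, b):
    """Share of tokens common to both sets (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def exclusive_tokens(vocab):
    """Tokens that occur in exactly one class."""
    result = {}
    for label, words in vocab.items():
        others = set().union(*(w for other, w in vocab.items() if other != label))
        result[label] = words - others
    return result


texts = ["great battery", "terrible battery", "okay battery"]
labels = ["positive", "negative", "neutral"]

vocab = class_vocabularies(texts, labels)
for a, b in combinations(sorted(vocab), 2):
    print(a, b, round(jaccard(vocab[a], vocab[b]), 2))
print(exclusive_tokens(vocab))
```

If every class has tokens that never appear in the other classes, as "great", "terrible", and "okay" do here, a linear model over TF-IDF features can separate the classes using those tokens alone.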

5. Challenges

The project presented a few key challenges:

Initial Risk of Data Leakage: Ensuring that the feature extraction process (TF-IDF vectorization) did not learn from the test set was critical. This was addressed by correctly sequencing the train_test_split and by implementing sklearn.pipeline.Pipeline.

Interpreting Perfect Evaluation Scores: The unexpected perfect scores required careful consideration. It was important to differentiate between a methodological error (which was addressed) and the characteristics of the dataset itself.

Limitations of Synthetic Data: The template-based mock data, while useful for setting up the pipeline, does not fully represent the complexity, nuance, and noise of real-world customer reviews. This limits the generalizability of the observed perfect performance to real-world scenarios.

6. Potential Improvements

While the models performed perfectly on the current dataset, several avenues exist for extending and improving the project, particularly for application to real-world data:

Data Enhancement:

Advanced Mock Data: Employ more sophisticated mock data generation techniques that introduce greater linguistic variety, ambiguity, and overlap between sentiment classes.

Real-World Data: Utilize publicly available real-world product review datasets (e.g., from Amazon, Yelp) to provide a more realistic benchmark and training environment.

Advanced Feature Extraction:

Explore word embeddings (e.g., Word2Vec, GloVe, FastText) or context-aware embeddings from pre-trained models such as BERT to capture semantic relationships between words, which can be beneficial for more nuanced text.

Advanced Models:

For more complex datasets, experiment with deep learning models such as Recurrent Neural Networks (LSTMs, GRUs) or Transformer-based models (e.g., BERT, RoBERTa), which often achieve state-of-the-art performance on NLP tasks.

Hyperparameter Tuning:

Implement systematic hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV) for the chosen models and TF-IDF vectorizer, especially when working with data where performance is not already at its ceiling.

Error Analysis (with Real Data):

If working with real data that yields imperfect scores, conduct a thorough error analysis by examining misclassified reviews to understand model weaknesses and identify areas for improvement in preprocessing or feature engineering.
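
As an illustration of the hyperparameter tuning item above, the sketch below runs GridSearchCV over a TF-IDF plus Naive Bayes pipeline. The toy data, the parameter grid, the 3-fold setting, and the macro-F1 scoring choice are all assumptions made for the example, not settings used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical cleaned reviews built from templates (30 in total, 10 per class).
TEMPLATES = {
    "positive": ["love the {kw} works great", "{kw} excellent and reliable"],
    "negative": ["hate the {kw} truly terrible", "{kw} awful and broke quickly"],
    "neutral": ["{kw} okay nothing special", "{kw} average for the price"],
}
KEYWORDS = ["battery life", "design", "customer service", "screen", "packaging"]

texts, labels = [], []
for sentiment, templates in TEMPLATES.items():
    for template in templates:
        for kw in KEYWORDS:
            texts.append(template.format(kw=kw))
            labels.append(sentiment)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=2, max_df=0.95)),
    ("clf", MultinomialNB()),
])

# Step names plus a double underscore address parameters inside the pipeline.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Searching over the whole pipeline, rather than over the classifier alone, means the vectorizer is refit inside each cross-validation fold, so the tuning step stays free of the leakage discussed in Section 3.4.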

7. Conclusion

This project successfully demonstrated the development and evaluation of a sentiment analysis pipeline. Key steps including text preprocessing, TF-IDF feature extraction, and training of Naive Bayes, Logistic Regression, and Linear SVM models were implemented. A crucial aspect was the correct application of these steps within an sklearn.pipeline.Pipeline to prevent data leakage and ensure robust evaluation.

The models achieved perfect classification accuracy on the synthetically generated mock dataset. This outcome, while demonstrating a correctly implemented pipeline, is primarily attributed to the high separability and distinct linguistic patterns inherent in the template-based mock data. This highlights the significant influence of data characteristics on model performance and underscores the difference between synthetic and real-world text analysis.

The learning objectives, including understanding text preprocessing techniques, feature extraction methods, model training, evaluation metrics, and the importance of sound experimental methodology, were successfully met. The project provides a solid foundational understanding for tackling more complex sentiment analysis tasks on real-world datasets, where the nuances of human language present greater challenges and opportunities for model refinement.

Project Summary

Brief Description

Built text preprocessing and classification models to automatically categorize product reviews into positive, neutral, or negative classes.

Methodology Summary

Preprocessed 1,000 customer reviews using tokenization, stop-word removal, and POS-tagged lemmatization (NLTK). Extracted unigram/bigram features using TF-IDF. Compared Naive Bayes, SVM, and Logistic Regression.

Results & Performance

All three classifiers achieved 100% test accuracy. The perfect scores persisted after data leakage was ruled out and are attributed to the highly distinct linguistic templates in the synthetic dataset.

Tech Stack

NLP, TF-IDF Vectorization, NLTK Lemmatization, Scikit-learn Pipelines

Author: Muhammad Ahsan

Date: 2025 - 2026

Class: ai