Cross-validation gave a more stable estimate of performance. The ROC curve showed the model does a decent job separating survivors from non-survivors, even if it’s not perfect.
Takeaways
Always validate with multiple folds; a single train/test split gives a less reliable estimate.
ROC-AUC is often a more informative measure than accuracy alone for classification, especially when the classes are imbalanced.
Adding more features can improve a model, but only if they add real signal.
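As a rough illustration of the first two takeaways, the sketch below scores a model with ROC-AUC across several folds. The data is synthetic (generated with make_classification), and the fold count and model settings are assumptions chosen for the example, not the setup used in this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption; not the dataset from this post)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One ROC-AUC score per fold; the spread hints at how stable the estimate is
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("AUC per fold:", auc_scores.round(3))
print("Mean AUC:", round(float(auc_scores.mean()), 3), "+/-", round(float(auc_scores.std()), 3))
```

Reporting the mean together with the standard deviation makes it easier to tell a real improvement from fold-to-fold noise.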
This project covers deploying an NLP model using Python, Flask, and Postman.
Project Overview
This project is a sentiment analysis tool that classifies text (movie reviews) as positive or negative using natural language processing (NLP) techniques. It includes:
– Text Pre-processing
– Machine Learning Model Training
– Flask API Development
Users can submit a sentence to an API endpoint and receive a prediction of its sentiment, either positive or negative.
Dataset
Description: This dataset contains labeled sentences categorized as positive (1) or negative (0).
Source: UCI Machine Learning Repository
Data Format: .txt file with sentences and labels.
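To make the file layout concrete, here is a minimal sketch of reading a tab-separated "sentence, label" file with pandas. The two sample lines are made up for illustration and are not rows from the dataset. Passing csv.QUOTE_NONE is a suggestion on my part rather than something the project code does: with the default quoting, an unbalanced double quote inside a sentence can cause the parser to merge several lines into one row.

```python
import csv
import io

import pandas as pd

# Two made-up lines in "sentence<TAB>label" layout (hypothetical sample, not real rows)
sample_txt = 'A wonderful little film.\t1\nThe "plot made no sense at all.\t0\n'

# QUOTE_NONE treats double quotes as ordinary characters, so each line stays one row
df = pd.read_csv(
    io.StringIO(sample_txt),
    sep="\t",
    header=None,
    names=["Sentence", "Label"],
    quoting=csv.QUOTE_NONE,
)

print(df.shape)
print(df["Label"].tolist())
```

After loading the real file, it is worth comparing the row count against the number of lines in the .txt file to confirm nothing was merged.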
Project Workflow
The project consists of several main steps:
1. Data Loading: Loading and reading the dataset from its .txt file.
2. Data Pre-processing: Cleaning and tokenizing text, removing stopwords.
3. Feature Extraction: Converting text data into numerical features using TF-IDF Vectorizer.
4. Model Training: Training a logistic regression model on the pre-processed data.
5. Model Evaluation: Evaluating the model on test data and assessing its performance.
6. API Development: Creating an API with Flask to expose the model as a service.
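Steps 2 through 4 can also be chained into a single scikit-learn Pipeline, sketched below. This is an alternative arrangement, not the project's actual code: the sentences are toy examples invented for the illustration, and stop_words="english" uses scikit-learn's built-in stopword list rather than the NLTK list used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentences and labels, invented for illustration only
sentences = [
    "great movie with a wonderful cast",
    "wonderful story and great acting",
    "terrible plot and awful acting",
    "awful movie with a terrible ending",
]
labels = [1, 1, 0, 0]

# The vectorizer lowercases, tokenizes, and drops stopwords before computing TF-IDF weights
pipeline = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(),
)
pipeline.fit(sentences, labels)

# The fitted pipeline accepts raw text, so the same cleaning is applied at prediction time
print(pipeline.predict(["a wonderful and great film", "a terrible and awful film"]))
```

One benefit of this layout is that a single object holds both the vectorizer and the model, so they cannot get out of sync when saved and reloaded.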
Model Training and Evaluation
Algorithm Used: Logistic Regression
Feature Engineering: TF-IDF Vectorizer
Evaluation Metrics:
– Accuracy: Measured to determine the overall performance of the model.
– Confusion Matrix: Visualizes the classification performance.
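To show how the two metrics relate, the sketch below counts the four confusion-matrix cells by hand and derives accuracy from them. The labels and predictions are hypothetical values chosen to keep the arithmetic easy to follow; they are not results from this project.

```python
# Hypothetical true labels and predictions (not results from the project)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

# Count each cell of the 2x2 confusion matrix
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positive, predicted positive
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negative, predicted negative
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negative, predicted positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positive, predicted negative

# Rows are true labels, columns are predicted labels (the layout scikit-learn uses)
matrix = [[tn, fp], [fn, tp]]
accuracy = (tp + tn) / len(y_true)

print(matrix)    # → [[3, 1], [1, 3]]
print(accuracy)  # → 0.75
```

Accuracy collapses the matrix into one number, while the matrix itself shows whether the errors are mostly false positives or false negatives.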
API Development
I created a RESTful API using Flask. It allows users to make POST requests with sentences about movie reviews. Users can then receive sentiment predictions.
Endpoint: /predict
Method: POST
Expected Input: JSON payload with a 'sentence' field.
{"sentence": "I love this movie!"}
Response:
{"prediction": 1}
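The request and response shapes above can be tried without a running server or Postman by using Flask's built-in test client. The sketch below is only an illustration of the contract: stub_predict is a hypothetical stand-in for the trained model, and the 400 response for a missing 'sentence' field is an assumption of mine, not behavior of the project's API.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def stub_predict(sentence):
    # Hypothetical stand-in for the trained model: 1 if "love" appears, else 0
    return 1 if "love" in sentence.lower() else 0


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)
    sentence = payload.get("sentence") if isinstance(payload, dict) else None
    if not isinstance(sentence, str) or not sentence.strip():
        # Assumed behavior: reject requests that lack a usable 'sentence' field
        return jsonify({"error": "JSON body must include a non-empty 'sentence' string"}), 400
    return jsonify({"prediction": stub_predict(sentence)})


# The test client sends requests to the app in-process, with no server started
client = app.test_client()
ok = client.post("/predict", json={"sentence": "I love this movie!"})
bad = client.post("/predict", json={})

print(ok.status_code, ok.get_json())
print(bad.status_code, bad.get_json())
```

The same two requests can be reproduced in Postman by sending a POST with a raw JSON body to the /predict endpoint of the running app.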
Setup and Installation
Prerequisites: Python 3.7+, pip for package management
Below is a snapshot of the code as of November 2024, for illustration purposes.
For the latest version, see the link above.
# Sentiment Analysis API with Flask
# Ernesto Gonzales, MSDA
import pandas as pd
# Loading the dataset
data = pd.read_csv('sentiment_env/databases/sentiment labelled sentences/imdb_labelled.txt', delimiter = '\t', header = None)
data.columns = ['Sentence', 'Label'] # Rename columns for clarity
# Data preview
print(data.head())
print(data.info())
print(data['Label'].value_counts())
# Data Cleaning and Pre-processing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
# Downloading necessary NLTK data
nltk.download('stopwords')
nltk.download('punkt_tab')
# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
# Function for text cleaning
def preprocess_text(text):
    text = re.sub(r'\W', ' ', text) # Replace non-word characters with spaces
    text = text.lower() # Convert text to lowercase
    words = word_tokenize(text) # Tokenize text
    words = [word for word in words if word not in stop_words] # Remove stopwords
    return ' '.join(words)
# Applying function to the text column
data['Cleaned_Sentence'] = data['Sentence'].apply(preprocess_text)
# Splitting data into training and testing sets
from sklearn.model_selection import train_test_split
X = data['Cleaned_Sentence']
y = data['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
# Feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features = 5000)
# Fit the vocabulary on the training data only, then reuse it for the test data
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
# Training and Evaluating model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
from sklearn.metrics import accuracy_score, classification_report
y_pred = model.predict(X_test_tfidf)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test,y_pred))
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Confusion matrix: from_predictions computes and plots it in one call,
# so a second ConfusionMatrixDisplay (and a duplicate figure) is not needed
disp = ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
disp.ax_.set_title("Confusion Matrix")
# Display the confusion matrix
plt.show()
# Preparation for Deployment
import joblib
joblib.dump(model, 'model.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')
# Creating a Simple API with Flask
from flask import Flask, request, jsonify
import joblib
app = Flask(__name__)
# Loading the saved model and vectorizer
model = joblib.load('model.pkl')
vectorizer = joblib.load('vectorizer.pkl')
@app.route('/predict', methods = ['POST'])
def predict():
    data = request.get_json(force = True) # Parse the body as JSON regardless of Content-Type
    sentence = data['sentence']
    sentence_tfidf = vectorizer.transform([sentence]) # Vectorize with the fitted TF-IDF vocabulary
    prediction = model.predict(sentence_tfidf)
    return jsonify({'prediction': int(prediction[0])})
if __name__ == '__main__':
    app.run(debug=True) # Debug mode is intended for local development only