Machine Learning API – Ernesto Gonzales, MSDA

This is a project on creating an NLP Model deployment using Python, Flask, and Postman.

Project Overview

This project is a sentiment analysis tool. It classifies text (movie reviews) as positive or negative. This classification is done using natural language processing (NLP) techniques. It includes:

– Text Pre-processing

– Machine Learning Model Training

– Flask API Development

This project allows users to enter a sentence. They can receive a prediction of its sentiment, either positive or negative, via an API endpoint.

Dataset

Dataset Name: UCI Sentiment Labelled Sentences

Description: This dataset contains labeled sentences categorized as positive (1) or negative (0).

Source: UCI Machine Learning Repository

Data Format: .txt file with sentences and labels.

Project Workflow

The project consists of several main steps:

1. Data Loading: Loading and reading the .txt file format of the dataset.

2. Data Pre-processing: Cleaning and tokenizing text, removing stopwords.

3. Feature Extraction: Converting text data into numerical features using TF-IDF Vectorizer.

4. Model Training: Training a logistic regression model on the pre-processed data.

5. Model Evaluation: Evaluating the model on test data and assessing its performance.

6. API Development: Creating an API with Flask to expose the model as a service.

Model Training and Evaluation

Algorithm Used: Logistic Regression

Feature Engineering: TF-IDF Vectorizer

Evaluation Metrics:

– Accuracy: Measured to determine the overall performance of the model.

– Confusion Matrix: Visualizes the classification performance.

API Development

I created a RESTful API using Flask. It allows users to make POST requests with sentences about movie reviews. Users can then receive sentiment predictions.

Endpoint: /predict

Method: POST

Expected Input: JSON payload with a ‘sentence’ field.

{“sentence”: “I love this movie!”}

Response:

{“prediction”: 1}

Setup and Installation

Prerequisites: Python 3.7+, pip for package management

Installation Steps:

1. Clone the Repository:

git clone https://github.com/ernestog27/data-projects.gi t

2. Create and Activate a Virtual Environment:

python3 -m venv sentiment_env

source sentiment_env/bin/activate

3. Install Required Libraries:pip install -r requirements.txt

4. Run the Flask Application:

NLP_sentiment_analysis.py

Usage

Using Postman to Test the API:

1. Open Postman and set the request method to POST.

2. Enter the endpoint URL: http://127.0.0.1:5000/predict

3. Set Headers:

– Key: Content-Type, Value: application/json

4. Request Body: Choose ‘raw’ and JSON format, then enter:

{“sentence”: “This movie is amazing!”}

5. Send Request: Postman will return the sentiment prediction.

Results

Accuracy: The model achieved an accuracy of 75.3% on the test dataset.

Sample Predictions:

– ‘The movie was fantastic!’ -> Positive (1)

– ‘I did not enjoy the movie.’ -> Negative (0)

Confusion Matrix:

Future Improvements

Expand Dataset: Add more labeled sentences for training.

Model Optimization: Experiment with other models (e.g., SVM, neural networks) and hyper-parameter tuning.

Real-Time Updates: Retrain the model periodically with new data to improve prediction accuracy.

References:

Kotzias, D. (2015). Sentiment Labelled Sentences [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C57604.

Here is the link for latest python code:

Get the code from my GitHub Here

And below is a snapshot of the code for illustration purposes as of November 2024.

For the latest see link above.

# Sentiment Analysis API with Flask
# Ernesto Gonzales, MSDA

import pandas as pd

# Loading the dataset
data = pd.read_csv('sentiment_env/databases/sentiment labelled sentences/imdb_labelled.txt', delimiter = '\t', header = None)
data.columns = ['Sentence', 'Label'] # Rename columns for clarity

# Data preview
print(data.head())
print(data.info())
print(data['Label'].value_counts())

# Data Cleaning and Pre-processing 

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

# Downloading necessary NLTK data

nltk.download('stopwords')
nltk.download('punkt_tab')

# Function for text cleaning

def preprocess_text(text):
    text = re.sub(r'\W', ' ', text) # Remove non-word characters
    text = text.lower() # Convert text to lowercase
    words = word_tokenize(text) # Tokenize text
    words = [word for word in words if word not in stopwords.words('english')] # Remove stopwords
    return ' '.join(words)

# Applying function to the text column

data['Cleaned_Sentence'] = data['Sentence'].apply(preprocess_text)

# Spliting data into training and testing sets

from sklearn.model_selection import train_test_split
X = data['Cleaned_Sentence']
y = data['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Feature extraction

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features = 5000)
X_train_tdif = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Training and Evaluating model

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_tdif, y_train)

from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test_tfidf)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test,y_pred))

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
import matplotlib.pyplot as plt 

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)

# Display the confusion matrix
disp.plot()
plt.title("Confusion Matrix")
plt.show()

# Preparation for Deployment

import joblib

joblib.dump(model, 'model.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')

# Creating a Simple API with Flask

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Loading the saved model and vectorizer

model = joblib.load('model.pkl')
vectorizer = joblib.load('vectorizer.pkl')

@app.route('/predict', methods = ['POST'])
def predict():
    data = request.get_json(force = True)
    sentence = data['sentence']
    sentence_tfidf = vectorizer.transform([sentence])
    prediction = model.predict(sentence_tfidf)
    return jsonify({'prediction': int(prediction[0])})

if __name__ == '__main__':
    app.run(debug=True)

I hope this helps. Let’s learn and create more.

Until the next time,

Ernesto Gonzales, MSDA.

Tag: Machine Learning API

Building a RESTful API for Sentiment Analysis