Tag: data-science

  • Building a RESTful API for Sentiment Analysis

A project on deploying an NLP model using Python, Flask, and Postman.

    Project Overview

This project is a sentiment analysis tool that classifies text (movie reviews) as positive or negative using natural language processing (NLP) techniques. It includes:

    – Text Pre-processing

    – Machine Learning Model Training

    – Flask API Development

This project allows users to submit a sentence to an API endpoint and receive a prediction of its sentiment, either positive or negative.

    Dataset

    Dataset Name: UCI Sentiment Labelled Sentences

    Description: This dataset contains labeled sentences categorized as positive (1) or negative (0).

    Source: UCI Machine Learning Repository

Data Format: tab-separated .txt file with one sentence and its 0/1 label per line (illustrated below).
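For illustration only, entries look roughly like this (a sentence, a tab, then the label; these example lines are paraphrased, not quoted from the dataset):

This movie was a complete waste of time.	0
One of the best films I have seen this year.	1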

    Project Workflow

    The project consists of several main steps:

1. Data Loading: Reading the dataset from its tab-separated .txt file.

    2. Data Pre-processing: Cleaning and tokenizing text, removing stopwords.

    3. Feature Extraction: Converting text data into numerical features using TF-IDF Vectorizer.

    4. Model Training: Training a logistic regression model on the pre-processed data.

    5. Model Evaluation: Evaluating the model on test data and assessing its performance.

    6. API Development: Creating an API with Flask to expose the model as a service.

    Model Training and Evaluation

    Algorithm Used: Logistic Regression

    Feature Engineering: TF-IDF Vectorizer

    Evaluation Metrics:

    – Accuracy: Measured to determine the overall performance of the model.

    – Confusion Matrix: Visualizes the classification performance.

    API Development

I created a RESTful API using Flask. It allows users to make POST requests containing movie-review sentences and receive sentiment predictions in return.

    Endpoint: /predict

    Method: POST

Expected Input: JSON payload with a 'sentence' field.

{"sentence": "I love this movie!"}

    Response:

{"prediction": 1}

    Setup and Installation

    Prerequisites: Python 3.7+, pip for package management

    Installation Steps:

    1. Clone the Repository:

    git clone https://github.com/ernestog27/data-projects.git

    2. Create and Activate a Virtual Environment:

    python3 -m venv sentiment_env

    source sentiment_env/bin/activate

3. Install Required Libraries:

pip install -r requirements.txt

    4. Run the Flask Application:

python NLP_sentiment_analysis.py

    Usage

    Using Postman to Test the API:

    1. Open Postman and set the request method to POST.

    2. Enter the endpoint URL: http://127.0.0.1:5000/predict

    3. Set Headers:

    – Key: Content-Type, Value: application/json

4. Request Body: Choose 'raw' and JSON format, then enter:

{"sentence": "This movie is amazing!"}

    5. Send Request: Postman will return the sentiment prediction.
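Alternatively, you can test the endpoint from Python. Here is a minimal client sketch using the requests library; it assumes the Flask app is running locally on port 5000, as above:

# Minimal client sketch for the /predict endpoint
# (assumes the Flask app is running locally on port 5000)
import requests

response = requests.post(
    'http://127.0.0.1:5000/predict',
    json = {'sentence': 'This movie is amazing!'} # Same payload as the Postman example
)
print(response.json()) # e.g. {'prediction': 1}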

    Results

    Accuracy: The model achieved an accuracy of 75.3% on the test dataset.

    Sample Predictions:

    – ‘The movie was fantastic!’ -> Positive (1)

    – ‘I did not enjoy the movie.’ -> Negative (0)

Confusion Matrix: see the plot produced by the code snapshot below.

    Future Improvements

    Expand Dataset: Add more labeled sentences for training.

    Model Optimization: Experiment with other models (e.g., SVM, neural networks) and hyper-parameter tuning.

    Real-Time Updates: Retrain the model periodically with new data to improve prediction accuracy.

    References:

    Kotzias, D. (2015). Sentiment Labelled Sentences [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C57604.

Below is a snapshot of the code, for illustration purposes, as of November 2024. For the latest version, see the repository link in the Setup and Installation section above.

    # Sentiment Analysis API with Flask
    # Ernesto Gonzales, MSDA
    
    import pandas as pd
    
    # Loading the dataset
    data = pd.read_csv('sentiment_env/databases/sentiment labelled sentences/imdb_labelled.txt', delimiter = '\t', header = None)
    data.columns = ['Sentence', 'Label'] # Rename columns for clarity
    
    # Data preview
    print(data.head())
    print(data.info())
    print(data['Label'].value_counts())
    
    # Data Cleaning and Pre-processing 
    
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    import re
    
    # Downloading necessary NLTK data
    
    nltk.download('stopwords')
    nltk.download('punkt_tab')
    
# Function for text cleaning

stop_words = set(stopwords.words('english')) # Build the stopword set once, not on every lookup

def preprocess_text(text):
    text = re.sub(r'\W', ' ', text) # Replace non-word characters with spaces
    text = text.lower() # Convert text to lowercase
    words = word_tokenize(text) # Tokenize text
    words = [word for word in words if word not in stop_words] # Remove stopwords
    return ' '.join(words)
    
    # Applying function to the text column
    
    data['Cleaned_Sentence'] = data['Sentence'].apply(preprocess_text)
    
# Splitting data into training and testing sets
    
    from sklearn.model_selection import train_test_split
    X = data['Cleaned_Sentence']
    y = data['Label']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
    
    # Feature extraction
    
    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer = TfidfVectorizer(max_features = 5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)
    
    # Training and Evaluating model
    
    from sklearn.linear_model import LogisticRegression
    
    model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
    
    from sklearn.metrics import accuracy_score, classification_report
    
    y_pred = model.predict(X_test_tfidf)
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(classification_report(y_test,y_pred))
    
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
import matplotlib.pyplot as plt

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)

# Display the confusion matrix
disp.plot()
plt.title("Confusion Matrix")
plt.show()
    
    # Preparation for Deployment
    
    import joblib
    
    joblib.dump(model, 'model.pkl')
    joblib.dump(vectorizer, 'vectorizer.pkl')
    
    # Creating a Simple API with Flask
    
    from flask import Flask, request, jsonify
    import joblib
    
    app = Flask(__name__)
    
    # Loading the saved model and vectorizer
    
    model = joblib.load('model.pkl')
    vectorizer = joblib.load('vectorizer.pkl')
    
    @app.route('/predict', methods = ['POST'])
    def predict():
        data = request.get_json(force = True)
        sentence = data['sentence']
        sentence_tfidf = vectorizer.transform([sentence])
        prediction = model.predict(sentence_tfidf)
        return jsonify({'prediction': int(prediction[0])})
    
    if __name__ == '__main__':
        app.run(debug=True)
        

    I hope this helps. Let’s learn and create more.

    Until the next time,

    Ernesto Gonzales, MSDA.

  • Completing My Master’s and Far from The Goal

    On reaching the finish line with my master’s degree and knowing it is just the beginning.

I have completed all my classes for my Master's Degree in Data Analytics. It took me one year to finish instead of two. I did not expect to finish at such a pace, and I'm thankful it happened that way.

All I have left is to pass my last two assignments, which will be graded soon. One is an ARIMA model for time series analysis; in that project I created revenue forecasts, and building the model was a great experience.
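For readers curious what that looks like in code, here is a minimal ARIMA sketch using statsmodels. The (1, 1, 1) order and the synthetic revenue series are illustrative assumptions, not my actual project code:

# Illustrative ARIMA sketch: fit ARIMA(1, 1, 1) to a toy monthly
# revenue series and forecast 12 months ahead (not my project code)
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(42)
revenue = pd.Series(
    100 + np.cumsum(rng.normal(1, 5, size = 120)), # Synthetic monthly revenue
    index = pd.date_range('2014-01-01', periods = 120, freq = 'MS')
)

model = ARIMA(revenue, order = (1, 1, 1)).fit()
forecast = model.forecast(steps = 12) # 12-month revenue forecast
print(forecast.head())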

The second project consisted of creating a neural network using a Long Short-Term Memory (LSTM) model, a type of Recurrent Neural Network (RNN).
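As a rough illustration (again, not my coursework code), a tiny LSTM classifier in Keras might look like this; the layer sizes and toy data are assumptions:

# Illustrative LSTM sketch: a tiny Keras model that classifies
# integer-encoded sequences into two classes (toy data, not coursework code)
import numpy as np
from tensorflow import keras

X = np.random.randint(1, 1000, size = (256, 20)) # 256 toy sequences of length 20
y = np.random.randint(0, 2, size = (256,)) # Binary labels

model = keras.Sequential([
    keras.layers.Embedding(input_dim = 1000, output_dim = 32), # Token embeddings
    keras.layers.LSTM(16), # The recurrent LSTM layer
    keras.layers.Dense(1, activation = 'sigmoid') # Binary output
])
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
model.fit(X, y, epochs = 2, batch_size = 32, verbose = 0)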

    I hope to create a second model that I can share with you soon. I found the topic very interesting.

    One goal that I have for this website is to start incorporating my GitHub projects in here. The first project is going to be a Random Forest Regression Model I made using Mobile Food Delivery data.

A second project I will share is a K-Means Clustering unsupervised machine learning algorithm. This particular project was fun to do and is one of my favorites so far. The idea that there are hidden relationships within a dataset, and that an algorithm can find such clusters or groups, was exciting to see.

In retrospect, it was a challenging program that required me to explore unfamiliar topics. I had to read several articles and watch videos to understand the topics at a high level.

I had to become familiar with Python and its different libraries. I used to work in R, and now it seems that I am more comfortable coding in Python than in R.

It made me happy to discover that I enjoy coding, perhaps because to me it feels like a mix of creative writing and problem solving. I am excited to learn development at a later stage.

    I also got to work again with Tableau, my first business intelligence tool. It is intuitive and fast, but Power BI still has its place in my heart.

    Finishing my Master’s Degree was a dream of mine of many years. Doing it made me realize that there is so much to learn and explore.

This Master's taught me a baseline, and it is up to me to explore what is next. My next goal is to find my niche in the world of data.

So far I am greatly enjoying machine learning and neural networks, and I am excited to see what is on the horizon.

    Until next time,

    Ernesto