Cross-validation gave a more stable estimate of performance. The ROC curve showed the model does a decent job separating survivors from non-survivors, even if it’s not perfect.
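For reference, here is what that kind of check looks like in scikit-learn. This is a minimal sketch, not the project's actual code: the synthetic dataset and the RandomForestClassifier are stand-ins for whatever model and features you are validating.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; substitute your own features and labels
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
clf = RandomForestClassifier(random_state=42)

# 5-fold cross-validated ROC-AUC; the spread across folds shows how stable the estimate is
scores = cross_val_score(clf, X, y, cv=5, scoring='roc_auc')
print("ROC-AUC per fold:", scores.round(3))
print(f"Mean ROC-AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")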
Takeaways
Always validate with multiple folds; it's more reliable than a single train/test split.
For classification, ROC-AUC is a more informative measure than accuracy alone.
Adding more features can improve a model, but only if they add real signal.
I started simple: predicting California housing prices with Linear Regression.
The goal wasn’t to get state-of-the-art results, but to get comfortable with the workflow: loading data, cleaning it, training a model, and evaluating it properly.
Why It Matters
Regression is one of the building blocks of machine learning.
Almost everything, from sales forecasts to predicting energy usage, starts with this foundation.
Approach
Dataset: California housing prices
Features: median income, house age, rooms, population, etc.
Model: Linear Regression (baseline) and Ridge Regression (regularized version)
Evaluation: Mean Squared Error (MSE), R²
Results
Both models gave decent predictions, but Ridge handled multicollinearity a bit better. The main win here was learning the full pipeline end-to-end.
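For reference, a minimal version of that pipeline can be sketched in a few lines. This is an illustration rather than the project's exact code; scikit-learn's built-in fetch_california_housing loader stands in for however the data was actually obtained.

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load the data (median income, house age, rooms, population, etc.)
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the baseline and the regularized model, then compare MSE and R²
for name, model in [('Linear Regression', LinearRegression()), ('Ridge', Ridge(alpha=1.0))]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}: MSE={mean_squared_error(y_test, y_pred):.3f}, R²={r2_score(y_test, y_pred):.3f}")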
Takeaways
Always start with a baseline; even a simple model can give insights.
Regularization (like Ridge) helps stabilize models when features overlap.
Visualizing residuals is just as important as the raw metrics (see the sketch below).
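A residual plot takes only a few extra lines. Continuing the sketch above (this reuses the fitted model and the held-out split from there):

import matplotlib.pyplot as plt

# Residuals are the prediction errors on the held-out set
y_pred = model.predict(X_test)
residuals = y_test - y_pred

# A healthy residual plot is a structureless band around zero
plt.scatter(y_pred, residuals, alpha=0.4)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.title("Residuals vs. predicted values")
plt.show()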
This project walks through deploying an NLP model using Python, Flask, and Postman.
Project Overview
This project is a sentiment analysis tool: it uses natural language processing (NLP) techniques to classify movie-review text as positive or negative. It includes:
– Text Pre-processing
– Machine Learning Model Training
– Flask API Development
Users can submit a sentence to an API endpoint and receive a prediction of its sentiment, positive or negative.
Dataset
Description: Labeled sentences from movie reviews, each categorized as positive (1) or negative (0).
Source: UCI Machine Learning Repository (Sentiment Labelled Sentences)
Data Format: .txt file with tab-separated sentences and labels.
Project Workflow
The project consists of several main steps:
1. Data Loading: Reading the dataset's .txt file into a DataFrame.
2. Data Pre-processing: Cleaning and tokenizing text, removing stopwords.
3. Feature Extraction: Converting text data into numerical features using TF-IDF Vectorizer.
4. Model Training: Training a logistic regression model on the pre-processed data.
5. Model Evaluation: Evaluating the model on test data and assessing its performance.
6. API Development: Creating an API with Flask to expose the model as a service.
Model Training and Evaluation
Algorithm Used: Logistic Regression
Feature Engineering: TF-IDF Vectorizer
Evaluation Metrics:
– Accuracy: The share of test sentences classified correctly, as an overall performance check.
– Confusion Matrix: Visualizes correct and incorrect predictions for each class.
API Development
I created a RESTful API using Flask that lets users POST a movie-review sentence and receive a sentiment prediction in response.
Endpoint: /predict
Method: POST
Expected Input: JSON payload with a 'sentence' field.
{"sentence": "I love this movie!"}
Response:
{"prediction": 1}
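Once the Flask app is running locally (it defaults to port 5000), the endpoint can be exercised from Postman or from a few lines of Python using the requests library. The URL below assumes the default local address:

import requests

# Assumes the Flask app from this post is running on the default local port
resp = requests.post('http://127.0.0.1:5000/predict', json={'sentence': 'I love this movie!'})
print(resp.json())  # e.g. {'prediction': 1}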
Setup and Installation
Prerequisites: Python 3.7+ and pip for package management. The snapshot below relies on pandas, NLTK, scikit-learn, matplotlib, joblib, and Flask.
Below is a snapshot of the code for illustration, current as of November 2024. For the latest version, see the link above.
# Sentiment Analysis API with Flask
# Ernesto Gonzales, MSDA
import pandas as pd
# Loading the dataset
data = pd.read_csv('sentiment_env/databases/sentiment labelled sentences/imdb_labelled.txt', delimiter='\t', header=None)
data.columns = ['Sentence', 'Label']  # Rename columns for clarity
# Data preview
print(data.head())
print(data.info())
print(data['Label'].value_counts())
# Data Cleaning and Pre-processing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
# Downloading necessary NLTK data
nltk.download('stopwords')
nltk.download('punkt_tab')  # Tokenizer tables required by word_tokenize on recent NLTK releases
# Function for text cleaning
def preprocess_text(text):
    text = re.sub(r'\W', ' ', text)  # Replace non-word characters with spaces
    text = text.lower()  # Convert text to lowercase
    words = word_tokenize(text)  # Tokenize text
    stop_words = set(stopwords.words('english'))  # Build the set once; faster than a per-word list lookup
    words = [word for word in words if word not in stop_words]  # Remove stopwords
    return ' '.join(words)
# Applying function to the text column
data['Cleaned_Sentence'] = data['Sentence'].apply(preprocess_text)
# Splitting data into training and testing sets
from sklearn.model_selection import train_test_split
X = data['Cleaned_Sentence']
y = data['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)  # Fit the vocabulary on training data only
X_test_tfidf = vectorizer.transform(X_test)  # Reuse the fitted vocabulary on the test set
# Training and Evaluating model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
from sklearn.metrics import accuracy_score, classification_report
y_pred = model.predict(X_test_tfidf)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test,y_pred))
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
import matplotlib.pyplot as plt
# Build and display the confusion matrix (a single figure; the redundant extra plot call is dropped)
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.title("Confusion Matrix")
plt.show()
# Preparation for Deployment
import joblib
joblib.dump(model, 'model.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')
# Creating a Simple API with Flask
from flask import Flask, request, jsonify
import joblib
app = Flask(__name__)
# Loading the saved model and vectorizer
model = joblib.load('model.pkl')
vectorizer = joblib.load('vectorizer.pkl')
@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(force=True)  # Parse the JSON payload
    sentence = data['sentence']
    sentence_tfidf = vectorizer.transform([sentence])  # Vectorize with the fitted TF-IDF
    prediction = model.predict(sentence_tfidf)
    return jsonify({'prediction': int(prediction[0])})
if __name__ == '__main__':
    app.run(debug=True)