Category: Machine Learning

  • Feature Engineering for Machine Learning | Logistic Regression, Decision Tree & Random Forest

    Introduction

    For Day 4, I worked on feature engineering: creating new features that help models perform better.

    I also compared different model families: Logistic Regression, Decision Tree, and Random Forest.

    Why It Matters

    Feature engineering is one of the most important skills in ML.

    The quality of your features often matters more than the choice of algorithm.

    Approach

    • Dataset: Titanic
    • New features: family_size, is_child, fare_per_person
    • Models: Logistic Regression, Decision Tree, Random Forest
    • Validation: Stratified 5-fold CV
    • Evaluation: Accuracy, F1, ROC-AUC
    • Visualization: ROC overlay of all models
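
    Here's a minimal sketch of the feature engineering and model comparison, assuming the Seaborn Titanic dataset; the preprocessing and feature list are simplified for illustration and aren't the full notebook.

    import seaborn as sns
    from sklearn.model_selection import cross_val_score, StratifiedKFold
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    df = sns.load_dataset('titanic')

    # Engineered features
    df['family_size'] = df['sibsp'] + df['parch'] + 1
    df['is_child'] = (df['age'] < 16).astype(int)  # missing ages count as adult here
    df['fare_per_person'] = df['fare'] / df['family_size']

    # Simple preprocessing: encode sex, fill missing values with medians
    df['sex'] = (df['sex'] == 'female').astype(int)
    features = ['sex', 'age', 'fare', 'pclass', 'family_size', 'is_child', 'fare_per_person']
    X = df[features].fillna(df[features].median())
    y = df['survived']

    # Compare the three model families with stratified 5-fold CV
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Decision Tree': DecisionTreeClassifier(random_state=42),
        'Random Forest': RandomForestClassifier(random_state=42),
    }
    for name, model in models.items():
        auc = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
        print(f"{name}: ROC-AUC = {auc.mean():.3f} (+/- {auc.std():.3f})")

    Swapping the scoring argument to 'accuracy' or 'f1' reproduces the other metrics from the same cross-validation setup.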

    Results

    Random Forest outperformed the simpler models, and the engineered features gave all models a boost. The ROC overlay made the performance gap clear.

    Takeaways

    • Small, thoughtful features can have a big impact.
    • Tree-based models are flexible and benefit from engineered features.
    • Comparing models side by side highlights trade-offs.

    Artifacts

    Video walkthrough

  • Cross-Validation and ROC Curves on the Titanic Dataset

    Introduction

    Day 3 was about going beyond a single train/test split.

    I added cross-validation and looked at ROC curves to better evaluate my model.

    Why It Matters

    One train/test split can give you a lucky (or unlucky) result.

    Cross-validation makes evaluation more robust. ROC curves show how your model performs at all thresholds, not just the default 0.5.

    Approach

    • Dataset: Titanic (expanded features)
    • Features: sex, age, fare, class, embarked, sibsp, parch, alone
    • Model: Logistic Regression
    • Validation: Stratified 5-fold cross-validation
    • Evaluation: Accuracy, F1, ROC-AUC
    • Visualization: ROC curve
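
    Here's a minimal sketch of this evaluation setup, assuming the Seaborn Titanic dataset (pclass standing in for class); the one-line encoding is a simplification of the actual preprocessing.

    import seaborn as sns
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import RocCurveDisplay

    # Load the expanded feature set and one-hot encode the categorical columns
    df = sns.load_dataset('titanic')
    df['alone'] = df['alone'].astype(int)
    X = pd.get_dummies(
        df[['sex', 'age', 'fare', 'pclass', 'embarked', 'sibsp', 'parch', 'alone']],
        columns=['sex', 'embarked'], drop_first=True)
    X = X.fillna(X.median())
    y = df['survived']

    # Stratified 5-fold cross-validation, scoring several metrics at once
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    model = LogisticRegression(max_iter=1000)
    scores = cross_validate(model, X, y, cv=cv, scoring=['accuracy', 'f1', 'roc_auc'])
    for metric in ('test_accuracy', 'test_f1', 'test_roc_auc'):
        print(f"{metric}: {scores[metric].mean():.3f} (+/- {scores[metric].std():.3f})")

    # ROC curve from one held-out split
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
    RocCurveDisplay.from_estimator(model.fit(X_train, y_train), X_test, y_test)
    plt.title("Logistic Regression ROC Curve")
    plt.show()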

    Results

    Cross-validation gave a more stable estimate of performance. The ROC curve showed the model does a decent job separating survivors from non-survivors, even if it’s not perfect.

    Takeaways

    • Always validate with multiple folds; it's more reliable than a single split.
    • ROC-AUC captures performance across all thresholds, which accuracy alone can't.
    • Adding more features can improve a model, but only if they add real signal.

    Artifacts

    Video walkthrough

  • Titanic Classification with Logistic Regression (Accuracy, Precision, Recall, F1)

    Introduction

    For Day 2, I switched to classification with the Titanic dataset.

    This dataset is the “Hello World” of ML classification: predicting survival based on passenger features.

    Why It Matters

    Binary classification problems are everywhere: fraud vs not fraud, spam vs not spam, churn vs no churn. Titanic survival is just a teaching ground.

    Approach

    • Dataset: Titanic (Seaborn)
    • Features: sex, age, fare, class, embarked
    • Model: Logistic Regression
    • Evaluation: Accuracy, Precision, Recall, F1, ROC-AUC
    • Visualization: Confusion Matrix
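
    Here's a minimal sketch of this setup, assuming the Seaborn Titanic dataset; the encoding and split are simplified for illustration.

    import seaborn as sns
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score, ConfusionMatrixDisplay)

    df = sns.load_dataset('titanic')
    X = pd.get_dummies(df[['sex', 'age', 'fare', 'pclass', 'embarked']],
                       columns=['sex', 'embarked'], drop_first=True)
    X = X.fillna(X.median())
    y = df['survived']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Report the full set of classification metrics, not just accuracy
    print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
    print(f"Precision: {precision_score(y_test, y_pred):.3f}")
    print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
    print(f"F1:        {f1_score(y_test, y_pred):.3f}")
    print(f"ROC-AUC:   {roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]):.3f}")

    # Confusion matrix makes the errors tangible
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
    plt.title("Titanic Survival: Confusion Matrix")
    plt.show()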

    Results

    The model correctly picked up obvious signals like sex (women had higher survival) and class (first class had better survival).

    Takeaways

    • Accuracy isn’t the only metric: precision and recall tell a deeper story.
    • Logistic Regression is simple but powerful for binary problems.
    • Visualizations like confusion matrices make results tangible.

    Artifacts

    Video walkthrough

  • Predicting Housing Prices with Linear Regression in Python

    Introduction

    This was the very first step in my ML journey.

    I started simple: predicting California housing prices with Linear Regression.

    The goal wasn’t to get state-of-the-art results, but to get comfortable with the workflow: loading data, cleaning it, training a model, and evaluating it properly.

    Why It Matters

    Regression is one of the building blocks of machine learning.

    Almost everything, from sales forecasts to predicting energy usage, starts with this foundation.

    Approach

    • Dataset: California housing prices
    • Features: median income, house age, rooms, population, etc.
    • Model: Linear Regression (baseline) and Ridge Regression (regularized version)
    • Evaluation: Mean Squared Error (MSE), R²
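
    Here's a minimal sketch of this baseline, using scikit-learn's built-in California housing loader; the split and Ridge alpha are illustrative defaults, not tuned values.

    from sklearn.datasets import fetch_california_housing
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.metrics import mean_squared_error, r2_score

    # Load the California housing data and hold out a test set
    X, y = fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Compare the plain baseline with its regularized version
    for name, model in [('Linear Regression', LinearRegression()),
                        ('Ridge', Ridge(alpha=1.0))]:
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        print(f"{name}: MSE = {mean_squared_error(y_test, preds):.3f}, "
              f"R2 = {r2_score(y_test, preds):.3f}")

    Running both on the same split makes the effect of regularization easy to compare.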

    Results

    Both models gave decent predictions, but Ridge handled multicollinearity a bit better. The main win here was learning the full pipeline end-to-end.

    Takeaways

    • Always start with a baseline; even a simple model can give insights.
    • Regularization (like Ridge) helps stabilize models when features overlap.
    • Visualization of residuals is just as important as raw metrics.

    Artifacts

    Video walkthrough

  • Building a RESTful API for Sentiment Analysis

    This project covers deploying an NLP model using Python, Flask, and Postman.

    Project Overview

    This project is a sentiment analysis tool that classifies movie review text as positive or negative using natural language processing (NLP) techniques. It includes:

    – Text Pre-processing

    – Machine Learning Model Training

    – Flask API Development

    Users can submit a sentence to an API endpoint and receive a prediction of its sentiment, positive or negative.

    Dataset

    Dataset Name: UCI Sentiment Labelled Sentences

    Description: This dataset contains labeled sentences categorized as positive (1) or negative (0).

    Source: UCI Machine Learning Repository

    Data Format: .txt file with sentences and labels.

    Project Workflow

    The project consists of several main steps:

    1. Data Loading: Loading and reading the .txt file format of the dataset.

    2. Data Pre-processing: Cleaning and tokenizing text, removing stopwords.

    3. Feature Extraction: Converting text data into numerical features using TF-IDF Vectorizer.

    4. Model Training: Training a logistic regression model on the pre-processed data.

    5. Model Evaluation: Evaluating the model on test data and assessing its performance.

    6. API Development: Creating an API with Flask to expose the model as a service.

    Model Training and Evaluation

    Algorithm Used: Logistic Regression

    Feature Engineering: TF-IDF Vectorizer

    Evaluation Metrics:

    – Accuracy: Measured to determine the overall performance of the model.

    – Confusion Matrix: Visualizes the classification performance.

    API Development

    I created a RESTful API with Flask that accepts POST requests containing movie review sentences and returns sentiment predictions.

    Endpoint: /predict

    Method: POST

    Expected Input: JSON payload with a ‘sentence’ field.

    {"sentence": "I love this movie!"}

    Response:

    {"prediction": 1}

    Setup and Installation

    Prerequisites: Python 3.7+, pip for package management

    Installation Steps:

    1. Clone the Repository:

    git clone https://github.com/ernestog27/data-projects.git

    2. Create and Activate a Virtual Environment:

    python3 -m venv sentiment_env

    source sentiment_env/bin/activate

    3. Install Required Libraries:

    pip install -r requirements.txt

    4. Run the Flask Application:

    python NLP_sentiment_analysis.py

    Usage

    Using Postman to Test the API:

    1. Open Postman and set the request method to POST.

    2. Enter the endpoint URL: http://127.0.0.1:5000/predict

    3. Set Headers:

    – Key: Content-Type, Value: application/json

    4. Request Body: Choose ‘raw’ and JSON format, then enter:

    {"sentence": "This movie is amazing!"}

    5. Send Request: Postman will return the sentiment prediction.
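
    If you'd rather script the call than use Postman, the same request can be sent from Python with the requests library (assuming the Flask app is running locally on the default port):

    # Equivalent request from Python instead of Postman
    import requests

    response = requests.post(
        "http://127.0.0.1:5000/predict",
        json={"sentence": "This movie is amazing!"},
    )
    print(response.json())  # e.g. {'prediction': 1}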

    Results

    Accuracy: The model achieved an accuracy of 75.3% on the test dataset.

    Sample Predictions:

    – ‘The movie was fantastic!’ -> Positive (1)

    – ‘I did not enjoy the movie.’ -> Negative (0)

    Confusion Matrix: see the plot generated in the code snapshot below.

    Future Improvements

    Expand Dataset: Add more labeled sentences for training.

    Model Optimization: Experiment with other models (e.g., SVM, neural networks) and hyper-parameter tuning.

    Real-Time Updates: Retrain the model periodically with new data to improve prediction accuracy.

    References:

    Kotzias, D. (2015). Sentiment Labelled Sentences [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C57604.

    Here is the link to the latest Python code:

    Below is a snapshot of the code for illustration purposes, as of November 2024. For the latest version, see the link above.

    # Sentiment Analysis API with Flask
    # Ernesto Gonzales, MSDA
    
    import pandas as pd
    
    # Loading the dataset
    data = pd.read_csv('sentiment_env/databases/sentiment labelled sentences/imdb_labelled.txt', delimiter = '\t', header = None)
    data.columns = ['Sentence', 'Label'] # Rename columns for clarity
    
    # Data preview
    print(data.head())
    print(data.info())
    print(data['Label'].value_counts())
    
    # Data Cleaning and Pre-processing 
    
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    import re
    
    # Downloading necessary NLTK data
    
    nltk.download('stopwords')
    nltk.download('punkt_tab')
    
    # Function for text cleaning
    
    def preprocess_text(text):
        text = re.sub(r'\W', ' ', text) # Remove non-word characters
        text = text.lower() # Convert text to lowercase
        words = word_tokenize(text) # Tokenize text
        words = [word for word in words if word not in stopwords.words('english')] # Remove stopwords
        return ' '.join(words)
    
    # Applying function to the text column
    
    data['Cleaned_Sentence'] = data['Sentence'].apply(preprocess_text)
    
    # Splitting data into training and testing sets
    
    from sklearn.model_selection import train_test_split
    X = data['Cleaned_Sentence']
    y = data['Label']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
    
    # Feature extraction
    
    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer = TfidfVectorizer(max_features = 5000)
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)
    
    # Training and Evaluating model
    
    from sklearn.linear_model import LogisticRegression
    
    model = LogisticRegression()
    model.fit(X_train_tfidf, y_train)
    
    from sklearn.metrics import accuracy_score, classification_report
    
    y_pred = model.predict(X_test_tfidf)
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(classification_report(y_test,y_pred))
    
    from sklearn.metrics import ConfusionMatrixDisplay
    import matplotlib.pyplot as plt

    # Confusion matrix: compute and display in one call
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
    plt.title("Confusion Matrix")
    plt.show()
    
    # Preparation for Deployment
    
    import joblib
    
    joblib.dump(model, 'model.pkl')
    joblib.dump(vectorizer, 'vectorizer.pkl')
    
    # Creating a Simple API with Flask
    
    from flask import Flask, request, jsonify
    import joblib
    
    app = Flask(__name__)
    
    # Loading the saved model and vectorizer
    
    model = joblib.load('model.pkl')
    vectorizer = joblib.load('vectorizer.pkl')
    
    @app.route('/predict', methods = ['POST'])
    def predict():
        data = request.get_json(force = True)
        sentence = data['sentence']
        sentence_tfidf = vectorizer.transform([sentence])
        prediction = model.predict(sentence_tfidf)
        return jsonify({'prediction': int(prediction[0])})
    
    if __name__ == '__main__':
        app.run(debug=True)
        

    I hope this helps. Let’s learn and create more.

    Until the next time,

    Ernesto Gonzales, MSDA.