Tag: Projects

  • Feature Engineering for Machine Learning | Logistic Regression, Decision Tree & Random Forest

    Introduction

    For Day 4, I worked on feature engineering: creating new features that help models perform better.

    I also compared different model families: Logistic Regression, Decision Tree, and Random Forest.

    Why It Matters

    Feature engineering is one of the most important skills in ML.

    The quality of your features often matters more than the choice of algorithm.

    Approach

    • Dataset: Titanic
    • New features: family_size, is_child, fare_per_person
    • Models: Logistic Regression, Decision Tree, Random Forest
    • Validation: Stratified 5-fold CV
    • Evaluation: Accuracy, F1, ROC-AUC
    • Visualization: ROC overlay of all models
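
    A minimal sketch of this setup, assuming the seaborn copy of the Titanic dataset, scikit-learn, and an arbitrary child-age cutoff of 12 (not necessarily the exact notebook code):

      import pandas as pd
      import seaborn as sns
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import StratifiedKFold, cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.tree import DecisionTreeClassifier

      df = sns.load_dataset("titanic")
      df["age"] = df["age"].fillna(df["age"].median())

      # Engineered features from the list above
      df["family_size"] = df["sibsp"] + df["parch"] + 1
      df["is_child"] = (df["age"] < 12).astype(int)          # assumed cutoff
      df["fare_per_person"] = df["fare"] / df["family_size"]

      cols = ["sex", "pclass", "age", "fare", "family_size", "is_child", "fare_per_person"]
      X = pd.get_dummies(df[cols], drop_first=True)
      y = df["survived"]

      # Same stratified 5-fold CV for all three model families
      cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
      models = {
          "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
          "Decision Tree": DecisionTreeClassifier(random_state=42),
          "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
      }
      for name, model in models.items():
          auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
          f1 = cross_val_score(model, X, y, cv=cv, scoring="f1").mean()
          print(f"{name}: ROC-AUC={auc:.3f}  F1={f1:.3f}")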

    Results

    Random Forest outperformed the simpler models, and the engineered features gave all models a boost. The ROC overlay made the performance gap clear.

    Takeaways

    • Small, thoughtful features can have a big impact.
    • Tree-based models are flexible and benefit from engineered features.
    • Comparing models side by side highlights trade-offs.

    Artifacts

    Video walkthrough

  • Cross-Validation and ROC Curves on the Titanic Dataset

    Introduction

    Day 3 was about going beyond a single train/test split.

    I added cross-validation and looked at ROC curves to better evaluate my model.

    Why It Matters

    One train/test split can give you a lucky (or unlucky) result.

    Cross-validation makes evaluation more robust. ROC curves show how your model performs at all thresholds, not just the default 0.5.

    Approach

    • Dataset: Titanic (expanded features)
    • Features: sex, age, fare, class, embarked, sibsp, parch, alone
    • Model: Logistic Regression
    • Validation: Stratified 5-fold cross-validation
    • Evaluation: Accuracy, F1, ROC-AUC
    • Visualization: ROC curve
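
    A minimal sketch of this evaluation loop, assuming the seaborn Titanic dataset and scikit-learn; the out-of-fold probabilities are an illustration, not necessarily the exact notebook code:

      import matplotlib.pyplot as plt
      import pandas as pd
      import seaborn as sns
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import roc_auc_score, roc_curve
      from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler

      df = sns.load_dataset("titanic")
      df["age"] = df["age"].fillna(df["age"].median())
      df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])

      cols = ["sex", "age", "fare", "pclass", "embarked", "sibsp", "parch", "alone"]
      X = pd.get_dummies(df[cols], drop_first=True)
      y = df["survived"]

      model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
      cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

      # Average scores across folds instead of trusting a single split
      for metric in ["accuracy", "f1", "roc_auc"]:
          print(metric, cross_val_score(model, X, y, cv=cv, scoring=metric).mean())

      # ROC curve from out-of-fold probabilities: sweeps every threshold, not just 0.5
      proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
      fpr, tpr, thresholds = roc_curve(y, proba)
      plt.plot(fpr, tpr, label=f"Logistic Regression (AUC={roc_auc_score(y, proba):.3f})")
      plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
      plt.xlabel("False positive rate")
      plt.ylabel("True positive rate")
      plt.legend()
      plt.show()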

    Results

    Cross-validation gave a more stable estimate of performance. The ROC curve showed the model does a decent job separating survivors from non-survivors, even if it’s not perfect.

    Takeaways

    • Always validate with multiple folds; it’s more reliable than a single split.
    • ROC-AUC is a more informative measure than accuracy alone for classification.
    • Adding more features can improve a model, but only if they add real signal.

    Artifacts

    Video walkthrough

  • Titanic Classification with Logistic Regression (Accuracy, Precision, Recall, F1)

    Introduction

    For Day 2, I switched to classification with the Titanic dataset.

    This dataset is the “Hello World” of ML classification: predicting survival based on passenger features.

    Why It Matters

    Binary classification problems are everywhere: fraud vs not fraud, spam vs not spam, churn vs no churn. Titanic survival is just a teaching ground.

    Approach

    • Dataset: Titanic (Seaborn)
    • Features: sex, age, fare, class, embarked
    • Model: Logistic Regression
    • Evaluation: Accuracy, Precision, Recall, F1, ROC-AUC
    • Visualization: Confusion Matrix
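
    A minimal sketch of this baseline, assuming the seaborn Titanic dataset and scikit-learn (the 80/20 split and random_state are arbitrary choices, not necessarily the notebook's):

      import pandas as pd
      import seaborn as sns
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                   precision_score, recall_score, roc_auc_score)
      from sklearn.model_selection import train_test_split
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler

      df = sns.load_dataset("titanic")
      df["age"] = df["age"].fillna(df["age"].median())
      df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])

      X = pd.get_dummies(df[["sex", "age", "fare", "pclass", "embarked"]], drop_first=True)
      y = df["survived"]
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.2, stratify=y, random_state=42)

      model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
      model.fit(X_train, y_train)
      pred = model.predict(X_test)
      proba = model.predict_proba(X_test)[:, 1]

      print("Accuracy :", accuracy_score(y_test, pred))
      print("Precision:", precision_score(y_test, pred))
      print("Recall   :", recall_score(y_test, pred))
      print("F1       :", f1_score(y_test, pred))
      print("ROC-AUC  :", roc_auc_score(y_test, proba))
      print(confusion_matrix(y_test, pred))  # rows = actual class, columns = predicted class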

    Results

    The model correctly picked up the obvious signals, such as sex (women had higher survival rates) and class (first-class passengers were more likely to survive).

    Takeaways

    • Accuracy isn’t the only metric: precision and recall tell a deeper story (see the toy numbers after this list).
    • Logistic Regression is simple but powerful for binary problems.
    • Visualizations like confusion matrices make results tangible.
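
    To make the first takeaway concrete, here is a toy calculation with made-up confusion-matrix counts (purely illustrative, not results from the notebook):

      # Invented counts: a model that almost always predicts the negative class
      tn, fp, fn, tp = 90, 5, 8, 2
      accuracy = (tp + tn) / (tp + tn + fp + fn)   # ~0.88, looks respectable
      precision = tp / (tp + fp)                   # ~0.29
      recall = tp / (tp + fn)                      # 0.20, most positives are missed
      f1 = 2 * precision * recall / (precision + recall)
      print(accuracy, precision, recall, f1)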

    Artifacts

    Video walkthrough

  • Predicting Housing Prices with Linear Regression in Python

    Introduction

    This was the very first step in my ML journey.

    I started simple: predicting California housing prices with Linear Regression.

    The goal wasn’t to get state-of-the-art results, but to get comfortable with the workflow: loading data, cleaning it, training a model, and evaluating it properly.

    Why It Matters

    Regression is one of the building blocks of machine learning.

    Many prediction tasks, from sales forecasting to estimating energy usage, build on this foundation.

    Approach

    • Dataset: California housing prices
    • Features: median income, house age, rooms, population, etc.
    • Model: Linear Regression (baseline) and Ridge Regression (regularized version)
    • Evaluation: Mean Squared Error (MSE), R²
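
    A minimal sketch of this workflow, assuming scikit-learn's built-in California housing loader; alpha=1.0 and the 80/20 split are arbitrary choices, and the residual plot echoes the takeaway below:

      import matplotlib.pyplot as plt
      from sklearn.datasets import fetch_california_housing
      from sklearn.linear_model import LinearRegression, Ridge
      from sklearn.metrics import mean_squared_error, r2_score
      from sklearn.model_selection import train_test_split
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler

      X, y = fetch_california_housing(return_X_y=True, as_frame=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

      # Baseline vs. regularized model on the same split
      linear = LinearRegression().fit(X_train, y_train)
      ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_train, y_train)

      for name, model in [("Linear", linear), ("Ridge", ridge)]:
          pred = model.predict(X_test)
          print(f"{name}: MSE={mean_squared_error(y_test, pred):.3f}  R2={r2_score(y_test, pred):.3f}")

      # Residuals should scatter around zero with no obvious pattern
      residuals = y_test - ridge.predict(X_test)
      plt.scatter(ridge.predict(X_test), residuals, s=5, alpha=0.4)
      plt.axhline(0, linestyle="--", color="gray")
      plt.xlabel("Predicted median house value")
      plt.ylabel("Residual")
      plt.show()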

    Results

    Both models gave decent predictions, but Ridge handled multicollinearity a bit better. The main win here was learning the full pipeline end-to-end.

    Takeaways

    • Always start with a baseline; even a simple model can give insights.
    • Regularization (like Ridge) helps stabilize models when features overlap.
    • Visualization of residuals is just as important as raw metrics.

    Artifacts

    Video walkthrough