Student Performance Predictor — Case Study

Problem

Educational institutions lack actionable tools to identify at-risk students early. Manual analysis of academic performance data is time-consuming, doesn't scale, and often misses subtle patterns that predict student outcomes. Existing solutions are either too expensive for smaller institutions or too complex for everyday use by educators.

Goals

Regression

Predict final exam scores with RMSE below 8 points, enabling early intervention for students trending toward poor performance.

Classification

Classify pass/fail outcomes with greater than 90% accuracy, providing a reliable early warning system for educators.

Deployment

Deploy as an interactive web application where educators can input student data and receive instant predictions.

Reproducibility

Create a fully reproducible training pipeline with automated data ingestion, preprocessing, and model evaluation.

Architecture

┌───────────────────────────────────────────────────────────────────────────────┐
│                         Student Performance Predictor                         │
├───────────────────────────────────────────────────────────────────────────────┤
│                                                                               │
│ Data Pipeline     Feature Engineering    Model Training       Deployment      │
│ ┌─────────────┐   ┌──────────────────┐   ┌────────────────┐   ┌─────────────┐ │
│ │ UCI Dataset │   │ Study Efficiency │   │ Linear Reg.    │   │ Streamlit   │ │
│ │ Download    │ → │ Engagement Score │ → │ Random Forest  │ → │ Web App     │ │
│ │ Synthetic   │   │ Risk Index       │   │ Gradient Boost │   │ Interactive │ │
│ │ Fallback    │   │ Grade Categories │   │ XGBoost · SVM  │   │ Dashboard   │ │
│ └─────────────┘   └──────────────────┘   └────────────────┘   └─────────────┘ │
│                                                                               │
│ └─ joblib artifacts: models, preprocessors, metrics ────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────────────┘

System Design

The system is organized as a modular pipeline with distinct stages for data ingestion, feature engineering, model training, evaluation, and deployment. Each stage is independently runnable and produces artifacts consumed by the next stage.

Data Ingestion

Automatic download of the UCI Student Performance Dataset with synthetic data fallback when the source is unavailable. Handles missing values and type coercion automatically.
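
The fallback behavior could look roughly like the sketch below. It is not the project's actual loader: the function name load_student_data, the injected download and make_synthetic callables, and the numeric_cols parameter are all hypothetical, chosen so the example runs without network access.

```python
from typing import Callable

import pandas as pd


def load_student_data(
    download: Callable[[], pd.DataFrame],
    make_synthetic: Callable[[], pd.DataFrame],
    numeric_cols: list[str],
) -> tuple[pd.DataFrame, str]:
    """Try the real download first; fall back to synthetic data on failure."""
    try:
        df = download()
        source = "uci"
    except Exception as exc:  # e.g. network errors or a moved URL
        print(f"Download failed ({exc!r}); using synthetic data instead.")
        df = make_synthetic()
        source = "synthetic"

    df = df.copy()
    # Type coercion: unparseable entries become NaN so that a later
    # imputation step can fill them in.
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df, source
```

Passing the downloader in as a callable keeps the fallback logic testable offline; a real pipeline would presumably wrap an HTTP request in that callable.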

Feature Engineering

Creates derived features including study efficiency (hours × prior GPA), engagement score (extracurricular + attendance composite), risk index, and grade category bins.
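
A minimal sketch of these derived features follows. Only the study efficiency formula (hours × prior GPA) comes from the description above. The column names, the equal engagement weights, the risk index formula, and the grade bin edges are illustrative assumptions, since the case study does not specify them.

```python
import pandas as pd


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with four derived columns (hypothetical names)."""
    out = df.copy()

    # Study efficiency: hours x prior GPA, as stated in the text.
    out["study_efficiency"] = out["study_hours"] * out["prior_gpa"]

    # Engagement: composite of extracurricular flag (0/1) and attendance
    # rate (0-1). Equal weights are an assumption.
    out["engagement_score"] = (
        0.5 * out["extracurricular"] + 0.5 * out["attendance_rate"]
    )

    # Risk index: assumed formula for illustration only.
    out["risk_index"] = out["past_failures"] + (1.0 - out["attendance_rate"])

    # Grade categories: bin a 0-100 prior score. Bin edges are assumptions.
    out["grade_category"] = pd.cut(
        out["prior_score"],
        bins=[0, 60, 75, 90, 100],
        labels=["at_risk", "average", "good", "excellent"],
        include_lowest=True,
    )
    return out
```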

Model Training

Trains and compares six algorithms: Linear Regression, Decision Tree, Random Forest, Gradient Boosting, XGBoost, and SVM. Each model runs inside an sklearn pipeline with median imputation, standard scaling, and one-hot encoding.
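
One way to assemble such a pipeline is sketched below, using a ColumnTransformer so numeric and categorical columns get different treatment. The column lists and toy data are assumptions for illustration. The column names borrow from the UCI schema, but the project's actual feature set is not shown in this case study.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["studytime", "absences"]  # assumed numeric columns
CATEGORICAL = ["school"]             # assumed categorical column


def build_pipeline(model) -> Pipeline:
    """Wrap any estimator with imputation, scaling, and one-hot encoding."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, NUMERIC),
        # Unseen categories at inference time encode as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])


# Toy data with missing values, just to show the pipeline fits end to end.
X = pd.DataFrame({
    "studytime": [1, 2, np.nan, 4, 3, 2],
    "absences": [0, 4, 10, 2, np.nan, 6],
    "school": ["GP", "MS", "GP", "GP", "MS", "GP"],
})
y = np.array([10, 12, 8, 15, 13, 11])

pipe = build_pipeline(RandomForestRegressor(n_estimators=20, random_state=0))
pipe.fit(X, y)
preds = pipe.predict(X)
```

Because build_pipeline takes the estimator as an argument, the same preprocessing can be reused across all six candidate models.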

Evaluation

Compares all models on a shared set of metrics and tunes hyperparameters via grid search. Models, preprocessors, and metrics are persisted with joblib.
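
A grid search along these lines might look like the following sketch. The parameter grid, the synthetic regression data, and the choice of RMSE as the scoring metric are assumptions. The case study does not list the grids that were actually searched.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Stand-in data; the real pipeline would use the preprocessed student data.
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)

param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}  # assumed grid

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    # sklearn maximizes scores, so RMSE is exposed in negated form.
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

best_rmse = -search.best_score_  # flip the sign back to a plain RMSE
print(search.best_params_, round(best_rmse, 2))
```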

Tech Stack

Python 3.12+

Core language for all data processing and model training

scikit-learn

Pipeline construction, preprocessing, and model implementations

pandas / NumPy

Data manipulation, aggregation, and numerical operations

Streamlit

Interactive web UI for real-time predictions with visualizations

XGBoost

Gradient boosting framework for high-performance classification

matplotlib / seaborn

Feature importance plots, confusion matrices, and distribution charts

joblib

Model serialization and artifact persistence

Results

R² 0.88

Regression (Random Forest)

RMSE 7.86

Root Mean Squared Error

94.9%

Classification Accuracy

F1 0.97

F1 Score (Classification)

Top predictors identified: previous grades (G1, G2), study time, and attendance. The Gradient Boosting classifier achieved the highest accuracy, while the Random Forest regressor produced the best R² score.

Challenges & Trade-offs

Challenge

Small dataset size: The UCI dataset contains only 649 samples with 33 attributes, limiting model complexity and increasing overfitting risk.

Solution

Used 5-fold cross-validation and synthetic data augmentation, and prioritized simpler models (Random Forest over deep learning). Feature engineering added informative derived features.
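
For reference, 5-fold cross-validation with scikit-learn can be sketched as below. The synthetic data, the random seeds, and the forest size are placeholders rather than the project's settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data of roughly the same scale as a small tabular dataset.
X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=42)

# Shuffle before splitting so each fold is a random sample of the rows.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X,
    y,
    cv=cv,
    scoring="r2",
)

print(f"R2 per fold: {np.round(scores, 3)}")
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting the spread across folds alongside the mean gives a sense of how much a single train/test split could mislead on a dataset this small.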

Challenge

Imbalanced classes: Pass/fail distribution skewed toward passing students, which could bias the classifier toward the majority class.

Solution

Applied class weighting in the loss function and evaluated using precision, recall, and F1 score in addition to accuracy to ensure balanced performance.
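
The sketch below illustrates class weighting and per-class metrics on synthetic imbalanced data. It swaps in LogisticRegression, whose class_weight option reweights the loss directly, rather than the Gradient Boosting classifier reported in the results. The 15/85 class split is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: class 0 ("fail") is the minority.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.15, 0.85], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# "balanced" weights each class inversely to its frequency in y_train.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Score the minority class explicitly; accuracy alone can hide poor recall.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, pos_label=0, average="binary", zero_division=0
)
accuracy = accuracy_score(y_test, y_pred)
print(f"acc={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```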

Challenge

Data source reliability: The UCI dataset URL occasionally changes or becomes unavailable, breaking the pipeline.

Solution

Built a synthetic data generator that creates realistic student data as a fallback. The pipeline detects download failures and switches to synthetic data automatically.
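
A seeded generator in this spirit could be as simple as the sketch below. The column names follow the UCI schema (studytime, absences, G1, G2, G3), but the distributions and coefficients are invented for illustration and are not the project's generator.

```python
import numpy as np
import pandas as pd


def make_synthetic_students(n: int = 649, seed: int = 0) -> pd.DataFrame:
    """Generate n fake student rows with loosely correlated grades."""
    rng = np.random.default_rng(seed)

    studytime = rng.integers(1, 5, size=n)  # 1-4, like the UCI scale
    absences = rng.poisson(4, size=n)

    # Period grades on the 0-20 scale; G2 tracks G1, G3 tracks both.
    g1 = np.clip(rng.normal(11, 3, size=n), 0, 20)
    g2 = np.clip(g1 + rng.normal(0, 1.5, size=n), 0, 20)
    g3 = np.clip(
        0.3 * g1 + 0.6 * g2 + 0.4 * studytime - 0.05 * absences
        + rng.normal(0, 1, size=n),
        0,
        20,
    )

    return pd.DataFrame({
        "studytime": studytime,
        "absences": absences,
        "G1": np.round(g1).astype(int),
        "G2": np.round(g2).astype(int),
        "G3": np.round(g3).astype(int),
    })
```

Fixing the seed keeps demo runs reproducible even when the real dataset cannot be fetched.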

Implementation Highlights

Automated Pipeline

Single-command execution: python train.py runs the full pipeline from data ingestion through model evaluation. All artifacts are versioned and stored in artifacts/.

Interactive Dashboard

Streamlit app provides real-time predictions with feature sliders, model comparison views, and feature importance visualization. No coding required to use.

sklearn Pipelines

All preprocessing (imputation, scaling, encoding) is bundled into sklearn Pipeline objects, ensuring consistent transformation between training and inference.
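
To illustrate why bundling matters, the sketch below saves a fitted pipeline with joblib and reloads it, so inference reuses the scaler statistics learned at training time. The toy data and the temporary directory are stand-ins; the project stores its artifacts in artifacts/.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 0.0], [2.0, 4.0], [3.0, 2.0], [4.0, 6.0]])
y = np.array([10.0, 11.0, 14.0, 15.0])

# The scaler's mean and variance are fitted once, inside the pipeline.
pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    joblib.dump(pipe, path)       # preprocessing and model in one artifact
    restored = joblib.load(path)  # what an inference app would load

# The restored pipeline applies the identical transformation and model.
same = np.allclose(pipe.predict(X), restored.predict(X))
print(same)
```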

Model Comparison

Six models are compared across four metrics (R², RMSE, MAE, Accuracy), with automatic best-model selection. Results are logged to a comparison CSV for analysis.
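
A comparison loop of this kind might be sketched as follows. It covers only the regression metrics and three of the six models, runs on synthetic data, and picks the best model by lowest RMSE. The actual selection rule is not stated in this case study, so that criterion is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=1),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=1),
}

rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "r2": r2_score(y_te, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
        "mae": mean_absolute_error(y_te, pred),
    })

# Sort so the best (lowest-RMSE) model is the first row.
comparison = pd.DataFrame(rows).sort_values("rmse").reset_index(drop=True)
best_model = comparison.loc[0, "model"]

# CSV text that could be written to a comparison file.
csv_text = comparison.to_csv(index=False)
print(comparison)
```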

Lessons Learned

Feature engineering mattered more than model selection: engineered features (study efficiency, risk index) improved R² by 0.12 over raw features alone. Cross-validation was essential for reliable performance estimates on the small dataset. The synthetic data fallback proved valuable during development and demonstrations when the source dataset was unavailable.