An end-to-end machine learning pipeline that predicts student academic outcomes, covering every step from raw data to a deployed web application.
Educational institutions lack actionable tools to identify at-risk students early. Manual analysis of academic performance data is time-consuming, doesn't scale, and often misses subtle patterns that predict student outcomes. Existing solutions are either too expensive for smaller institutions or too complex for everyday use by educators.
Predict final exam scores with RMSE below 8 points, enabling early intervention for students trending toward poor performance.
Classify pass/fail outcomes with greater than 90% accuracy, providing a reliable early warning system for educators.
Deploy as an interactive web application where educators can input student data and receive instant predictions.
Create a fully reproducible training pipeline with automated data ingestion, preprocessing, and model evaluation.
The system is organized as a modular pipeline with distinct stages for data ingestion, feature engineering, model training, evaluation, and deployment. Each stage is independently runnable and produces artifacts consumed by the next stage.
Downloads the UCI Student Performance Dataset automatically and falls back to synthetic data when the source is unavailable. Missing values and type coercion are handled during ingestion.
Creates derived features including study efficiency (hours × prior GPA), engagement score (extracurricular + attendance composite), risk index, and grade category bins.
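The derived features above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the column names (study_hours, prior_gpa, extracurricular, attendance_rate) are hypothetical, the equal weights in the engagement score and the risk index formula are assumptions, and the bin edges assume a 0 to 4 GPA scale. Only the study efficiency formula (hours × prior GPA) comes from the description.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with four derived columns (illustrative only)."""
    out = df.copy()
    # Study efficiency: hours x prior GPA, as described in the text.
    out["study_efficiency"] = out["study_hours"] * out["prior_gpa"]
    # Engagement score: equal-weight composite (the weights are an assumption).
    out["engagement_score"] = (
        0.5 * out["extracurricular"] + 0.5 * out["attendance_rate"]
    )
    # Risk index: hypothetical formula, higher means more at risk.
    out["risk_index"] = (1.0 - out["attendance_rate"]) * (1.0 - out["prior_gpa"] / 4.0)
    # Grade category bins: the edges are assumptions on a 0-4 GPA scale.
    out["grade_category"] = pd.cut(
        out["prior_gpa"],
        bins=[0.0, 2.0, 3.0, 4.0],
        labels=["low", "mid", "high"],
        include_lowest=True,
    )
    return out


sample = pd.DataFrame(
    {
        "study_hours": [2.0, 10.0],
        "prior_gpa": [1.5, 3.6],
        "extracurricular": [0, 1],
        "attendance_rate": [0.6, 0.95],
    }
)
features = add_engineered_features(sample)
print(features[["study_efficiency", "engagement_score", "risk_index", "grade_category"]])
```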
Trains and compares six algorithms: Linear Regression, Decision Tree, Random Forest, Gradient Boosting, XGBoost, and SVM. Each model runs inside a scikit-learn pipeline with median imputation, standard scaling, and one-hot encoding.
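A pipeline of this shape might look like the sketch below. It shows only two of the six models and runs on randomly generated toy data. The column lists are assumptions: G1 and G2 are named in this write-up, while studytime, absences, school, and sex are taken from the public UCI schema and may not match the columns the project actually uses.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["G1", "G2", "studytime", "absences"]  # assumed numeric columns
CATEGORICAL = ["school", "sex"]  # assumed categorical columns


def make_pipeline_for(model) -> Pipeline:
    """Wrap a model with median imputation, scaling, and one-hot encoding."""
    numeric = Pipeline(
        [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
    )
    preprocess = ColumnTransformer(
        [
            ("num", numeric, NUMERIC),
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ]
    )
    return Pipeline([("preprocess", preprocess), ("model", model)])


# Toy data standing in for the real dataset.
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame(
    {
        "G1": rng.integers(0, 21, n).astype(float),
        "G2": rng.integers(0, 21, n).astype(float),
        "studytime": rng.integers(1, 5, n).astype(float),
        "absences": rng.integers(0, 30, n).astype(float),
        "school": rng.choice(["GP", "MS"], n),
        "sex": rng.choice(["F", "M"], n),
    }
)
y = 0.5 * X["G1"] + 0.5 * X["G2"] + rng.normal(0, 1, n)
X.loc[0, "G1"] = np.nan  # a missing value for the median imputer to fill

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=20, random_state=0),
}
fitted = {name: make_pipeline_for(m).fit(X, y) for name, m in models.items()}
preds = fitted["linear"].predict(X.head(3))
print(preds)
```

Because the preprocessing lives inside the fitted pipeline, the same object can be applied to raw rows at inference time without repeating any transformation by hand.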
Comprehensive metrics comparison across all models. Hyperparameter tuning via grid search. Artifact persistence via joblib for models, preprocessors, and metrics.
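As a rough sketch of the tuning and persistence step, the example below runs a small grid search and saves the best estimator and its metrics with joblib. The parameter grid, the scoring choice, the file names, and the synthetic data are all assumptions made for illustration; the project's real grid and artifact layout are not specified here.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the student dataset.
X, y = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)

# Hypothetical grid: the real search space is not documented.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 3]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

# Persist the tuned model and its metrics (a temp dir is used for the demo).
artifacts = Path(tempfile.mkdtemp()) / "artifacts"
artifacts.mkdir()
metrics = {"best_params": search.best_params_, "cv_rmse": float(-search.best_score_)}
joblib.dump(search.best_estimator_, str(artifacts / "model.joblib"))
joblib.dump(metrics, str(artifacts / "metrics.joblib"))

# Reload to confirm the round trip preserves predictions.
restored = joblib.load(str(artifacts / "model.joblib"))
print(metrics)
```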
Core language for all data processing and model training
Pipeline construction, preprocessing, and model implementations
Data manipulation, aggregation, and numerical operations
Interactive web UI for real-time predictions with visualizations
Gradient boosting framework for high-performance classification
Feature importance plots, confusion matrices, and distribution charts
Model serialization and artifact persistence
Top predictors identified: previous grades (G1, G2), study time, and attendance. The Gradient Boosting classifier achieved the highest accuracy, while the Random Forest regressor produced the best R² score.
Small dataset size: The UCI dataset contains only 649 samples with 33 features, limiting model complexity and increasing overfitting risk.
Used 5-fold cross-validation and synthetic data augmentation, and prioritized simpler models (Random Forest over deep learning). Feature engineering added informative derived features.
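The 5-fold cross-validation mentioned above could be run along these lines. The data is synthetic and the Random Forest settings are placeholders, so the scores say nothing about the real dataset. The point is the pattern: report the mean and spread across folds rather than a single train/test split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the small student dataset.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Shuffled 5-fold split with a fixed seed so results are reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)

# One R² score per held-out fold.
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print("R2 per fold:", np.round(scores, 3))
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```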
Imbalanced classes: The pass/fail distribution is skewed toward passing students, which could bias the classifier toward the majority class.
Applied class weighting in the loss function and evaluated using precision, recall, and F1 score in addition to accuracy to ensure balanced performance.
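One way to apply class weighting and report per-class metrics is sketched below. Logistic regression is used only as a convenient stand-in with a class_weight option; it is not one of the six models listed in this write-up. The 20/80 fail/pass split is an assumed ratio for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Imbalanced toy data: class 0 = fail (minority), class 1 = pass (majority).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.2, 0.8], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights the loss inversely to class frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision, recall, and F1 alongside overall accuracy.
precision, recall, f1, support = precision_recall_fscore_support(
    y_test, y_pred, labels=[0, 1], zero_division=0
)
accuracy = accuracy_score(y_test, y_pred)
for label, name in enumerate(["fail", "pass"]):
    print(f"{name}: precision={precision[label]:.2f} "
          f"recall={recall[label]:.2f} f1={f1[label]:.2f} n={support[label]}")
print(f"accuracy={accuracy:.2f}")
```

Recall on the fail class is the number to watch for an early warning system, since a missed at-risk student costs more than a false alarm.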
Data source reliability: The UCI dataset URL occasionally changes or becomes unavailable, breaking the pipeline.
Built a synthetic data generator that creates realistic student data as a fallback. The pipeline detects download failures and switches to synthetic data automatically.
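The fallback behaviour might be structured like this sketch. The function names, the synthetic column set, and the grade-generation rule are hypothetical; the actual generator is described only as producing "realistic student data". The download is passed in as a callable so the example can simulate a failure without touching the network.

```python
import numpy as np
import pandas as pd


def make_synthetic_students(n: int = 200, seed: int = 0) -> pd.DataFrame:
    """Generate toy student rows; the column choice and rules are assumptions."""
    rng = np.random.default_rng(seed)
    g1 = rng.integers(0, 21, n)
    # Later grades drift slightly from earlier ones and stay within 0-20.
    g2 = np.clip(g1 + rng.integers(-2, 3, n), 0, 20)
    g3 = np.clip(g2 + rng.integers(-2, 3, n), 0, 20)
    return pd.DataFrame(
        {
            "studytime": rng.integers(1, 5, n),
            "absences": rng.integers(0, 30, n),
            "G1": g1,
            "G2": g2,
            "G3": g3,
        }
    )


def load_students(fetch, n_fallback: int = 200):
    """Try the real source first; fall back to synthetic data on failure."""
    try:
        return fetch(), "remote"
    except (OSError, ValueError) as exc:  # network errors and parse errors
        print(f"Download failed ({exc}); using synthetic data instead.")
        return make_synthetic_students(n_fallback), "synthetic"


def broken_fetch() -> pd.DataFrame:
    """Stand-in for the real download, simulating an unavailable source."""
    raise OSError("simulated download failure")


df, source = load_students(broken_fetch)
print(source, df.shape)
```

Returning the source label alongside the data lets downstream stages record whether a run used real or synthetic rows.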
Single-command execution: python train.py runs the full pipeline from data ingestion through model evaluation. All artifacts are versioned and stored in artifacts/.
Streamlit app provides real-time predictions with feature sliders, model comparison views, and feature importance visualization. No coding required to use.
All preprocessing (imputation, scaling, encoding) is bundled into sklearn Pipeline objects, ensuring consistent transformation between training and inference.
Six models are compared across four metrics (R², RMSE, MAE, Accuracy) with automatic best-model selection. Results are logged to a comparison CSV for analysis.
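A comparison table of this kind could be produced as in the sketch below, shown with two models on synthetic data. Several details are assumptions: accuracy is computed by thresholding predicted scores at a hypothetical pass mark of 10, the best model is chosen by lowest RMSE, and the CSV file name is invented. The write-up does not state how the project itself computes accuracy or selects the winner.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    accuracy_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

PASS_MARK = 10.0  # hypothetical pass threshold on a 0-20 scale

# Synthetic scores rescaled so they centre on the pass mark.
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
y = PASS_MARK + 3.0 * (y - y.mean()) / y.std()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}

rows = []
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    rows.append(
        {
            "model": name,
            "r2": r2_score(y_test, pred),
            "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
            "mae": mean_absolute_error(y_test, pred),
            # Pass/fail accuracy derived from the predicted score (assumption).
            "accuracy": accuracy_score(y_test >= PASS_MARK, pred >= PASS_MARK),
        }
    )

# Lowest RMSE first; the top row is treated as the best model (assumption).
comparison = pd.DataFrame(rows).sort_values("rmse").reset_index(drop=True)
best_model = comparison.loc[0, "model"]
csv_path = Path(tempfile.mkdtemp()) / "model_comparison.csv"
comparison.to_csv(csv_path, index=False)
print(comparison)
print("best:", best_model)
```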
Feature engineering had more impact than model selection: the engineered features (study efficiency, risk index) improved R² by 0.12 over raw features alone. Cross-validation was essential for reliable performance estimates on the small dataset. The synthetic data fallback proved valuable during development and demonstrations when the source dataset was unavailable.