GradesOLS

GradesOLS: Predictive Modeling of Student Performance

📖 Purpose of Study

Can we predict a student’s final academic outcome before they even sit for the exam?

The primary objective of this project is to construct a Multiple Linear Regression (OLS) model to predict the final grade (G3) of secondary school students. Beyond simple prediction, this study aims to quantify the influence of various factors—ranging from demographic background to study habits—on academic success. This repo serves as a case study in the end-to-end Data Science workflow, demonstrating how to move from raw data to a mathematically robust and stable model by handling multicollinearity, performing feature selection, and interpreting statistical diagnostics.
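
For reference, a multiple linear regression of this kind can be written in the generic form below. The predictors $x_{i1}, \dots, x_{ip}$ are left unnamed because the final predictor set is chosen later by feature selection; $n$ is the number of students (395 in this dataset) and $\varepsilon_i$ is the error term.

$$
G3_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i,
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Bigl( G3_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
$$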

📊 The Data

The dataset (Grades.csv) comprises academic, demographic, and behavioral records of 395 secondary school students. The target variable is G3 (Final Grade).

G1 (First Period Grade): Grade for the first academic period (numeric: 0 to 20).
G2 (Second Period Grade): Grade for the second academic period (numeric: 0 to 20).
HorasDeEstudio (Study Time): Weekly study time, coded in four ranges:
- 1: <2 hours
- 2: 2 to 5 hours
- 3: 5 to 10 hours
- 4: >10 hours
Reprobadas (Failures): Number of past class failures. A proxy for historic academic struggle.
Faltas (Absences): Number of school absences. A continuous variable measuring student attendance/engagement.
Escuela (School): Student’s school (‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira).
Sexo (Gender): Student’s sex (‘F’ - Female or ‘M’ - Male).
Edad (Age): Student’s age (numeric: 15 to 22).
Internet: Internet access at home (‘yes’ or ‘no’).
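
As a rough illustration of how columns like these might be prepared for OLS, the sketch below one-hot encodes the three categorical fields and standardizes the numeric predictors with pandas. The rows are invented for the example and are not taken from Grades.csv. The use of `drop_first=True` and the choice of which columns to scale are assumptions, not necessarily what the notebook does.

```python
import pandas as pd

# Hand-made sample using the column names from the data dictionary.
# The values are invented for illustration; they are NOT rows from Grades.csv.
sample = pd.DataFrame({
    "Escuela": ["GP", "MS", "GP", "MS"],
    "Sexo": ["F", "M", "M", "F"],
    "Internet": ["yes", "no", "yes", "yes"],
    "Edad": [15, 17, 16, 18],
    "HorasDeEstudio": [2, 1, 3, 4],
    "Reprobadas": [0, 1, 0, 2],
    "Faltas": [4, 10, 0, 6],
    "G1": [12, 8, 15, 10],
    "G2": [13, 9, 14, 10],
    "G3": [13, 8, 15, 11],
})

categorical = ["Escuela", "Sexo", "Internet"]
numeric = ["Edad", "HorasDeEstudio", "Reprobadas", "Faltas", "G1", "G2"]

# One-hot encode the categoricals. drop_first=True keeps one indicator per
# two-level column, avoiding a column that is redundant with the intercept.
encoded = pd.get_dummies(sample, columns=categorical, drop_first=True, dtype=int)

# Standardize each numeric predictor to zero mean and unit variance.
# The target G3 stays on its original 0-20 scale.
for col in numeric:
    encoded[col] = (encoded[col] - encoded[col].mean()) / encoded[col].std(ddof=0)

print(encoded.columns.tolist())
```

With two-level categoricals, each field collapses to a single indicator column (`Escuela_MS`, `Sexo_M`, `Internet_yes`), so the model matrix gains three columns rather than six.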

🛠 Main Conceptual Applications

The core purpose of this notebook is to rigorously apply statistical learning concepts to clean, analyze, and model real-world data. The key technical applications shown are:

Data Transformation: Handling categorical variables (One-Hot Encoding) and standardizing numerical inputs.
Exploratory Data Analysis (EDA): Examining distributions, outliers, and correlations.
Feature Selection: Implementing Forward Selection and Recursive Feature Elimination to identify the optimal predictor set.
Diagnostic Metrics:
- $R^2$ & Adjusted $R^2$: Evaluating explanatory power.
- RMSE & MAE: Assessing predictive error on unseen data.
- VIF (Variance Inflation Factor): Detecting multicollinearity.
- Condition Number: Assessing model stability.
Visualization: Using Box Plots for outlier detection and Line Plots for the “Elbow Method” in feature selection.
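
The diagnostic metrics listed above can be sketched in plain NumPy as follows. This is a minimal illustration on synthetic data (two deliberately correlated columns and one unrelated count column), not the notebook's code. The helper names (`fit_ols`, `vif`, `condition_number`, and so on) and the 300/95 train/test split are assumptions made for the example.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X (intercept first)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    return beta[0] + X @ beta[1:]

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(y, y_hat, n_predictors):
    n = len(y)
    return 1.0 - (1.0 - r2(y, y_hat)) * (n - 1) / (n - n_predictors - 1)

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        beta = fit_ols(others, X[:, j])
        out.append(1.0 / (1.0 - r2(X[:, j], predict(beta, others))))
    return np.array(out)

def condition_number(X):
    """Largest / smallest singular value of the design matrix with intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.linalg.cond(A))

# Synthetic stand-in data (NOT Grades.csv).
rng = np.random.default_rng(0)
n = 395
a = rng.normal(11, 3, n)
b = a + rng.normal(0, 1, n)          # correlated with `a` by construction
c = rng.poisson(5, n).astype(float)  # unrelated to `a` and `b`
X = np.column_stack([a, b, c])
y = 0.2 * a + 0.8 * b + rng.normal(0, 1.5, n)

# Fit on the first 300 rows, evaluate error on the held-out remainder.
train, test = slice(0, 300), slice(300, n)
beta = fit_ols(X[train], y[train])
fit_train = predict(beta, X[train])
fit_test = predict(beta, X[test])

train_r2 = r2(y[train], fit_train)
train_adj_r2 = adjusted_r2(y[train], fit_train, X.shape[1])
test_rmse = rmse(y[test], fit_test)
test_mae = mae(y[test], fit_test)
vifs = vif(X[train])
cond = condition_number(X[train])

print(f"R2={train_r2:.3f}  adjusted R2={train_adj_r2:.3f}")
print(f"test RMSE={test_rmse:.3f}  test MAE={test_mae:.3f}")
print("VIF per column:", np.round(vifs, 2))
print(f"condition number={cond:.1f}")
```

In this setup the two correlated columns get a high VIF while the unrelated column stays close to 1, which is the pattern VIF is meant to expose. The condition number is sensitive to column scale, so it is usually read after standardizing the predictors.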

🚀 Key Findings

The findings below are highlights only. The full analysis is in the notebook and case study, which narrate the paths explored, the justifications, and the decisions that led to the best model achievable with the tools listed above. The main findings are:

The strongest predictor of final success is past performance. Academic trajectory is rarely volatile.
Adding more variables (like Age or Study Time) offered negligible accuracy gains but introduced mathematical instability. A simpler 3-variable model was far more robust.
The relationship between absences and grades is neither strictly linear nor strictly negative, which challenges traditional behavioral assumptions.
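
The second finding, that extra variables add little once the strongest predictors are in, can be illustrated with a minimal forward-selection sketch. The data are synthetic: three informative columns and two pure-noise columns with the generic names `x1` to `x5`, chosen only to mimic the pattern. The function names and the use of adjusted $R^2$ as the selection score are assumptions for this example, not the notebook's exact procedure.

```python
import numpy as np

def adj_r2(X, y):
    """Adjusted R^2 of an OLS fit of y on X plus an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def forward_selection(X, y, names):
    """Greedily add the feature that gives the highest adjusted R^2."""
    remaining = list(range(X.shape[1]))
    chosen, path = [], []
    while remaining:
        scores = [(adj_r2(X[:, chosen + [j]], y), j) for j in remaining]
        best_score, best_j = max(scores)
        chosen.append(best_j)
        remaining.remove(best_j)
        path.append((names[best_j], best_score))
    return path

# Synthetic data (NOT Grades.csv): x1..x3 carry signal, x4 and x5 are noise.
rng = np.random.default_rng(42)
n = 395
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(size=n)
names = ["x1", "x2", "x3", "x4", "x5"]

path = forward_selection(X, y, names)
for step, (name, score) in enumerate(path, start=1):
    print(f"step {step}: add {name:<3} adjusted R2 = {score:.4f}")
```

Plotting the adjusted $R^2$ from `path` against the step number gives the kind of line plot used for the “Elbow Method”: the curve rises steeply for the informative columns and flattens once only noise columns remain.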

Project Files

GradesOLS.ipynb: The full analysis, visualizations and code.
GradesOLS.html: A quick-look version of the results.
Grades.csv: The raw data.