GradesOLS

GradesOLS: Predictive Modeling of Student Performance

📖 Purpose of Study

Can we predict a student’s final academic outcome before they even sit for the exam?

The primary objective of this project is to construct a Multiple Linear Regression (OLS) model to predict the final grade (G3) of secondary school students. Beyond simple prediction, this study aims to quantify the influence of various factors—ranging from demographic background to study habits—on academic success. This repo serves as a case study in the end-to-end Data Science workflow, demonstrating how to move from raw data to a mathematically robust and stable model by handling multicollinearity, performing feature selection, and interpreting statistical diagnostics.
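
For reference, the model described above can be written in standard OLS notation. The symbols are assumptions introduced here for illustration, not taken from the notebook: $x_{ij}$ is the value of predictor $j$ for student $i$, $p$ is the number of predictors, and $X$ is the design matrix with a leading column of ones.

$$
G3_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i,
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( G3_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 = (X^\top X)^{-1} X^\top y
$$

The closed form exists only when $X^\top X$ is invertible. Strongly correlated predictors push it toward singularity, which is why multicollinearity matters for the stability of the coefficient estimates.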

📊 The Data

The dataset (Grades.csv) comprises academic, demographic, and behavioral records of 395 secondary school students. The target variable is G3 (Final Grade).

🛠 Main Conceptual Applications

The core purpose of this notebook is to rigorously apply statistical learning concepts to clean, analyze, and model real-world data. The key technical applications shown are:

  1. Data Transformation: Handling categorical variables (One-Hot Encoding) and standardizing numerical inputs.
  2. Exploratory Data Analysis (EDA): Examining distributions and detecting outliers and correlations.
  3. Feature Selection: Implementing Forward Selection and Recursive Feature Elimination to identify the optimal predictor set.
  4. Diagnostic Metrics:
    • $R^2$ & Adjusted $R^2$: Evaluating explanatory power.
    • RMSE & MAE: Assessing predictive error on unseen data.
    • VIF (Variance Inflation Factor): Detecting multicollinearity.
    • Condition Number: Assessing model stability.
  5. Visualization: Using Box Plots for outlier detection and Line Plots for the “Elbow Method” in feature selection.
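
As a rough illustration of the error and fit metrics listed above, the following minimal sketch computes $R^2$, Adjusted $R^2$, RMSE, and MAE in plain Python. The function name `regression_metrics` and the sample grades are hypothetical and are not taken from the notebook or from Grades.csv; the notebook may well rely on a library such as statsmodels or scikit-learn instead.

```python
import math


def regression_metrics(y_true, y_pred, n_predictors):
    """Return R^2, adjusted R^2, RMSE and MAE for a fitted regression.

    Hypothetical helper for illustration; n_predictors excludes the intercept.
    """
    n = len(y_true)
    mean_y = sum(y_true) / n
    residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]

    ss_res = sum(r * r for r in residuals)             # residual sum of squares
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)  # total sum of squares

    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalizes each extra predictor relative to the sample size.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n

    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse, "mae": mae}


# Made-up grades on the 0-20 scale, purely for demonstration.
actual = [10, 12, 14, 16, 18]
predicted = [11, 12, 13, 16, 18]
metrics = regression_metrics(actual, predicted, n_predictors=1)
print(metrics)
```

Computing RMSE and MAE on a held-out test split rather than on the training rows is what makes them estimates of error on unseen data; $R^2$ and Adjusted $R^2$ are usually reported on the data used to fit the model.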

🚀 Key Findings

These are only some of the main discoveries. The full analysis is in the notebook and case study, which narrate the paths, justifications, and decisions taken to arrive at the best model with the tools listed above. The main findings are:

Project Files