Can we predict a student's final academic outcome before they even sit for the exam?
The primary objective of this project is to construct a Multiple Linear Regression (OLS) model to predict the final grade (G3) of secondary school students. Beyond simple prediction, this study aims to quantify the influence of various factors, ranging from demographic background to study habits, on academic success. This repo serves as a case study in the end-to-end Data Science workflow, demonstrating how to move from raw data to a mathematically robust and stable model by handling multicollinearity, performing feature selection, and interpreting statistical diagnostics.
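As a rough illustration of the two central ideas above (an OLS fit and a multicollinearity check), here is a minimal sketch using only NumPy. It is not the notebook's code: the data are synthetic stand-ins rather than the real Grades.csv, the helper names `fit_ols`, `r_squared`, and `vif` are hypothetical, and the coefficients used to generate the fake grades are arbitrary assumptions. The variance inflation factor for predictor j is computed as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors.

```python
import numpy as np


def fit_ols(X, y):
    """Return OLS coefficients (intercept first) via least squares."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta


def r_squared(X, y):
    """Coefficient of determination of an OLS fit of y on X."""
    X1 = np.column_stack([np.ones(len(X)), X])
    resid = y - X1 @ fit_ols(X, y)
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R_j^2)."""
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)  # all predictors except column j
        factors.append(1.0 / (1.0 - r_squared(others, X[:, j])))
    return np.array(factors)


# Synthetic stand-in data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
n = 395
g1 = rng.normal(11, 3, n)
g2 = g1 + rng.normal(0, 1, n)          # deliberately correlated with g1
absences = rng.poisson(5, n).astype(float)
X = np.column_stack([g1, g2, absences])
g3 = 0.2 * g1 + 0.8 * g2 - 0.05 * absences + rng.normal(0, 1, n)

beta = fit_ols(X, g3)   # [intercept, b_g1, b_g2, b_absences]
vifs = vif(X)           # one factor per predictor, same column order as X

print("coefficients:", np.round(beta, 3))
print("VIF:", np.round(vifs, 2))
```

In this toy setup the two period grades are built to be strongly correlated, so their VIFs come out much larger than the one for absences. A pattern like that is the kind of signal that would motivate dropping or combining predictors during feature selection.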
The dataset (Grades.csv) comprises academic, demographic, and behavioral records of 395 secondary school students. The target variable is G3 (Final Grade).
G1 (First Period Grade): Grade for the first academic period (numeric: 0 to 20).
G2 (Second Period Grade): Grade for the second academic period (numeric: 0 to 20).
HorasDeEstudio (Study Time): Weekly study time the ranges:
Reprobadas (Failures): Number of past class failures. A proxy for historic academic struggle.
Faltas (Absences): Number of school absences. A continuous variable measuring student attendance/engagement.
Escuela (School): Student's school ('GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira).
Sexo (Gender): Student's sex ('F' - Female or 'M' - Male).
Edad (Age): Student's age (numeric: 15 to 22).
Internet: Internet access at home ('yes' or 'no').

The core purpose of this notebook is to rigorously apply statistical learning concepts to clean, analyze, and model real-world data. The key technical applications shown are:
Here are some of the main discoveries, though they aren't the only ones. The whole analysis can be found in the notebook and case study, which follow a narrative of the paths, justifications, and decisions taken to arrive at the best model with the specified tools. The main findings are: