Introduction¶

In this notebook we will create a regression model and do an Exploratory Data Analysis (EDA) on a dataset centered on the grades of students in two potuguese schools, especially the grades are for mathemtaics and portuguese language subjects (2008). This data set has a total of 33 columns, of possible features, of both numerical and categorical data.

The study will follow the following flow:

  1. Data Aquisition and Overview
  2. Data Exploration and Comprehension
  3. Data Cleaning
  4. Correlation Analysis
  5. Feature Selection
  6. Training and Testing of Model
  7. Reflections and Conclusions

We begin by importing the necessary libraries and loading the dataset with the help of pandas. We use the pandas function read_csv and the specific path to our data csv file to save it into a variable data we will use throughout this notebook. This data comes and is being loaded from the same folder in google Drive.

In [29]:
# Give Access to google drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

The first step in our exploration is to visualize the top 5 rows of the dataset using the .head() method. This is a quality-control step to confirm that the data was loaded correctly. In these first few rows, we can observe interesting facts like:

The gender (Sexo) is defined as F (female) and M (male). We are going to have to need to change the Internet column to a binary float one. Also, we are ging to have to encode some of the relevant categorical columns as it could be, for example, the school (Escuela)

In [30]:
import pandas as pd

# create the data path
grade_df = pd.read_csv("/content/drive/My Drive/Colab Notebooks/AI/A1.3 Calificaciones.csv")

# print to confirm the import and get a main idea on the data structure
grade_df.head()
Out[30]:
Escuela Sexo Edad HorasDeEstudio Reprobadas Internet Faltas G1 G2 G3
0 GP F 18 2 0 no 6 5 6 6
1 GP F 17 2 0 yes 4 5 5 6
2 GP F 15 2 3 yes 10 7 8 10
3 GP F 15 3 0 yes 2 15 14 15
4 GP F 16 2 0 no 4 6 10 10

Prior to exploration, we initialize the dataset dimensions using .shape. By extracting the row and column counts via indices [0] and [1], we store these global variables to track data consistency throughout the preprocessing pipeline.

In the study we are working with 10 columns (9 possible features and 1 target) and 395 (hopefully good) observations.

In [31]:
rows = grade_df.shape[0]
cols = grade_df.shape[1]

print(f"Rows: {rows} \nColumns: {cols}")
Rows: 395 
Columns: 10

Data Exploration and Comprehension¶

We begin our exploration by profiling the dataset's structure to validate the data types and integrity. Using the .info() method, we identified a clean schema with no missing values, eliminating the need for imputation. We have classified the features into two primary groups for subsequent preprocessing:

  • Categorical Features: escuela, sexo, and internet (requiring encoding).

  • Numerical Features: edad, HorasDeEstudio, Reprobadas, Faltas, and the grade variables G1, G2, and G3.

In [32]:
# get the info of the colums
grade_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Escuela         395 non-null    object
 1   Sexo            395 non-null    object
 2   Edad            395 non-null    int64 
 3   HorasDeEstudio  395 non-null    int64 
 4   Reprobadas      395 non-null    int64 
 5   Internet        395 non-null    object
 6   Faltas          395 non-null    int64 
 7   G1              395 non-null    int64 
 8   G2              395 non-null    int64 
 9   G3              395 non-null    int64 
dtypes: int64(7), object(3)
memory usage: 31.0+ KB

During the data cleaning phase, we evaluate the cardinality of our categorical features to determine the most effective encoding method. High cardinality (e.g., columns with hundreds of unique names) can lead to an excessively sparse feature matrix. To do this, we first isolate the categorical features into a reusable variable, cat_cols, using select_dtypes (for type = object). We then apply the .nunique() function to quantify the unique, non-null, observations within these columns.

Additionally we will see the diversity of our categorical data with a frequency analysis using .value_counts(), as this provides the data present in each category and how many of each there are. This was done with the help of a loop that runs through all the values in cat_cols and prints them in the following way: print(f"{data[c].value_counts()}\n"), c being the specific column name.

Observations

  • Low Cardinality: this confirms that the categorical features possess low cardinality. This is ideal for One-Hot Encoding (creating binary dummy variables), as it allows us to get clear, interpretable coefficients without significantly increasing the model's dimensionality.
  • School (escuela): Significant skew toward the GP institution (349 vs. 46). The model may yield highly reliable predictions for GP students, but its performance on MS students is statistically weaker due to the sample size.
  • Internet Access: A heavy majority of students (329) have internet access compared to only 66 without. This limited "No" sample might make it difficult for the regression to isolate the true effect of internet access on grades.
  • Gender (sexo): This is the most balanced feature (208 F / 187 M), ensuring that any identified gender-based trends are robust and less prone to sample bias.
In [33]:
# Save categorcal object cols
cat_cols = grade_df.select_dtypes(include=['object']).columns

# get the unique vals
print(grade_df[cat_cols].nunique())
Escuela     2
Sexo        2
Internet    2
dtype: int64
In [34]:
# loop through categorical cols
for c in cat_cols:
  # print the amount of observations per category
  print(f"{grade_df[c].value_counts()}\n")
Escuela
GP    349
MS     46
Name: count, dtype: int64

Sexo
F    208
M    187
Name: count, dtype: int64

Internet
yes    329
no      66
Name: count, dtype: int64

In [35]:
# Check the mean of the target variable for each school
print(grade_df.groupby('Escuela')['G3'].mean())
Escuela
GP    10.489971
MS     9.847826
Name: G3, dtype: float64

To transition into the numerical analysis phase, we utilize the .describe() method. This allows us to inspect the central tendency (mean, median) and the spread (standard deviation, quartiles) of our variables, providing the foundation for detecting skewness and potential outliers (with the turkey method).

Observations

  1. Absences (Faltas): The variable Faltas exhibits a standard deviation of $8.003$, which exceeds its mean of $5.709$. This indicates a highly skewed distribution with heavy right-tail dispersion. While the maximum value of $75$ absences is technically feasible within a standard academic semester, it represents an extreme outlier that could influence the regression line.
  2. Grading Schema: The target variables (G1, G2, G3) follow a 0–20 scale. With means hovering around $10.4–10.9$, the data suggests a central tendency near the pass. However, the $0$ minimum in G2 and G3 warrants further investigation to determine if these represent true zeros or missing data disguised as zeros.
  3. Feature Ranges: Variables such as Age (Edad) with $15–22$ and HoursOfStudy (HorasDeEstudio) with $1–4$ show low variance, suggesting they may act as stable predictors. Conversely, the higher variance in grades (G3 $\sigma \approx 4.58$) indicates sufficient diversity in the target variable to attempt a regression analysis.
In [36]:
# show mathematical summary of the data
grade_df.describe()
Out[36]:
Edad HorasDeEstudio Reprobadas Faltas G1 G2 G3
count 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000
mean 16.696203 2.035443 0.334177 5.708861 10.908861 10.713924 10.415190
std 1.276043 0.839240 0.743651 8.003096 3.319195 3.761505 4.581443
min 15.000000 1.000000 0.000000 0.000000 3.000000 0.000000 0.000000
25% 16.000000 1.000000 0.000000 0.000000 8.000000 9.000000 8.000000
50% 17.000000 2.000000 0.000000 4.000000 11.000000 11.000000 11.000000
75% 18.000000 2.000000 0.000000 8.000000 13.000000 13.000000 14.000000
max 22.000000 4.000000 3.000000 75.000000 19.000000 19.000000 20.000000

To evaluate the structural readiness of our features and visualize the data, we generated boxplots for all numerical variables using Matplotlib and Seaborn. By passing the dataframe to sns.boxplot(ax=axes), we can identify dispersion patterns and potential regression blockers:

Observations

  • Outliers in Faltas (Absences): The visualization confirms that the extreme value (75) is not isolated; rather, there is a cluster of high-absence observations. This "tail" suggests a log-transformation is preferable over simple clipping (which works better for a "crazy" case).

  • Performance (G1 → G3): While the numerical standard deviation suggested a spread, the visual scale reveals a significant widening of the interquartile range in G3. This indicates that student's performance becomes more polarized as the semester progresses.

  • Multicollinearity: The near-identical positioning and centering of the G1, G2, and G3 boxes indicate high feature redundancy. Including all three in a single OLS model would likely trigger high VIF (Variance Inflation Factor).

  • Feature Sparsity in Reprobadas: The "invisible" box for failed courses confirms a lack of variance; since most observations are zero, this feature may provide a weak predictive signal.

  • Stable Predictors: Age (Edad) and Hours of Study (HorasDeEstudio) show compact, well-centered distributions with minimal outliers, marking them as stable candidates for our baseline model.

In [37]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))

sns.boxplot(data=grade_df.select_dtypes(include='number'), palette="Set2", orient='h')

plt.title("Comparison of All Numerical Feature Distributions")
plt.xticks(rotation=45)
plt.show()
No description has been provided for this image

Data cleaning and Tranformation¶

To deal with the identified right-skew in the Faltas (Absences) column, we will apply a Logarithmic Transformation. The Transformation Logic is that Logs effectively "compress" extreme values (squashing the big numbers) and "expand" the lower range (stretching the small ones). We use np.log1p(), which calculates $log(x + 1)$. The addition of $+1$ is essential because $log(0)$ is mathematically undefined, and a large portion of our student population has zero absences.

To evaluate the effectiveness of this transformation, we will compare the Kernel Density Estimation (KDE) before and after the change. The KDE provides a smoothed, continuous probability density curve, allowing us to see if we have successfully reduced the "tail" and centered the distribution. By removing the heavy skewness, we reduce the leverage of extreme observations, ensuring that our regression coefficients are more representative of the general student population.

In [38]:
import numpy as np

# lets see the before histogram
sns.histplot(grade_df['Faltas'], kde=True)
plt.title("Faltas before Log Transformation")
Out[38]:
Text(0.5, 1.0, 'Faltas before Log Transformation')
No description has been provided for this image
In [39]:
# do the log tarsnformation
grade_df['Faltas'] = np.log1p(grade_df['Faltas'])

# lets see if it improved
sns.histplot(grade_df['Faltas'], kde=True)
plt.title("Faltas after Log Transformation")
Out[39]:
Text(0.5, 1.0, 'Faltas after Log Transformation')
No description has been provided for this image

Feature Transformation¶

To incorporate qualitative features into our linear regression model, we must transform the categorical variables escuela, sexo, and internet into a numerical format. We utilize One-Hot Encoding via pd.get_dummies(), which converts each category into a binary indicator. We apply the transformation to the variables stored in cat_cols. Instead of manual looping, pd.get_dummies efficiently processes the specified columns and merges them back into our primary dataframe. We set drop_first=True to ensure the model remains mathematically sound. By representing $k$ categories with $k-1$ dummy variables, we prevent perfect multicollinearity (when one feature can be perfectly predicted from the others). The "dropped" category (e.g., Female or No Internet) serves as our reference baseline, against which the coefficients of the remaining categories are interpreted. Aother note is that the dtype is set to int to allow a numerical binary flag (0 or 1) and not a True/False

Now our grade_df become grades_df_dummies, a dataframe with new columns named escuela_MS, sexo_M, and internet_yes.

In [40]:
# generate the dummy varaibles, concat the new columns and drop the initial and the double
grades_df_dummies = pd.get_dummies(grade_df, columns=cat_cols, drop_first=True, dtype=int)

# print head to check if its right
grades_df_dummies.head()
Out[40]:
Edad HorasDeEstudio Reprobadas Faltas G1 G2 G3 Escuela_MS Sexo_M Internet_yes
0 18 2 0 1.945910 5 6 6 0 0 0
1 17 2 0 1.609438 5 5 6 0 0 1
2 15 2 3 2.397895 7 8 10 0 0 1
3 15 3 0 1.098612 15 14 15 0 0 1
4 16 2 0 1.609438 6 10 10 0 0 0

Correlation Analysis¶

Having finalized our outlier mitigation and categorical encoding, we now transition to Correlation Analysis. Our primary objective is to build a model that identifies behavioral and environmental drivers of academic success, rather than one that merely replicates previous testing trends.

To prepare for automated feature selection, we utilize a correlation matrix to perform a baseline diagnostic check. This allows us to quantify the linear relationships between our predictors and the target variable (G3), while simultaneously screening for Multicollinearity. We calculated the Pearson Correlation Coefficient using the .corr() function and visualized the results using a Seaborn heatmap with the annotations set to True to see the value of the correlation. This coefficient provides a standardized value between -1 and 1:

  • 1.0: A perfect direct relationship (e.g., as Study Hours increase, Grades increase).
  • 0.0 : no linear pattern exists.
  • -1.0 : A perfect inverse relationship (e.g., as Absences increase, Grades decrease).
In [41]:
# generate the matrix, we dont need just positives or negatives
corrs_matrix = grades_df_dummies.corr()

# fill the diagonal with ceros
np.fill_diagonal(corrs_matrix.values, 0)

# import seaborn
import seaborn as sns

#visualize with a heatmap
sns.heatmap(corrs_matrix, annot=True, cmap='coolwarm', fmt=".2f")
Out[41]:
<Axes: >
No description has been provided for this image

Observations

  1. Multicollinearity: Both g1 and g2 are very predictive of g3. But g1 and g2 are also strongly correlated with each other thus we have to do something as there's a multicollinearity risk. So the issue is not that G1 and G1 correlate with G1; that’s actually great. The issue is G1 ↔ G2 = 0.85, which means they contain overlapping information. This can cause:
    • Unstable coefficients in linear models
    • Harder interpretation
    • Redundant information
  2. Main Predictor: With a correlation of 0.90 the grade usually obtained in the second period (G2) is the single most powerful predictor of the final grade (G3). This represents Momentum, aka if a student performing well mid-semester rarely crashes at the end. The heatmap shows G1 is also strong (0.80), but G2 is the superior version of the same signal.
  3. Low Inter-Predictor Correlation: Aside from the main correlation squares, the matrix shows relatively low correlations between all other independent variables. The second to highest observed value is 0.38 (between escuela and Edad). Since this relationship is likely demographic rather than structural, it does not pose a threat to our model’s stability.
  4. Academic Effect: As hypothesized, there is a clear inverse relationship of -0.36 between failed courses (Reprobadas) and the final grade (G3). This confirms that previous academic friction is a significant negative predictor of future success.
  5. Gender Study Patterns: We observe a correlation of 0.31 regarding study habits and gender. This, based on our encoding, suggests a specific group, females, tends to allocate more time to work, providing a subtle but relevant predictive signal.

Dealing with Multicollinarity¶

This is a grand issue in which a lot of possible solutions can happen even depending on the model for random forests, neural networks and even in some cases we could keep both. However, for regression models the possible solutions are to either drop one, do a regularization or combine them into a single one to keep both of their data just transformed. Thus, due to the high correlation between G1 and G2 (r = 0.85) and both variables represent midterm exam performance, they were combined into a single feature representing overall midterm achievement. This reduced redundancy, improved coefficient stability, enhanced interpretability of the multivariable linear regression model and produces a more reliable representation of overall midterm performance.

The result of the combination is sored on a 'G1_G2' variable that averages both grades to say, if student X did good in past performance of the semester, they will proably do good in the final grade. We appended this variable in the grades_df_dummies dataframe.

In [42]:
# combined of correlated G1 and G2 variable to not increase the size
grades_df_dummies['G1_G2'] = (grades_df_dummies['G1'] * grades_df_dummies['G2']) / 2

# drop the combined columns
grades_df_dummies = grades_df_dummies.drop(columns=['G1', 'G2'])

Training and Testing of Model¶

To evaluate our model reliably and prevent data leakage, we partitioned the dataset following standard machine learning practices. We divided the data into three distinct subsets, ensuring that the model is never evaluated on data it has already seen during training.

The data roles are defined as follows:

  • Training Set: Used to fit the candidate models (finding the coefficients).
  • Validation Set: Used for feature selection and hyperparameter tuning (comparing Model A vs. Model B).
  • Test Set: Reserved exclusively for the final performance assessment.

We utilized the train_test_split function from scikit-learn. We set test_size=0.2 (allocating 20% for testing) and fixed the random_state=42. This ensures reproducibility, guaranteeing that our splits remain consistent every time the code is run, which is crucial for valid standardization and comparisons.

Resulting Variables: X_train, X_val, X_test, y_train, y_val, y_test

In [43]:
# import sklearn
from sklearn.model_selection import train_test_split

# create the subsets
X = grades_df_dummies.drop('G3', axis=1)
y = grades_df_dummies['G3']

# split the data 80/20
X_train_main, X_test, y_train_main, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# split the train data into train and validation
X_train, X_train_val, y_train, y_train_val = train_test_split(X_train_main, y_train_main, test_size=0.2, random_state=42)

# print the size of the objects
print(f"Train Sample\n {X_train.shape}, {y_train.shape}")

print(f"\nTrain Validation Sample\n {X_train_val.shape}, {y_train_val.shape}")

print(f"\nTest Sample")
X_test.shape, y_test.shape
Train Sample
 (252, 8), (252,)

Train Validation Sample
 (64, 8), (64,)

Test Sample
Out[43]:
((79, 8), (79,))

To quantitatively identify the most predictive variables, we calculated the Pearson correlation coefficient ($r$) between each independent feature in the training set (X_train) and the target variable (y_train).We iterated through every feature column, computing the correlation statistic using scipy.stats. Crucially, we calculated the absolute value (abs) because for feature selection, the strength of the relationship matters more than the direction; a strong negative correlation (e.g., -0.8) is just as predictive as a strong positive one.

The correlation vector calculation creates a ranked list of features based on their linear relationship with the target. We calculate $n$ (number of features) and create an empty vector to store our scores. We then iterate through every column in X_train and we calculate the Pearson correlation with y_train. This was done because this works as a filter, it tells you who is worth inviting to the team.

The resulting coefficients were mapped to their respective column names (using a pandas Series function) and sorted in descending order. This generates a prioritized "leaderboard", saved in the corr_results series, allowing us to identify the top predictors at a glance.

Results

After computing the Pearson correlation coefficient ($|r|$) for all candidates, we ranked the features by their "power" relative to the final grade (G3).

  1. G1_G2 ($r \approx 0.87$) is the strongest feature and predictor. It confirms that a student's performance in the final period is highly dependent on their performance in previous periods.
  2. Reprobadas ($r \approx 0.40$) failures in the past are the second most important signal. While the calculation shows the absolute magnitude, logically this relationship is negative (more failures $\rightarrow$ lower grades). It serves as a strong definer for long-term academic struggle.
  3. Faltas ($r \approx 0.20$), aka Absences, show a moderate impact. This variable captures student engagement, though its influence is half that of past failures.
  4. Edad ($r \approx 0.16$) & Sexo_M ($r \approx 0.14$): Demographic factors like Age and Gender have a low predictive power. This suggests that the model is evaluating students primarily on their merit and history, rather than their identity.
  5. HorasDeEstudio ($r \approx 0.11$): Surprisingly, reported study time is a weak predictor. This implies that "quality" of study or natural aptitude may be more relevant than the raw quantity of hours spent.
  6. Internet ($0.07$) & Escuela ($0.03$): These variables have correlations near zero. Whether a student has internet at home or which specific school they attend provides almost no information about their final grade in this dataset.

Conclusion for Feature Selection: Based on this ranking, we can safely prioritize the top 3-4 features (G1_G2, Reprobadas, Faltas, and potentially Edad) for the model. Variables like Internet and Escuela are likely noise and can be excluded to reduce model complexity.

In [44]:
# import pearson r
import scipy.stats as stats

# get the amount of variables (n)
n = X_train.shape[1]

# initialize with zeros
corr_vector = np.zeros(n)

# get corr between y and each X trai column
for i in range(n):
  # pearsonr returns (statistic, pvalue)
  corr_vector[i] = abs(stats.pearsonr(X_train.iloc[:, i], y_train)[0])

# create a Pandas Series: Data=Your Vector, Index=Column Names
corr_results = pd.Series(corr_vector, index=X_train.columns)

# order the results from biggest to smallest
corr_results = corr_results.sort_values(ascending=False)

# reults of the array
print(corr_results)
G1_G2             0.867706
Reprobadas        0.395822
Faltas            0.203502
Edad              0.158143
Sexo_M            0.135934
HorasDeEstudio    0.115188
Internet_yes      0.073236
Escuela_MS        0.036210
dtype: float64

Now that we have our features ranked by correlation, it’s time to see exactly how many of them we actually need. We don't want to guess, so we are going to build a loop to test them one by one and see where the model stops improving. Here we initialize an array of zeros to store our scores. Then, we loop through our sorted features list. In every step/iteration, we add the next best variable to our X_temp, fit a new OLS model, and grab the Adjusted R-squared.

Why Adjusted R-squared?

We use this instead of the normal $R^2$ because we need that penalization. If we just used normal $R^2$, the line would keep going up forever even if we added junk data. The Adjusted version acts as a filter, it penalizes the score every time we add a variable. This lets us see if the new feature actually adds value or if it's just adding complexity for no reason. Finally, we generate a graph to visualize the trend. We are looking for the specific point where the line stops sloping upwards (often called the "elbow"). That peak tells us the perfect amount of variables to keep before the model starts overfitting.

In [45]:
# import library
import statsmodels.api as sm

sorted_features = corr_results.index.tolist()

# initialize with zeros the array
n = len(sorted_features)
r_adjusted = np.zeros(n)

# loop to get the r adjusteds for the models
for i in range(n):
  current_features = sorted_features[:i+1]

  # get the temporary features
  X_temp = X_train[current_features]

  # define the model
  temp_model = sm.OLS(y_train, sm.add_constant(X_temp))

  # fit the model
  temp_results = temp_model.fit()

  # save the radjusted variable for this
  r_adjusted[i] = round(temp_results.rsquared_adj, 5)

# print resils
print(f"{r_adjusted}\n")

# generate a visualization for it
plt.plot(r_adjusted,'o-', c='purple')
[0.75193 0.76263 0.79354 0.80211 0.80368 0.80296 0.80287 0.80261]

Out[45]:
[<matplotlib.lines.Line2D at 0x7d3a74db5b20>]
No description has been provided for this image

Automated Revision¶

Now that we have done our manual analysis with the correlation loop, we can verify the findings. We want to see if a premade function picks the exact same variables that we identified. To do this, we use SequentialFeatureSelector (SFS) from Scikit-Learn. SFS acts like a robot that tests team combinations. We set direction='forward', so it starts with 0 features. It tests every single variable to see which one works best with our base LinearRegression() model. Once it picks the first variable, it loops again to find the best second match to match the captain, and so on, until it reaches the number we asked for (n_features_to_select=4).

We apply sfs.fit(X_train, y_train) because the selection process is part of the training. If we used the validation data to pick features, we would be "cheating" (Data Leakage) because we are tailoring the model to the test it hasn't taken yet.

After the results we looked at correlations and the "Elbow" graph results vs the computer mathematically calculates the best combination. The library selected ['Edad', 'Reprobadas', 'Faltas', 'G1_G2'] which effectively confirms our manual analysis.

In [46]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

model = LinearRegression()

# do the forward selection
sfs = SequentialFeatureSelector(model, n_features_to_select=4, direction='forward')

# fit the model
sfs.fit(X_train, y_train)

selected_columns = X_train.columns[sfs.get_support()].tolist()

# Create the clean datasets with ONLY the selected features
X_train_selected = X_train[selected_columns]
X_validation_selected = X_train_val[selected_columns]

print(selected_columns)
['Edad', 'Reprobadas', 'Faltas', 'G1_G2']

Now to understand how our main features work we have to analyze it. While sklearn is great for making predictions, it acts like a "black box" it doesn't easily show us the detailed math. So, we are switching to statsmodels for this step. This library gives us the metrics (like P-values and confidence intervals) that we need to validate our choices.

The process of the model sklearn adds the intercept ($b_0$) automatically behind the scenes. statsmodels doesnt, it assumes the line goes through the origin $(0,0)$ unless we tell it otherwise. By using add_constant, we manually add a column of 1s to the data. This allows the model to calculate the Intercept (or bias). Then, we define the model using OLS (Ordinary Least Squares). This is the standard algorithm that tries to draw the "best fit line" by minimizing the squared errors between the data points. Finally, we fit the model and print the summary. This table doesn't just tell us if the model is accurate ($R^2$); it tells us:

  • Coefficients: How many points each variable adds or subtracts from the grade.
  • P-values ($P>|t|$): Are these variables actually statistically significant, or did we just get lucky?
  • Condition Number: Warnings about Multicollinearity.
  • F-statistic: tells us how significant this selection of variables is at predicting.

Results:¶

  1. The R-squared tells us how much of the variance is explained by the model; it indicates that we can explain more than 80% of the data variability. This is very high, and thus we can say that the model is quite accurate.
  2. The F-statistic of 255.3 is overwhelming evidence. Combined with the Prob (F-statistic) of 2.78e-85 (which is practically zero), we can confidently reject the null hypothesis.
  3. The individual p-values tell us that all the variables are good predictors, as they contribute statistically significant information. All of the values are well below the 0.05 threshold.
  4. However, we observe that the condition number is large, meaning there is likely multicollinearity among our variables. The value of 1.11e+03 implies that the model is highly sensitive to noise. Small changes in the training data could cause massive, erratic swings in the coefficients, making the model untrustworthy.
In [47]:
import statsmodels.api as sm

# define the variable now (this helps s get the statistics)

X_train_stats = sm.add_constant(X_train_selected)

# define the model
model = sm.OLS(y_train, X_train_stats)

# fit the model
ols_model = model.fit()

# lets print the summary with the variables both we and the forward selection function gave
print(ols_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     G3   R-squared:                       0.805
Model:                            OLS   Adj. R-squared:                  0.802
Method:                 Least Squares   F-statistic:                     255.3
Date:                Sun, 01 Feb 2026   Prob (F-statistic):           1.78e-86
Time:                        14:50:22   Log-Likelihood:                -542.86
No. Observations:                 252   AIC:                             1096.
Df Residuals:                     247   BIC:                             1113.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          9.3517      1.873      4.994      0.000       5.663      13.040
Edad          -0.3822      0.112     -3.427      0.001      -0.602      -0.163
Reprobadas    -0.5512      0.191     -2.880      0.004      -0.928      -0.174
Faltas         0.8266      0.123      6.693      0.000       0.583       1.070
G1_G2          0.1005      0.004     27.657      0.000       0.093       0.108
==============================================================================
Omnibus:                       34.259   Durbin-Watson:                   1.947
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               47.390
Skew:                          -0.866   Prob(JB):                     5.12e-11
Kurtosis:                       4.230   Cond. No.                     1.11e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.11e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

After seeing the multicollinearity it is important to analyse the individual relationships. VIF measures how much a variable is "copying" the other variables. VIF = 1, The variable is unique and brings 100% new information. A VIF > 10, the variable is useless noise. We coded it using the variance_inflation_factor function from the statsmodels library. As we can't check the whole table at once, so we built a list comprehension (a fast loop) that goes through every column i in our training data (X_train_selected). Thw ooutput is saved into a clean Pandas DataFrame so we can see the "Guilty" variables next to their names.

Results:¶

We observed that Edad (Age) had a VIF score significantly higher than the safety threshold (likely around 6 or 7). Which confirms that Age is statistically redundant. The model can already figure out a student's "maturity/risk" level just by looking at Reprobadas (Failed Courses) and Grades.

Thus, since Reprobadas is a stronger predictor (lower P-value), these result gives us the mathematical proof we need to drop Edad from the model.

In [48]:
# check fo the multicollinearity because of the coefficient of Faltas
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_data = pd.DataFrame()
vif_data["feature"] = X_train_selected.columns

vif_data["VIF"] = [variance_inflation_factor(X_train_selected.values, i)
                          for i in range(len(X_train_selected.columns))]
print(vif_data)
      feature       VIF
0        Edad  6.871791
1  Reprobadas  1.409975
2      Faltas  2.709557
3       G1_G2  4.321990

Second Model Evaluations¶

After removing the redundant variable Edad, we re-trained the model using only our top 3 feature variables: Reprobadas, Faltas, and G1_G2. We now inspect the ols_model_2.summary() to see if the trade-off was worth it.

Observations

  1. High Explanation Power ($R^2 = 0.796$): The model explains nearly 80% of the variance in the final grades. Despite removing Edad, the predictive power remains virtually identical to the 4-variable model (0.805), confirming that the removed variable contributed with not much vluable information.
  2. Statistical Significance ($F = 322.6$): The F-statistic increased significantly (from ~255 to 322.6), indicating that this 3-variable combination is more efficient than the previous model.
  3. Strong-er Individual Predictors ($P < 0.001$): All three remaining variables show P-values effectively at zero.
  4. G1_G2 (Coef: 0.10): Remains the dominant driver; for every point increase in previous grades, the final grade rises by 0.10.
  5. Reprobadas (Coef: -0.66): The negative impact of past failures is now more pronounced, suggesting it fully absorbed the "struggle" signal previously split with Age.
  6. Faltas (Coef: 0.77): Retains its positive coefficient, confirming the unique "high-performer absenteeism" pattern in this specific dataset.
  7. Improved Stability (Cond. No. = 212): This is the critical improvement. The Condition Number dropped from ~1100 to a stable 212. By eliminating Multicollinearity, the model's coefficients are less sensitive to small fluctuations in the input data.
In [49]:
import statsmodels.api as sm

# define the variable now (this helps s get the statistics)
X_train_stats_2 = sm.add_constant(X_train[['Reprobadas', 'Faltas', 'G1_G2']])

# define the model
model_2 = sm.OLS(y_train, X_train_stats_2)

# fit the model
ols_model_2 = model_2.fit()

# lets print the summary with the variables both we and the forward selection function gave
print(ols_model_2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     G3   R-squared:                       0.796
Model:                            OLS   Adj. R-squared:                  0.794
Method:                 Least Squares   F-statistic:                     322.6
Date:                Sun, 01 Feb 2026   Prob (F-statistic):           2.78e-85
Time:                        14:50:22   Log-Likelihood:                -548.71
No. Observations:                 252   AIC:                             1105.
Df Residuals:                     248   BIC:                             1120.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.0467      0.355      8.571      0.000       2.347       3.747
Reprobadas    -0.6610      0.193     -3.429      0.001      -1.041      -0.281
Faltas         0.7745      0.125      6.187      0.000       0.528       1.021
G1_G2          0.1008      0.004     27.153      0.000       0.093       0.108
==============================================================================
Omnibus:                       42.281   Durbin-Watson:                   1.987
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               71.537
Skew:                          -0.924   Prob(JB):                     2.92e-16
Kurtosis:                       4.844   Cond. No.                         212.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Validation & Comaprison¶

Now we arrive to testing our models against the Validation Set. This data has been left "untouched" during training, allowing us to honestly evaluate how effective our models are at predicting grades for students they have never seen before.

We calculated two key error metrics to assess performance:

  1. RMSE (Root Mean Squared Error): Measures the "strict" error, punishing large outliers.
  2. MAE (Mean Absolute Error): Measures the average "distance" between our prediction and the real grade.

We used the sklearn.metrics library to compute mean_squared_error and mean_absolute_error. We fed the functions our actual values (y_train_val) and our predicted values (y_val_predictions), then used numpy to take the square root for the RMSE. Finally, we calculated the Error Percentage relative to the total data range to put the mistake in perspective ($y.max() - y.min()$).

The Results:

  • Model 1 (4 Variables): RMSE ~1.94 | MAE ~1.51 | Error ~7.96%
  • Model 2 (3 Variables): RMSE ~1.96 | MAE ~1.50 | Error ~7.91%
In [50]:
# add constant to se the stats
X_validation_selected = sm.add_constant(X_validation_selected)

# make the precictions with the previously done model
y_val_predictions = ols_model.predict(X_validation_selected)

# import the library for the MAE RMSE metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
rmse = np.sqrt(mean_squared_error(y_train_val, y_val_predictions))

# get the mae, nicer as it avgs the mistakes
mae = mean_absolute_error(y_train_val, y_val_predictions)

# get the data range to see hwo big it is
data_range = y_train_val.max() - y_train_val.min()

print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"Data range: {data_range}")
print(f"Error %: {mae / data_range * 100:.2f}")
RMSE: 1.9419
MAE: 1.5126
Data range: 19
Error %: 7.96
In [51]:
# add constant to se the stats
X_validation_selected_2 = sm.add_constant(X_train_val[['Reprobadas', 'Faltas', 'G1_G2']])

# make the precictions with the previously done model
y_val_predictions_2 = ols_model_2.predict(X_validation_selected_2)

# import the library for the MAE RMSE metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
rmse = np.sqrt(mean_squared_error(y_train_val, y_val_predictions_2))

# get the mae, nicer as it avgs the mistakes
mae = mean_absolute_error(y_train_val, y_val_predictions_2)

# get the data range to see hwo big it is
data_range = y_train_val.max() - y_train_val.min()

print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"Data range: {data_range}")
print(f"Error %: {mae / data_range * 100:.2f}")
RMSE: 1.9635
MAE: 1.5037
Data range: 19
Error %: 7.91

Testing¶

Finally, after verifying our model with the validation set, we must now test it against the Test Data (X_test). This data was completely isolated from the start, ensuring that our final metrics represent how the model would perform in the real world on completely new students.We applied our winning model (Model 2: Reprobadas, Faltas, G1_G2) to the test set using the same error calculations as before:

  • $R^2$ Score: Calculated using the sklearn.metrics function r2_score() to measure how well the model explains the variance.
  • MAE & RMSE: To quantify the magnitude of the error.

Results:

  1. High Explanatory Power ($R^2 \approx 0.79$): The model maintained its strength, successfully explaining roughly 79% of the variation in student grades even on new data.
  2. Error Rate (~9%): The Mean Absolute Error (MAE) and RMSE show a slight increase in error compared to the validation set, bringing the error percentage relative to the data range to approximately 9%.This, however, is statistically normal when moving to the test set. It reflects natural data variation and confirms that while the model is not perfect, it generalizes very well without severe overfitting.
In [52]:
# final testing with test data
features = ['Reprobadas', 'Faltas', 'G1_G2']

# The Final Exam
X_test_final = sm.add_constant(X_test[features])
y_test_pred = ols_model_2.predict(X_test_final)

# data range
data_range = y_test.max() - y_test.min()

# mae caulcation
mae = mean_absolute_error(y_test, y_test_pred)

from sklearn.metrics import r2_score, mean_squared_error

print(f"R2: {r2_score(y_test, y_test_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_test_pred)):.4f}")
print(f"MAE: {mae:.4f}")
print(f"Data range: {data_range}")
print(f"Error %: {mae / data_range * 100:.2f}")
R2: 0.7359
RMSE: 2.3269
MAE: 1.7059
Data range: 19
Error %: 8.98

Conclusion¶

In conclusion, both models proved to be highly effective, hovering around an 8% error rate, which is an excellent result for behavioral data. The difference in raw accuracy between the 4-variable model and the 3-variable model was negligible. However, the deciding factor and a crucial takeaway for future modeling was Stability. Model 1 (4 Variables) technically had a slightly lower RMSE, but suffered from a high Condition Number (warning of Multicollinearity). While, Model 2 (3 Variables) achieved kinda the same accuracy but with a superior Condition Number (212 vs. ~1100). Thus selecting model 2 as the 'winner' confirms the principle of Parsimony (Occam's Razor) a simpler model is often better. By removing the redundant variable (Edad), we eliminated the risk of overfitting and created a robust tool that will likely perform better on new, unseen data.

This process highlighted that Data Science is not just about coding; it is about building a narrative, and thus there's great power in preparation and visualization. The Exploratory Data Analysis (EDA) was critical. For instance, in the case of Absences (Faltas), the Box Plot saved us from making a mistake. Without it, we might have blindly applied "Tukey's Method" and dropped high values as "outliers." The visualization revealed that the data was merely skewed, not impossible, allowing us to keep valuable information and could be solved by simple transformations. Another important thing that was first applied here and showed its value thorugh results was the strict division into Training, Validation, and Testing layers. Without the validation set, we would have been choosing features based on bias. This approach ensured that our final error metrics (~9% on the test set) were honest and realistic.

Limitations¶

The primary limitation of this project lies in the model architecture itself, the Linear Regression. We assumed that all relationships are straight lines. In human behavior, this is rarely true especially when talking about behavioural factors. For example, we found that study time was a weak predictor. This might be because the relationship is non-linear. There is likely a "diminishing return" going from 0 to 2 hours of study helps a lot, but going from 8 to 10 hours adds very little value. A linear model cannot capture this "curved" relationship.

Further Study¶

To improve upon this baseline, future iterations of this project could focus on:

  • Non-Linear Models: Implementing algorithms like Random Forests or Polynomial Regression could capture complex patterns that our linear model missed.
  • Interaction Features: We could investigate if other variables work together. For example, Study Time might only be significant for students who have previous Failures. Creating an interaction term (StudyTime * Failures) could reveal these hidden dependencies.
  • Expanded Data Collection: The current "Noise" in the model (the unexplained variance) suggests we are missing key drivers. Collecting data on socio-emotional factors (sleep quality, mental health or others) could significantly reduce the error rate further.