In [100]:
# Give Access to google drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Introduction¶

"Happiness is complex and multifaceted. While economic prosperity (GDP) is often cited as a key determinant, can it alone explain national well-being? This analysis explores in this notebook I will go beyond a Exploratory Data Analysis (EDA), as this will lead into a single variable linear and multivariable linear regression both to get some inferences (or some predictions)

In [101]:
# import the library
import pandas as pd

# read the csv into a variable
csv_path = "/content/drive/My Drive/Colab Notebooks/AI/A1.2 Felicidad y GDP.csv"
happiness_index_data = pd.read_csv(csv_path)

# print the first few lines to test the imports
happiness_index_data.head()
Out[101]:
Pais Felicidad GDP
0 Finland 7.8210 2.718370e+11
1 Denmark 7.6362 3.560850e+11
2 Iceland 7.5575 2.171808e+10
3 Switzerland 7.5116 7.522480e+11
4 Netherlands 7.4149 9.138650e+11

Data Exploration¶

To understand the data we are dealing with, how "clean" it is, and what our variables will be, we begin by checking the type of each column (numerical or categorical) and the number of nulls found in each one. Using the .info() function, we verified the data schema and memory allocation. The dataset consists of 141 observations (rows), with no missing values detected across our primary features. This high level of data completeness ensures that our regression coefficients will not be biased by incomplete records.

In our data we have the following column categories:

  • Categorical -> Pais (country). This serves as our primary index and geographic identifier.
  • Numerical -> Felicidad (happiness) and GDP. These are the quantitative metrics that will form our Target ($y$) and Predictor ($X$) variables, respectively.
In [102]:
# lets start with seeing the nature of the columns
happiness_index_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141 entries, 0 to 140
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Pais       141 non-null    object 
 1   Felicidad  141 non-null    float64
 2   GDP        141 non-null    float64
dtypes: float64(2), object(1)
memory usage: 3.4+ KB

Now, to see the scale and distribution of the numerical data, we apply the .describe() function, which summarizes the data and lets us check whether it makes sense for the analysis; very large values or a GDP of zero could imply that the data is incorrect or needs some cleaning.

It is from this that we can identify that:

  • There is a large gap in the GDP data: while the mean GDP is roughly $\$588$ billion, the maximum is over $\$20.8$ trillion ($2.08 \times 10^{13}$). This gap confirms that a few "super-economies" (like the USA or China) are massive outliers compared to the rest of the 141 countries. If left unaddressed, the regression line will be "pulled" toward that $20$ trillion dollar point, ignoring the trend of the other $140$ countries. This makes it imperative that we apply a transformation to the values.
In [103]:
# describe the statistics of the numerical columns
happiness_index_data.describe()
Out[103]:
Felicidad GDP
count 141.000000 1.410000e+02
mean 5.560004 5.889942e+11
std 1.098011 2.221612e+12
min 2.403800 1.223876e+09
25% 4.887900 1.805117e+10
50% 5.585300 6.215800e+10
75% 6.309100 3.452960e+11
max 7.821000 2.089370e+13

To ensure the integrity of our economic indicators, we performed a validation check on the extreme values within our feature matrix. By isolating the boundaries of our dataset, we can confirm that the observed variances represent real-world economic disparities rather than data entry errors.

  1. Using the nlargest and nsmallest functions and saving the results into the richest_country and poorest_country variables, we printed and identified the countries at opposite ends of the economic spectrum:
    • Upper Bound: The two highest observations are the United States ($20.8$T) and China ($14.7$T). These values align with global economic realities, confirming the accuracy of our data source.
    • Lower Bound: The lowest GDP in our sample belongs to Comoros. Cross-referencing this against secondary economic research confirms its ranking as an emerging economy, validating the lower bound of our dataset.

Why do they matter?

The presence of these extreme values poses a critical challenge for a standard linear regression: because observations like the USA and China are so far from the mean, they can disproportionately "pull" the regression slope ($\beta_1$) toward themselves, resulting in a model that reflects their behavior rather than the general population.
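To make that "pull" concrete, here is a quick back-of-the-envelope check; this is a sketch using only the summary statistics reported by .describe() above, not part of the original pipeline:

```python
# Sketch: quantify how extreme the top GDP value is, using the
# summary statistics copied from the .describe() output above.
mean_gdp = 5.889942e11   # mean GDP
std_gdp = 2.221612e12    # standard deviation of GDP
max_gdp = 2.089370e13    # maximum GDP (United States)

# Distance of the maximum from the mean, in standard deviations
z_score = (max_gdp - mean_gdp) / std_gdp
print(f"The max GDP sits ~{z_score:.1f} standard deviations above the mean")
```

An observation sitting roughly nine standard deviations out will dominate a least-squares fit, which is exactly the distortion discussed above.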

Visualization

To quantify this dispersion, we use a combination of Matplotlib and Seaborn to generate a box plot with the sns.boxplot function. This visualization lets us see how the data is spread and confirms that the top economies are indeed statistical outliers that require specific handling.

In [104]:
# Isolate the two highest GDP countries to confirm context
richest_country = happiness_index_data.nlargest(2, 'GDP')
print(f"The maximum GDP of {richest_country['GDP'].values[0]:.2e} belongs to: {richest_country['Pais'].values[0]}")
print(f"The second maximum GDP of {richest_country['GDP'].values[-1]:.2e} belongs to: {richest_country['Pais'].values[-1]}")

# Isolate the poorest country
poorest_country = happiness_index_data.nsmallest(1, 'GDP')
print(f"\nThe lowest GDP of {poorest_country['GDP'].values[0]:.2e} belongs to: {poorest_country['Pais'].values[0]}")
The maximum GDP of 2.09e+13 belongs to: United States
The second maximum GDP of 1.47e+13 belongs to: China

The lowest GDP of 1.22e+09 belongs to: Comoros
In [105]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 2))
sns.boxplot(x=happiness_index_data['GDP'], color='lightpink')
plt.title('Visual Identification of GDP')
plt.show()
[Figure: box plot of the raw GDP distribution]

Transformations to the data¶

The summary statistics reveal a massive scale difference between the happiness values and the GDP, so viewing them on a single chart would obscure the linear relationship we are trying to establish. There are two possible choices, and while 'scaling' the GDP (e.g., dividing by $10^9$ to represent values in billions) improves human readability, it does not address the underlying mathematical skewness we see in the box plot and the GDP data in general.

So the choice was to explore a logarithmic transformation to improve model linearity.

Hence, in the next few cells we will import numpy to manage the transformations.

Skewness (in case the definition is blurry to you): a statistical measure that describes the asymmetry of a dataset's distribution around its mean. While a "normal" distribution looks like a perfectly symmetrical bell, skewed data is "leaned" or stretched to one side. It is especially important for linear regression, which relies on the data being roughly symmetrical.
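The effect of the log on skewness can be sketched with pandas' built-in .skew() method. The snippet below uses a synthetic log-normal sample (illustrative only, not the actual dataset) whose shape mimics the right-skewed GDP column:

```python
import numpy as np
import pandas as pd

# Synthetic, heavily right-skewed sample resembling the GDP column
# (log-normal, with parameters loosely based on the describe() output).
rng = np.random.default_rng(42)
gdp_like = pd.Series(np.exp(rng.normal(loc=25, scale=1.9, size=141)))

# A large positive skew collapses toward ~0 after the log transform
print(f"Skewness of raw values: {gdp_like.skew():.2f}")
print(f"Skewness of log values: {np.log(gdp_like).skew():.2f}")
```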

In [106]:
# import numpy to be able to manage the transformations
import numpy as np

# Create log-transformed GDP
happiness_index_data['log_GDP'] = np.log(happiness_index_data['GDP'])

# describe the data to see the differences
happiness_index_data.describe()
Out[106]:
Felicidad GDP log_GDP
count 141.000000 1.410000e+02 141.000000
mean 5.560004 5.889942e+11 25.155928
std 1.098011 2.221612e+12 1.895052
min 2.403800 1.223876e+09 20.925289
25% 4.887900 1.805117e+10 23.616476
50% 5.585300 6.215800e+10 24.852945
75% 6.309100 3.452960e+11 26.567668
max 7.821000 2.089370e+13 30.670469

Following the logarithmic adjustment of our economic data, we must now verify that this transformation has successfully addressed the asymmetry identified in the raw GDP values. This phase serves as the final step of our Data Understanding process.

In [107]:
# do the 1st scatter plot with the original data
plt.scatter(happiness_index_data['GDP'], happiness_index_data['Felicidad'], alpha=0.5, c='c')
plt.xlabel('Raw GDP (USD)')
plt.ylabel('Happiness Score')
plt.title('Happiness vs. Raw GDP')
# This forces the X-axis to show scientific notation (e.g., 1e13) for readability
plt.ticklabel_format(style='sci', axis='x', scilimits=(0,0))
plt.show()
[Figure: scatter plot of Happiness vs. raw GDP]
In [108]:
plt.scatter(happiness_index_data['log_GDP'], happiness_index_data['Felicidad'], alpha=0.5, c='m')
plt.xlabel('Logged GDP (USD)')
plt.ylabel('Happiness Score')
plt.title('Happiness vs. Transformed Log GDP')
plt.show()
[Figure: scatter plot of Happiness vs. log-transformed GDP]

Analysis¶

We examined the scatter plot of Felicidad vs. log_GDP to confirm whether the previously left-clustered relationship has been linearized. While the second scatter plot is not perfectly linear, the logarithmic scale has effectively "spread out" the data points. This expansion across the X-axis handles the exponential variance of the GDP, allowing for a better analysis of countries at all economic tiers. With this confirmed, we can now proceed to the predictive modeling phase.

Variable Definition¶

Thus, as mentioned before, it is convenient to start by defining that:

  • Target Variable ($y$): Happiness (Felicidad). This is our dependent variable, representing the subjective well-being we aim to predict and explain through our analysis.
  • Feature ($X$): The log-transformed GDP. This is our independent (predictor) variable. In our initial Simple Linear Regression, we treat GDP as the primary driver of national happiness.
In [109]:
X = happiness_index_data['log_GDP']
y = happiness_index_data['Felicidad']

Linear Regression¶

To see the relationship in the simplest way possible, we start by defining a single feature to predict and explain the behavior of our target variable. In this section, we move from visualization to Parameter Estimation. We will manually calculate the optimal coefficients for our linear model:$$Y = \beta_0 + \beta_1X + \epsilon$$

  • $\beta_1$ (Slope): Represents the "Happiness Return" for every unit increase in Log_GDP.
  • $\beta_0$ (Intercept): Represents the theoretical happiness level if Log_GDP were zero.

To ensure code maintainability and eliminate human error, we decompose the slope equation ($\beta_1$) into its two fundamental mathematical components: the Covariance (numerator) and the Variance (denominator). We use the numpy library's vectorized np.sum to perform the summations. The formulas implemented are as follows:

Slope ($\beta_1$): $\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$

Intercept ($\beta_0$): $\beta_0 = \bar{y} - \beta_1 \bar{x}$

In [110]:
# get the average of X
xBar = np.mean(X)

# get the average of y
yBar = np.mean(y)

# get the numerator of the function with the sum function
B1_num = np.sum((X - xBar) * (y - yBar))

# now get the denominator
B1_den = np.sum((X - xBar)**2)

# Calculate B1 as the ratio of the numerator to the denominator
B1 = B1_num/B1_den

# Calculate B0
B0 = yBar - (B1*xBar)

# Print the value of B0
print(f"B0: {B0}")

# Print the value of B1
print(f"B1: {B1}")
B0: -1.3023500570747268
B1: 0.27279272665849097

The Analysis¶

In this case our manual calculations for the minimized coefficients yield the following results:

  • $\beta_0$ = -1.3023500570747268. The line has a "bias" or starting point of about -1.3. On its own this intercept has no real-world meaning (a Log_GDP of zero is impossible); it simply anchors the trajectory of the line.
  • $\beta_1$ = 0.27279272665849097. This is more informative: the small slope makes sense because both x and y live on small numeric scales. The positive sign tells us that the relationship between the two variables exists and is positive. In the context of the original data, the slope reflects the significant effort required to raise a nation's average well-being (happiness).
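Because X is the natural log of GDP, the slope has a handy multiplicative reading: multiplying GDP by a factor $k$ adds $\beta_1 \ln(k)$ happiness points. A quick sketch, reusing the $\beta_1$ value printed above:

```python
import numpy as np

B1 = 0.27279272665849097  # slope obtained in the cell above

# Multiplying GDP by k shifts the predicted happiness by B1 * ln(k)
gain_doubling = B1 * np.log(2)    # GDP x2
gain_tenfold = B1 * np.log(10)    # GDP x10

print(f"Doubling GDP adds ~{gain_doubling:.2f} happiness points")
print(f"A tenfold GDP increase adds ~{gain_tenfold:.2f} happiness points")
```

So even a tenfold jump in GDP buys well under one point on the happiness scale, matching the "significant effort" reading of the slope.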

Now, with the optimal parameters calculated ($\beta_0$ and $\beta_1$), we can now construct our Line of Best Fit. This line represents the path that minimizes the total error between our estimation and the actual reported happiness scores.

To visualize this, we generate a new variable, $\hat{y}$ (y-hat). For every country in our dataset, $\hat{y}$ represents the Happiness Score our model expects them to have based on their GDP. We overlay this linear trend onto our existing scatter plot. This allows us to visually inspect the Goodness of Fit and observe how closely the national data points cluster around our calculated trajectory.

In [111]:
# Code to do the scatter plot again
plt.scatter(X, y, alpha=0.7, c='m')
plt.xlabel("Log GDP")
plt.ylabel("Happiness Score")

# Using X, B0, and B1, store the estimated Y values in a variable named yhat
yhat = B0 + B1*X

# Add the line representing the model with the plot function
plt.plot(X, yhat, c='r', linewidth=3, alpha = 0.5)

# Show the chart using the show function
plt.show()
[Figure: scatter plot with the fitted regression line overlaid]

As mentioned before, we can draw three main inferences:

  1. The line goes up, confirming that wealth is a driver of happiness. However, because the slope ($B_1$) is small, raising happiness requires an exponentially larger jump in GDP.
  2. The negative $B_0$ intercept is just the mathematical "anchor." In the real world, it suggests the relationship only becomes meaningful once a country reaches a certain economic size; below that, GDP is not the main driver (if a driver at all).
  3. Notice that many countries are still far from the line; these over- and under-performers indicate that GDP isn't the only factor (and maybe not even the main one).

Metrics & Revision¶

Now we will look into the metrics, computing the residual sum of squares (RSS) and saving it in the variable rss. This is an important metric: it measures the total amount of error between our fitted line and the actual points in the scatter plot. We will use numpy's vectorized np.sum as we did previously.

In mathematical terms, the Residual Sum of Squares is defined as:$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The result¶

In [112]:
# get each of the residuals between the "predicted" y and the actual dot
residuals = y - yhat

# calculate the sum of these residuals squared with the help of numpy
rss = np.sum(residuals**2)

# Lets print the value
print(f"RSS: {rss: 4f}")
RSS:  131.373832

The Standard Error (SE) acts as a measure of "stability." It tells us how much our slope ($\beta_1$) would fluctuate if we were to repeat this study with a completely different set of 141 countries.

  • High SE: Suggests our results might be due to "luck" or a specific quirk of this dataset.
  • Low SE: Suggests high reproducibility, meaning the relationship between wealth and happiness is a consistent global phenomenon.

Following the modular approach we used for the previous formulas, we decompose the $SE(\beta_1)$ equation. We use numpy for the square root and summation operations. The mathematical objective is to compare the "unexplained noise" (RSS) against the "spread" of our GDP data (the variance of X). The equation for the Standard Error of the slope is: $$SE(\beta_1) = \sqrt{\frac{RSS / (n - 2)}{\sum (x_i - \bar{x})^2}}$$

The result¶

To interpret this result: if we were to collect data from a different set of nations and recalculate the model, we would expect our slope ($\beta_1$) to vary by only about $\pm 0.0434$. Because this error is very small compared to the slope itself (0.2727...), we can conclude that our model is highly stable and not merely a result of 'statistical luck'. However, we will continue evaluating to test the results.

In [113]:
# we need the number of observations n, i.e., the length of the target vector
n = len(y)

# get the values for the denominator
denominator_SEB1 = np.sum((X - xBar)**2)

# get the SE with the sqrt from numpy
SEB1 = np.sqrt((rss/(n-2))/(denominator_SEB1))

print(f"SEB1: {SEB1}")
SEB1: 0.043357261652203986

To finalize our reliability analysis, we move from a single "best guess" to a Confidence Interval (CI). This metric defines a range of values that we are 95% certain contains the true relationship between wealth and happiness. We use the scipy.stats library to find this value precisely. The calculation follows these steps:

  1. Find the $t$-multiplier: We use st.t.interval to find the "cut-off" point on the bell curve where 95% of the data lives.
  2. Determine the Margin of Error: We multiply this $t$-multiplier by our Standard Error ($SE_{\beta1}$).
  3. Establish the Lower Bound ($CI_{low}$): The slope minus the margin of error.
  4. Establish the Upper Bound ($CI_{high}$): The slope plus the margin of error.

Results¶

Since the entire range is positive and does not include zero, we have statistically significant evidence of a relationship. The resulting confidence interval runs from 0.187068... to 0.3585177.... This interval is relatively tight (a spread of about $0.17$), showing that the relationship between GDP and Happiness is consistent across the 141 countries studied.

In [114]:
# import the scipy library
import scipy.stats as st

# get the t-multiplier (the ±97.5th-percentile cut-offs)
per = st.t.interval(confidence=0.95, df=n-2)

# get the lower bound
CIlow = B1 - (per[1] * SEB1)

# get the upper bound
CIhigh = B1 + (per[1] * SEB1)

# Print the Confidence interval
print(f"We are 95% confident the slope is between {CIlow} and {CIhigh}")
We are 95% confident the slope is between 0.18706771472547046 and 0.3585177385915115

The next step in our analysis is to determine the statistical significance of the relationship we've modeled. To achieve this, we calculate the t-statistic and the p-value, which allow us to quantify the reliability of our slope.

The t-statistic¶

The t-statistic acts as a "Signal-to-Noise" ratio. It measures how many times larger our "signal" (the relationship identified by the model) is compared to the "noise" (the random error or uncertainty in the data). We calculate it using our previously defined coefficients: $$t = \frac{\text{Signal}}{\text{Noise}} = \frac{\beta_1}{SE_{\beta_1}}$$

Our result (6.2917..) indicates that our signal is around 6 times stronger than the statistical noise, which represents a very strong and robust result.

The p-value¶

The p-value quantifies the probability of our slope being a product of "magic" or an amazing coincidence. It answers the following question:

"If there was actually ZERO relationship between money and happiness in the real world, what is the chance that our data would look like this by accident?"

We are testing against the Null Hypothesis ($H_0$), which assumes no relationship exists. In data science, we compare our result to a standard "alpha" threshold of 0.05. Our model yielded a p-value of 0.0000000038. Since this is significantly lower than the standard 0.05 threshold, we officially reject the null hypothesis. This means the probability that the observed relationship between GDP and Happiness is a mere coincidence is effectively zero.

In [115]:
# get the t - statistic
t = B1 / SEB1

# get the associated p-value
p_value = st.t.sf(np.abs(t), df=n-2) * 2

# show in console the result
print(f"T-statistic: {t:.4f}")

# print the p-value
print(f"P-value: {p_value:.10f}")
T-statistic: 6.2917
P-value: 0.0000000038

To finalize our study and quantify the quality of our predictions, we calculate the Residual Standard Error (RSE) and the Coefficient of Determination ($R^2$). These metrics allow us to explain exactly how much of the "Happiness" behavior our model actually captures.

Residual Standard Error (RSE)¶

The RSE represents the average distance by which the actual data points fall from our regression line; in other words, the "average miss" of our model. To calculate it, we take the RSS, divide it by the degrees of freedom ($n - 2$: our observations minus the two estimated parameters, $B_0$ and $B_1$), and take the square root using np.sqrt.

$$RSE = \sqrt{\frac{RSS}{n - 2}}$$

This is a great metric for human comparison because it is measured in the same units as our target. If our RSE is 0.5, it means our happiness predictions are, on average, off by half a point on the 0-10 scale.

Total Sum of Squares (TSS)¶

TSS measures the total variation in happiness scores before we even consider GDP. It represents the "error" we would have if we didn't have a model at all and simply guessed the average happiness for every country. $$TSS = \sum (y_i - \bar{y})^2$$

R-Squared ($R^2$)¶

The $R^2$ is the "Final Grade" of our model. It tells us the percentage of the behavior in the Happiness Index that is successfully explained by our GDP variable.$$R^2 = 1 - \frac{RSS}{TSS}$$

In our results the $R^2$ of 0.22 means that 22% of the variation in global happiness is explained by economic output. The remaining 78% is "unexplained" by this specific model (likely due to other factors).

In [116]:
# get the rse through the sqrt of rss / (n - 2)
rse = np.sqrt(rss/(n-2))

# get the tss, the total sum of squares
tss = np.sum((y-yBar)**2)

# get the r squared by dividing the rss by the tss
r2 = 1 - (rss/tss)

# print the results
print(f"RSE: {rse}")
print(f"R2: {r2}")
RSE: 0.9721807858537376
R2: 0.22166361654970634

Final Conclusions on Single Variable Linear Regression¶

Our model yielded an RSE of 0.97 and an $R^2$ of 0.22. These results provide a nuanced conclusion to our research:

  • The Economic Influence: An $R^2$ of 0.22 confirms that Log-GDP is a statistically significant driver of well-being, accounting for roughly 22% of the variance in global happiness. However, the remaining 78% of the variance remains unexplained by this single-feature model. This suggests that while economic output provides a foundation for happiness, the majority of national well-being is determined by non-economic factors.
  • Model Precision: With an average 'miss' (RSE) of nearly 1.0 happiness point, our model serves as a strong baseline, but highlights the need for Multiple Linear Regression to incorporate other socio-political variables for a more complete picture.

Multivariate Linear Regression¶

Extensions of the Data¶

To make the model better, additional variables had to be introduced. To select the best ones, a review of academic results was carried out; after looking into many economic, political, and social factors, the following 3 variables were selected. They come from studies of the happiest countries, like Finland [1], where the ratings in these areas are very good:

  1. CO₂ Emissions per Capita

    • Justification: Environmental quality impacts psychological well-being. High pollution/emissions can indicate poor air quality, climate anxiety, and lower quality of life (this is a personal choice, backed by research indicating that nowadays it does affect people's lifestyles [2]).
    • Expected Relationship: Potentially negative or U-shaped (very poor countries emit little, very rich countries emit a lot but may have better environmental management)
  2. Corruption Perceptions Index (CPI)

    • Justification: Government transparency and trust in institutions are fundamental to societal well-being. High corruption erodes social trust and public services. Beyond perception, corrupt institutions force people to endure shortcomings and fight to overcome them, directly affecting lifestyles and culture [3].
    • Expected Relationship: Negative (higher corruption → lower happiness)
  3. Life Expectancy at Birth

    • Justification: This is a way to measure overall health infrastructure, access to healthcare, and quality of life. People in healthier societies tend to report higher well-being. Studies show that good health is a prerequisite for happiness, and access to care reduces insecurity, enhancing overall well-being [4].
    • Expected Relationship: Positive (longer life → higher happiness)

The first step of this multivariate analysis is to load the required datasets: we will integrate into our current dataframe the three additional variables grounded in socio-economic theory. We do that in the following cell.

Source of the data: Our World in Data [5][6][7]
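Once each dataset is cleaned, the integration can be sketched with pandas' merge on the country name. This is a minimal illustration with synthetic rows (the CO₂ values below are placeholders): the happiness frame keys on 'Pais', while the cleaned OWID frames will key on 'Country':

```python
import pandas as pd

# Tiny synthetic frames standing in for the real, cleaned datasets
happiness = pd.DataFrame({'Pais': ['Finland', 'Denmark', 'Iceland'],
                          'Felicidad': [7.8210, 7.6362, 7.5575]})
co2 = pd.DataFrame({'Country': ['Finland', 'Denmark'],
                    'Annual CO2 emissions': [1.0e7, 2.0e7]})  # placeholder values

# An inner merge keeps only countries present in both sources,
# which avoids NaNs in the final feature matrix.
merged = happiness.merge(co2, left_on='Pais', right_on='Country', how='inner')
print(merged)
```

With how='inner', Iceland is dropped because it has no matching CO₂ row; the actual merge strategy (inner join vs. left join plus imputation) is a design choice for the real analysis.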

In [117]:
# load the CO2 csv
co2_path = "/content/drive/My Drive/Colab Notebooks/AI/co2-emissions.csv"
co2_data = pd.read_csv(co2_path)

# Load Corruption Index data
corruption_path = "/content/drive/My Drive/Colab Notebooks/AI/corruption-perception-index.csv"
corruption_data = pd.read_csv(corruption_path)

# Load Life Expectancy data
life_exp_path = "/content/drive/My Drive/Colab Notebooks/AI/life-expectancy.csv"
life_exp_data = pd.read_csv(life_exp_path)

print("Successfully loaded 3 external datasets")
print(f"  CO₂ Emissions: {len(co2_data)} rows")
print(f"  Corruption Index: {len(corruption_data)} rows")
print(f"  Life Expectancy: {len(life_exp_data)} rows")
Successfully loaded 3 external datasets
  CO₂ Emissions: 197 rows
  Corruption Index: 181 rows
  Life Expectancy: 201 rows

CO2 Emissions exploration and understanding¶

Before merging this with our happiness data, we must perform a structural audit to ensure the datasets are compatible and clean.

Steps to take to understand and see data¶

  1. We begin by identifying the "labels" and "types" of our data using columns.tolist() and dtypes. This allows us to confirm that our key (the Country) and our target (CO2 levels) are in the correct format for mathematical operations.
  2. A common error in regression is a year mismatch (comparing 2010 with 2022 happiness). We use the unique() function combined with sorted() to see exactly which years are represented. In this dataset, the output shows only 2022. Because all the data belongs to a single year, the Year column becomes a "constant." This is great news—it means we can safely drop it later as it doesn't provide any extra information for our regression.
  3. Before merging, we must check for "holes" or nulls in our data. We use the .info() function, which gives us a high-level summary of the Non-Null Count. Fortunately, our summary shows 197 non-null entries, meaning our environmental data is complete for all countries listed.
In [118]:
# Check column names
print("\nColumn names:")
print(co2_data.columns.tolist())
Column names:
['Entity', 'Code', 'Year', 'Annual CO₂ emissions', 'time']
In [119]:
# Check data types
print("\nData types:")
print(co2_data.dtypes)
Data types:
Entity                   object
Code                     object
Year                      int64
Annual CO₂ emissions    float64
time                      int64
dtype: object
In [120]:
# check if the data covers only a specific year; if so, we can drop the Year column
print("\nAvailable years in dataset:")
print(sorted(co2_data['Year'].unique()))
Available years in dataset:
[np.int64(2022)]
In [121]:
# check for the is null count
co2_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Entity                197 non-null    object 
 1   Code                  197 non-null    object 
 2   Year                  197 non-null    int64  
 3   Annual CO₂ emissions  197 non-null    float64
 4   time                  197 non-null    int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 7.8+ KB

Through this exploration, we have determined that most columns are redundant for our specific goal. To prepare for the merge with our Happiness dataframe, we will simplify the dataset, keeping only:

  • Entity: This is our "Key" to match countries between both datasets.
  • Annual CO₂ emissions: This is our new independent variable ($X_2$).

We will drop the remaining Code, Year, and time columns: since all data is from 2022 and we already have the country names, they only add "noise" to our dataframe. Additionally, we rename the 'Entity' column to make the dataframe easier to understand overall.

In [122]:
# We define a list of columns to remove
cols_to_drop = ['Code', 'Year', 'time']

# We use the drop function.
# axis=1 tells pandas to remove COLUMNS.
co2_data.drop(columns=cols_to_drop, axis=1, inplace=True)

# We rename 'Entity' to 'Country' to make it easier to merge with our Happiness data
co2_data.rename(columns={'Entity': 'Country'}, inplace=True)

# Display the clean version
print("Cleaned CO2 Dataframe:")
print(co2_data.head())
Cleaned CO2 Dataframe:
       Country  Annual CO₂ emissions
0  Afghanistan            10169889.0
1      Albania             4498282.0
2      Algeria           192778560.0
3      Andorra              423408.0
4       Angola            21089004.0

Corruption Perception Index Exploration and Understanding¶

For this we will follow the exact same "Data Integrity" logic we used for the CO2 dataset. This index ranks countries by their perceived levels of public sector corruption (where a higher score usually means less corruption). Before merging, we must verify its structure.

Steps to follow and results we saw:¶

  1. We start by looking at the column headers and data types. This helps us identify our variable of interest and ensure the numbers are stored in a format we can use for regression.
  2. Just like with the CO2 data, we check the available years. If this dataset also focuses on a single recent year, we can treat it as a cross-sectional snapshot of global transparency.
In [123]:
# Check column names
print("\nColumn names:")
print(corruption_data.columns.tolist())
Column names:
['Entity', 'Code', 'Year', 'Corruption Perceptions Index', 'World region according to OWID', 'time', 'time.1']
In [124]:
# Check data types to ensure the index is a float or int
print("\nData Types:")
print(corruption_data.dtypes)
Data Types:
Entity                             object
Code                               object
Year                              float64
Corruption Perceptions Index      float64
World region according to OWID     object
time                              float64
time.1                              int64
dtype: object
In [125]:
# Check which years are available in this specific dataset
print("\nAvailable years in CPI dataset:")
print(sorted(corruption_data['Year'].unique()))
Available years in CPI dataset:
[np.float64(2022.0), np.float64(nan)]
In [126]:
# Perform a health check for null values
corruption_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181 entries, 0 to 180
Data columns (total 7 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Entity                          181 non-null    object 
 1   Code                            179 non-null    object 
 2   Year                            179 non-null    float64
 3   Corruption Perceptions Index    179 non-null    float64
 4   World region according to OWID  179 non-null    object 
 5   time                            179 non-null    float64
 6   time.1                          181 non-null    int64  
dtypes: float64(3), int64(1), object(3)
memory usage: 10.0+ KB

Based on our audit, and specifically the info() function, we can see several redundant columns and very few nulls (only two of the 181 rows are missing values in some columns). Code, World region according to OWID, time, and time.1 do not add value to our specific regression.

Additionally, we will rename Entity to Country (with the rename() function) to create a universal "Key" across all three of our dataframes. We will also shorten the long index name for readability and space, leaving it as 'Corruption_Index'.

The following cell makes use of the drop() function with axis=1 to indicate that it is columns, not rows, being dropped.

In [127]:
# We remove 'time.1' and 'time' as they are likely duplicates or internal IDs
cols_to_drop = ['Code', 'Year', 'World region according to OWID', 'time', 'time.1']

# Drop columns (axis=1 targets the column axis)
corruption_data.drop(cols_to_drop, axis=1, inplace=True)

# Rename columns for merging and readability
# We change 'Entity' to 'Country' and shorten the long Index name
corruption_data.rename(columns={
    'Entity': 'Country',
    'Corruption Perceptions Index': 'Corruption_Index'
}, inplace=True)

print("Cleaned Corruption Dataframe:")
print(corruption_data.head())
Cleaned Corruption Dataframe:
       Country  Corruption_Index
0  Afghanistan              24.0
1      Albania              36.0
2      Algeria              33.0
3       Angola              33.0
4    Argentina              38.0

Life Expectancy Data Exploration and Understanding¶

Life Expectancy is a powerhouse variable because it represents the "quality of life" and health infrastructure of a nation. Just like before, we need to strip away the "clutter" (like the extremely long column name) and keep only what is essential for the regression. We will analyze the data first to avoid removing anything that could be useful.

Steps followed and results seen:¶

  1. We use the same diagnostic functions to ensure we are looking at the correct year and that our data types are ready for math. It is from this, with the help of the .unique() function, that we see the Year column holds a single value, so dropping it is a good choice.
  2. Null revision and general data-type checks. None of the data is null, so we can drop any columns we don't need for the multivariate regression.
  3. "prune" the redundant columns and rename the main feature to something short and professional: Life_Expectancy.
In [128]:
# Check the raw column names
print("\nLife Expectancy Columns:")
print(life_exp_data.columns.tolist())

# Check for temporal consistency
print("\nAvailable years in Life Expectancy dataset:")
print(sorted(life_exp_data['Year'].unique()))
Life Expectancy Columns:
['Entity', 'Code', 'Year', 'Life expectancy - Sex: all - Age: 0 - Variant: estimates', 'time']

Available years in Life Expectancy dataset:
[np.int64(2022)]
In [129]:
# check for possible nulls
life_exp_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201 entries, 0 to 200
Data columns (total 5 columns):
 #   Column                                                    Non-Null Count  Dtype  
---  ------                                                    --------------  -----  
 0   Entity                                                    201 non-null    object 
 1   Code                                                      201 non-null    object 
 2   Year                                                      201 non-null    int64  
 3   Life expectancy - Sex: all - Age: 0 - Variant: estimates  201 non-null    float64
 4   time                                                      201 non-null    int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 8.0+ KB
In [130]:
# redundant columns to remove
cols_to_drop = ['Code', 'Year', 'time']

# .drop() to clean the dataframe (axis=1 targets columns)
life_exp_data.drop(cols_to_drop, axis=1, inplace=True)

# We change 'Entity' to 'Country' and simplify the massive life expectancy label
life_exp_data.rename(columns={
    'Entity': 'Country',
    'Life expectancy - Sex: all - Age: 0 - Variant: estimates': 'Life_Expectancy'
}, inplace=True)

print("Cleaned Life Expectancy Dataframe:")
print(life_exp_data.head())
Cleaned Life Expectancy Dataframe:
       Country  Life_Expectancy
0  Afghanistan           65.617
1      Albania           78.769
2      Algeria           76.129
3      Andorra           84.016
4       Angola           64.246
In [131]:
# rename 'Pais' to 'Country' for consistency with the original dataset
happiness_index_data_extended = happiness_index_data.copy()
happiness_index_data_extended.rename(columns={'Pais': 'Country'}, inplace=True)

# merge CO2 data
happiness_index_data_extended = happiness_index_data_extended.merge(
    co2_data,
    on='Country',
    how='left'  # Keep all happiness countries
)
print(f"   After merge: {len(happiness_index_data_extended)} rows")
print(f"   Missing CO2 data: {happiness_index_data_extended['Annual CO₂ emissions'].isna().sum()} countries")
   After merge: 141 rows
   Missing CO2 data: 2 countries
In [132]:
# Merge Corruption data
happiness_index_data_extended = happiness_index_data_extended.merge(
    corruption_data,
    on='Country',
    how='left'
)
print(f"   After merge: {len(happiness_index_data_extended)} rows")
print(f"   Missing Corruption data: {happiness_index_data_extended['Corruption_Index'].isna().sum()} countries")
   After merge: 141 rows
   Missing Corruption data: 2 countries
In [133]:
# merge the life expectancy data
happiness_index_data_extended = happiness_index_data_extended.merge(
    life_exp_data,
    on='Country',
    how='left'
)
print(f"   After merge: {len(happiness_index_data_extended)} rows")
print(f"   Missing Life Expectancy data: {happiness_index_data_extended['Life_Expectancy'].isna().sum()} countries")
   After merge: 141 rows
   Missing Life Expectancy data: 2 countries

After performing our left merges, we noticed that 2 countries are missing values for CO2, Corruption, and Life Expectancy. To fix this, we first need to identify which countries are causing the gap. We do this by filtering our dataframe for rows where these specific columns contain Null (NaN) values.

By using the .isna() function, we can pinpoint these specific outliers. This allows us to decide whether we should manually rename them in the source files to ensure a 100% match or exclude them if the data simply does not exist for those nations.

In [134]:
# identify which countries are missing CO2 data
missing_co2 = happiness_index_data_extended[happiness_index_data_extended['Annual CO₂ emissions'].isna()]
print("Countries missing CO2 data:")
print(missing_co2['Country'].tolist())

# identify which countries are missing Corruption data
missing_corruption = happiness_index_data_extended[happiness_index_data_extended['Corruption_Index'].isna()]
print("\nCountries missing Corruption data:")
print(missing_corruption['Country'].tolist())

# identify which countries are missing Life Expectancy data
missing_life = happiness_index_data_extended[happiness_index_data_extended['Life_Expectancy'].isna()]
print("\nCountries missing Life Expectancy data:")
print(missing_life['Country'].tolist())
Countries missing CO2 data:
['Hong Kong', 'Ivory Coast']

Countries missing Corruption data:
['Hong Kong', 'Ivory Coast']

Countries missing Life Expectancy data:
['Hong Kong', 'Ivory Coast']
In [135]:
# create sets for each dataset's country list
h_set = set(happiness_index_data['Pais'].unique())
co2_set = set(co2_data['Country'].unique())
c_set = set(corruption_data['Country'].unique())
l_set = set(life_exp_data['Country'].unique())

# find countries in Happiness that are MISSING in the others
missing_in_co2 = sorted(list(h_set - co2_set))
missing_in_corruption = sorted(list(h_set - c_set))
missing_in_life = sorted(list(h_set - l_set))

# print
print(f"Missing in CO2 ({len(missing_in_co2)}): {missing_in_co2}")
print(f"Missing in Corruption ({len(missing_in_corruption)}): {missing_in_corruption}")
print(f"Missing in Life Exp ({len(missing_in_life)}): {missing_in_life}")
Missing in CO2 (2): ['Hong Kong', 'Ivory Coast']
Missing in Corruption (2): ['Hong Kong', 'Ivory Coast']
Missing in Life Exp (2): ['Hong Kong', 'Ivory Coast']

As identified in our set difference analysis, the "Missing" data for Ivory Coast was due to a naming mismatch (the source datasets use the French name "Cote d'Ivoire"). However, for Hong Kong, the data is truly missing from the environmental and corruption sources.

Following the course instructions to work only with countries that have complete information for all variables, we will now standardize the names and perform a final pruning of incomplete records. This ensures our multivariate model is built on a complete-case basis.

Resolving Naming Conflicts (Standardization)¶

One of the most common hurdles in global data science is nomenclature inconsistency. Different international organizations often use different names for the same country.

The Problem we found: During our audit, we discovered that "Ivory Coast" appeared as a missing value across all new datasets. A set analysis revealed that the environmental and corruption datasets used the French name, "Cote d'Ivoire", while our primary happiness dataset used "Ivory Coast". We use the .replace() method with a name_fixes dictionary to standardize this label across all source dataframes before merging. This prevents the unnecessary loss of a valid data point.

The Aggregation Process (The "Left" Merge)¶

We build our master dataset using a series of Left Joins. Why a Left Merge? By setting how='left', we use our original Happiness/Log_GDP dataframe as the "Anchor." The code keeps all countries from our primary study and searches the external datasets for matching information based on the 'Country' column. For each country in our base list, Python looks into the CO2, Corruption, and Life Expectancy files. If it finds a match, it pulls the value in; if no match is found (even after naming fixes), it assigns a NaN.
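As a side note, pandas can flag exactly which rows failed to match during a left merge via the `indicator` parameter. Below is a minimal sketch with two hypothetical mini-dataframes (the names mirror, but are not, our real data): unmatched keys surface as `left_only` rows with `NaN` in the pulled-in column.

```python
import pandas as pd

# Hypothetical mini-frames to illustrate left-merge behavior on unmatched keys
left = pd.DataFrame({"Country": ["Finland", "Hong Kong"], "Felicidad": [7.8, 5.4]})
right = pd.DataFrame({"Country": ["Finland"], "Corruption_Index": [87.0]})

# indicator=True adds a '_merge' column flagging where each row's data came from
merged = left.merge(right, on="Country", how="left", indicator=True)
print(merged)
```

Rows tagged `left_only` are precisely the ones that would later show up in an `.isna()` audit.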

In [136]:
# standardize naming for Ivory Coast across all source dataframes
name_fixes = {"Cote d'Ivoire": "Ivory Coast"}

co2_data = co2_data.replace({"Country": name_fixes})
corruption_data = corruption_data.replace({"Country": name_fixes})
life_exp_data = life_exp_data.replace({"Country": name_fixes})

# redo the merge starting from the original Log_GDP data
df_master = happiness_index_data_extended[['Country', 'Felicidad', 'log_GDP']].copy()

# merge all three additional features
df_master = df_master.merge(co2_data[['Country', 'Annual CO₂ emissions']], on='Country', how='left')
df_master = df_master.merge(corruption_data[['Country', 'Corruption_Index']], on='Country', how='left')
df_master = df_master.merge(life_exp_data[['Country', 'Life_Expectancy']], on='Country', how='left')

Visualizing Feature Distributions¶

Before constructing our final multivariate model, we must examine the "shape" of our three new variables ($X_2, X_3, X_4$) alongside our target ($y$). Using a grid of histograms, we can assess whether these features follow a normal distribution or if they suffer from significant mathematical skewness.

Just as we discovered with raw GDP, variables like CO₂ Emissions often show an "L-shaped" distribution, where a few high-emitting industrial nations act as extreme outliers. Identifying these differences helps us understand the magnitude of the resulting $\beta$ coefficients.

Linear regression performs best when the predictors and the target are relatively symmetrical; if any of them were as asymmetrical or skewed as our annual CO2 emissions, we would have to apply transformations.

From the histograms we can see the following:

  • CO2_emissions is heavily right-skewed: most nations are clustered near zero, while a few "super-polluters" stretch the X-axis out to $1 \times 10^{10}$.
  • Corruption Index shows a more spread-out distribution, though it has a slight peak in the lower-middle range (30-40), suggesting many countries still struggle with transparency.
  • Life Expectancy is left-skewed, meaning most countries in our dataset have achieved relatively high life expectancies (70+ years), with a smaller "tail" of countries falling behind.
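These visual impressions can be double-checked numerically with scipy's `skew` function (positive values mean a right tail, negative a left tail). A small sketch on synthetic stand-in samples, not the actual merged columns:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic stand-ins (not the real data): a log-normal sample mimics the
# "L-shaped" CO2 distribution, a normal sample mimics a symmetric feature
right_skewed = rng.lognormal(mean=10, sigma=2, size=140)
symmetric = rng.normal(loc=70, scale=5, size=140)

print(f"log-normal sample skewness: {skew(right_skewed):.2f}")
print(f"normal sample skewness:     {skew(symmetric):.2f}")
```

Applied to the real columns, the same call would quantify how far each feature is from symmetry before and after any transformation.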

In [137]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# CO2 Distribution
axes[0, 0].hist(happiness_index_data_extended['Annual CO₂ emissions'], bins=25, alpha=0.7, color='lightblue')
axes[0, 0].set_xlabel('CO₂ Emissions (tons per capita)', fontsize=11)
axes[0, 0].set_ylabel('Frequency', fontsize=11)
axes[0, 0].set_title('Distribution of CO₂ Emissions')

# corruption Distribution
axes[0, 1].hist(happiness_index_data_extended['Corruption_Index'], bins=25, alpha=0.7, color='lightpink')
axes[0, 1].set_xlabel('Corruption Perceptions Index (0-100)', fontsize=11)
axes[0, 1].set_ylabel('Frequency', fontsize=11)
axes[0, 1].set_title('Distribution of Corruption Index')

# Life Expectancy Distribution
axes[1, 0].hist(happiness_index_data_extended['Life_Expectancy'], bins=25, alpha=0.7, color='#b19cd9')
axes[1, 0].set_xlabel('Life Expectancy (years)', fontsize=11)
axes[1, 0].set_ylabel('Frequency', fontsize=11)
axes[1, 0].set_title('Distribution of Life Expectancy')

# Happiness (for comparison)
axes[1, 1].hist(happiness_index_data_extended['Felicidad'], bins=25, alpha=0.7, color='#fff192')
axes[1, 1].set_xlabel('Happiness Score', fontsize=11)
axes[1, 1].set_ylabel('Frequency', fontsize=11)
axes[1, 1].set_title('Distribution of Happiness (Target)')

plt.show()
[Figure: 2x2 histogram grid of CO₂ Emissions, Corruption Index, Life Expectancy, and Happiness]

Ensuring Statistical Integrity¶

In our final step, having confirmed that Hong Kong is the record with genuinely missing data, we execute df_master.dropna(inplace=True). This is a critical decision for the statistical health of the model. Unlike Ivory Coast, data for Hong Kong was simply not present in the supplementary environmental or corruption sources provided for this specific year.

Why we must drop it: linear regression algorithms cannot perform mathematical operations on "Null" or "NaN" values. Every observation ($n$) must have a valid value for every feature ($X_1, X_2, X_3, X_4$) to calculate the coefficients, and keeping a row with missing values would break the matrix calculation (the Normal Equation). While a larger sample size is always preferable, it is better to have 140 countries with complete data than 141 countries where one entry "breaks" the mathematical logic of the model.
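A tiny numpy illustration of why NaNs are fatal: a single missing value contaminates the $X^TX$ matrix at the heart of the Normal Equation. The matrix below is a toy example, purely for demonstration:

```python
import numpy as np

# Toy design matrix with one missing value in the second row
X = np.array([[1.0, 2.0],
              [1.0, np.nan],
              [1.0, 4.0]])

# The normal equations require X'X; a single NaN poisons the whole product,
# so no coefficient involving that column can be computed
gram = X.T @ X
print(gram)
print("Any NaN in X'X?", np.isnan(gram).any())
```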

In [138]:
# drop incomplete rows (e.g., Hong Kong) to ensure a complete case basis
df_master.dropna(inplace=True)

print(f"Final observation count (n): {len(df_master)} countries.")
df_master.head()
Final observation count (n): 140 countries.
Out[138]:
Country Felicidad log_GDP Annual CO₂ emissions Corruption_Index Life_Expectancy
0 Finland 7.8210 26.328468 36337000.0 87.0 81.243
1 Denmark 7.6362 26.598435 29094320.0 90.0 81.291
2 Iceland 7.5575 23.801411 3597685.0 74.0 81.588
3 Switzerland 7.5116 27.346332 32950562.0 82.0 83.200
4 Netherlands 7.4149 27.540949 127503144.0 80.0 81.912

Transformations to the CO₂ Scale¶

From the data-distribution graphs we can see that the CO2 emissions data is heavily right-skewed, with most countries at the bottom and a few "giant" emitters stretching the scale (very similar to what happened with GDP).

In linear regression, when a variable is that skewed, the model struggles to find a pattern. Transforming it into log_CO2 (just as we did with GDP) is the natural way to test whether environmental impact actually affects happiness: the logarithm "compresses" the extreme outliers, letting us see whether a normalized environmental variable provides better explanatory power.

If we are confident that our dataset contains only positive values (no zeros), we can proceed with a standard natural logarithm. However, we should always perform a quick "safety check" first to avoid producing an error, so we inspect the column's minimum value to rule out any potential zeros.

In [139]:
min_co2 = df_master['Annual CO₂ emissions'].min()
print(f"The lowest emission value in our data is: {min_co2}")
The lowest emission value in our data is: 531671.0

With the lowest annual-emissions value being over 500 thousand, we can safely apply the log transformation (without having to resort to log1p). Unlike socio-economic variables that might occasionally hit a "zero" baseline, industrial emissions, even in developing nations, remain strictly positive.
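For reference, this is the difference the safety check guards against: `np.log` maps zero to negative infinity, while `np.log1p` (which computes $\log(1+x)$) handles it gracefully and is practically identical to `np.log` for values as large as ours. A toy demonstration, not our actual column:

```python
import numpy as np

values = np.array([0.0, 1.0, 531671.0])

# log(0) yields -inf, which would break the regression matrix algebra
with np.errstate(divide="ignore"):
    print("np.log:  ", np.log(values))

# log1p shifts by 1, so zero maps cleanly to 0.0; for large x the
# difference from log(x) is negligible
print("np.log1p:", np.log1p(values))
```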

We will now visualize the distribution of the transformed data. The following graphs show the before and after of the logarithmic transformation applied to the column: the skew is removed, which we hope will improve our model's accuracy. We used np.log() to transform the column, plotted with matplotlib, and placed the two histograms on a single figure.

In [140]:
# apply  log transformation
df_master['log_CO2'] = np.log(df_master['Annual CO₂ emissions'])

# helps visualize the Before vs After
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Original
ax1.hist(df_master['Annual CO₂ emissions'], bins=25, color='lightblue')
ax1.set_title('Original CO₂ (Heavy Skew)')

# Transformed
ax2.hist(df_master['log_CO2'], bins=25, color='skyblue')
ax2.set_title('Log-Transformed CO₂ (Normalized)')

plt.show()
[Figure: side-by-side histograms of CO₂ emissions before and after the log transformation]
In [141]:
# drop the annual co2 emissions
df_master.drop('Annual CO₂ emissions', axis=1, inplace=True)

df_master.head()
Out[141]:
Country Felicidad log_GDP Corruption_Index Life_Expectancy log_CO2
0 Finland 7.8210 26.328468 87.0 81.243 17.408347
1 Denmark 7.6362 26.598435 90.0 81.291 17.186054
2 Iceland 7.5575 23.801411 74.0 81.588 15.095801
3 Switzerland 7.5116 27.346332 82.0 83.200 17.310519
4 Netherlands 7.4149 27.540949 80.0 81.912 18.663652

Multivariate Linear Regression Implementation¶

With our extended dataset ready, we now aim to model happiness ($Y$) as a function of four predictors:

  • Log-GDP ($X_1$)
  • CO2 Emissions ($X_2$)
  • Corruption ($X_3$)
  • Life Expectancy ($X_4$).

The model follows the equation: $$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 + \epsilon$$

Before modeling, we must separate our data to ensure the model can generalize to new observations. We use an 80/20 split: 80% to train the model and 20% to validate its predictive power.

In [142]:
from sklearn.model_selection import train_test_split

# We split the 'df_master' created in the previous step
train, test = train_test_split(df_master, train_size=0.8, random_state=42)

# print the numbers for general curiosity and reinforcement
print(f"Training set size: {train.shape[0]} countries")
print(f"Validation set size: {test.shape[0]} countries")
Training set size: 112 countries
Validation set size: 28 countries
In [143]:
train.head()
Out[143]:
Country Felicidad log_GDP Corruption_Index Life_Expectancy log_CO2
16 United Kingdom 6.9425 28.645128 73.0 81.074 19.555683
18 Belgium 6.8050 26.980314 73.0 81.159 18.303191
10 Austria 7.1630 26.794599 71.0 81.296 17.933802
112 Uganda 4.6026 24.350280 26.0 67.675 15.628420
102 Gabon 4.9583 23.452218 29.0 67.713 15.570477

We utilize the Ordinary Least Squares (OLS) method from the statsmodels library. This toolkit contains all the specialized math routines needed to calculate the "best-fit" line in a multidimensional space.

Key Steps

  1. Feature selection ($X$): we isolate our predictors by dropping the non-numeric Country column and our target Felicidad from the training dataset using the drop() function.
  2. Intercept: we use add_constant(X) to ensure the model includes a baseline intercept ($\beta_0$), preventing the line from being forced through the origin.
  3. The model is then defined with the OLS() function, which takes the isolated target variable (y_train) and the constant-augmented feature matrix as arguments.
  4. The fit() function executes the optimization algorithm to determine the mathematical weight of each feature, saving these in the results variable.

To see the obtained results, we use the summary() function to print them to the console.

In [148]:
import statsmodels.api as sm

# drop the target column and the non-numeric Country column from the training data
X = train.drop(['Felicidad', 'Country'], axis=1)

# define our target variable
y_train = train['Felicidad']

# define the kind of model (OLS)
model = sm.OLS(y_train, sm.add_constant(X))

# fit the model
results = model.fit()

print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              Felicidad   R-squared:                       0.647
Model:                            OLS   Adj. R-squared:                  0.634
Method:                 Least Squares   F-statistic:                     49.09
Date:                Mon, 26 Jan 2026   Prob (F-statistic):           2.19e-23
Time:                        19:56:53   Log-Likelihood:                -110.10
No. Observations:                 112   AIC:                             230.2
Df Residuals:                     107   BIC:                             243.8
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               -3.4957      1.320     -2.649      0.009      -6.111      -0.880
log_GDP              0.2207      0.111      1.996      0.048       0.001       0.440
Corruption_Index     0.0119      0.006      2.137      0.035       0.001       0.023
Life_Expectancy      0.0899      0.013      6.801      0.000       0.064       0.116
log_CO2             -0.2070      0.103     -2.016      0.046      -0.411      -0.003
==============================================================================
Omnibus:                       20.399   Durbin-Watson:                   2.032
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               25.259
Skew:                          -1.015   Prob(JB):                     3.27e-06
Kurtosis:                       4.138   Cond. No.                     1.95e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.95e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Results¶

Once the model is built, we need to know whether the results are meaningful or just random noise. The summary above reports many aspects of the model, from the number of observations to the more relevant statistics discussed next.

1. R-Squared ($R^2$) The model's explanatory power jumped from the 22% we had in the simple linear regression to 64.7%. With our multivariate regression we can say that by adding Health, Environment, and Transparency we are now capturing the majority of the "behavior" of global happiness: 64.7% of the variance in the happiness data is explained by the combination of our features (63.4% after adjusting for the number of predictors).

2. F-statistic The F-statistic is a ratio that compares the variance the model explains to the variance it cannot explain. It tells us how much better our multivariate model (Log-GDP, Log_CO2, Corruption, and Life Expectancy) is at predicting happiness than a model with zero variables that simply guesses the average. The larger the F-statistic, the more likely it is that our group of variables has a real effect on happiness. Our value of 49.09 is a strong indicator of model strength, but the p-value provides the definitive confirmation.

3. P-value This is Prob(F-statistic), and it answers the question of how much "luck" was involved. With our value of 2.19e-23 (nearly zero), the probability that these results occurred purely by random chance is essentially nil; a "fluke" is all but impossible.

Using a significance threshold of 0.05, we can now see a much more balanced contribution from all our predictors:

  • Life Expectancy (p = 0.000): This remains our strongest "MVP" variable. It is the most reliable predictor of happiness in our dataset, showing that physical health is the primary pillar of well-being.
  • Corruption Index (p = 0.035): By being below 0.05, this variable is statistically significant. It confirms that institutional trust and transparency are critical for a happy citizenry.
  • log_GDP (p = 0.048): Previously, GDP was "dead weight" in the multivariate model. However, with the log transformation, it has crossed the finish line into significance. This tells us that wealth does impact happiness, but in a non-linear way.
  • log_CO2 (p = 0.046): This is the most interesting result. By transforming the CO2 data, we found a significant relationship. The negative coefficient (-0.2070) indicates that as carbon footprints increase (often a proxy for pollution or rapid, unregulated industrialization), happiness tends to decrease, which makes intuitive sense.

Manual Model Verification¶

After building our optimized multivariate model, we perform a manual "under the hood" check. This code serves as a mathematical audit to verify that the automated results from our OLS summary are correct and to calculate exactly how much "Signal" our model generates compared to random "Noise".

The code implements the Decomposition of Variance, breaking down the total behavior of Happiness into two parts. in the following steps:

  1. Predicting and Averaging (yhat & ybar): We generate the model's "guesses" and compare them to the simple average happiness of all countries.
  2. The Explained Variation (ESS & EMS): We calculate the Explained Sum of Squares (ESS) to see how much better our model is than a simple average. We then divide this by our number of variables ($m=4$) to get the Explained Mean Square (EMS), which is the "Strength per Variable".
  3. The Error (RSS & RMS): We calculate the Residual Sum of Squares (RSS), which represents the total error or "noise" the model couldn't capture. Dividing this by the remaining degrees of freedom ($n - m - 1$) gives us the Residual Mean Square (RMS)—the "Average Noise".
  4. The Final Ratio (F & p-value): The F-Statistic is simply the ratio of our Signal ($EMS$) divided by our Noise ($RMS$).

Analysis of Results: our manual calculations yielded specific values that tell a clear story about our current model:

  • RSS = 46.83: This is our total "Unexplained Error." While this number seems large, it is significantly lower than our previous models, indicating that our features (Log-GDP, log-CO2, Corruption, and Life Expectancy) are getting much closer to the truth.
  • F = 49.09: This is our Signal-to-Noise Ratio. It tells us that the patterns found by our model are 49 times stronger than the random noise in the data. In statistics, any F-value significantly higher than 1 is good; 49 is exceptionally strong.
  • p-value = 2.19e-23: This is our final "Reality Check." Because this number is virtually zero, we can be confident that our model's success is not a "fluke" or a result of lucky random numbers.
In [145]:
import scipy.stats as st
import numpy as np

# get the predicted y
yhat = results.predict(sm.add_constant(X))

# get the mean y
ybar = np.mean(y_train)

# get the ess
ESS = sum((yhat - ybar)**2)

# get the m
m = X.shape[1]

# get the EMS
EMS = ESS / m

# RSS calculation with the previous values defines
RSS = sum((y_train - yhat)**2)

# get the number of observations
n = X.shape[0]

# RMS and the F statistic
RMS = RSS / (n - m - 1)
F = EMS / RMS

# Finaly get the p-value
pval = st.f.sf(F, m, n - m - 1)

#print
print("RSS =", RSS)
print("F =", F)
print("p-value =", pval)
RSS = 46.83382692162639
F = 49.086006365049826
p-value = 2.1892007201308535e-23

Reduced Model¶

To evaluate the specific impact of an individual variable, we implement a Reduced Model. By using the .drop() function to remove a single predictor and re-fitting the OLS model, we can measure the change in the Residual Sum of Squares (RSS). Since annual CO2 emissions was my "test" variable, chosen mostly out of personal curiosity, and its individual p-value (0.046) was only borderline significant alongside the other features, I decided to drop this column. This allows us to perform a Partial F-Test, determining whether the excluded variable provided unique information that cannot be explained by the other features in the set.

In [147]:
# to test its significance we drop the log_CO2 column
XNew = X.drop('log_CO2', axis=1)

# create the new model and feed it the parameters
modelNew = sm.OLS(y_train, sm.add_constant(XNew))

# fit the new model
resultsNew = modelNew.fit()

# get the summary of the results
print(resultsNew.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              Felicidad   R-squared:                       0.634
Model:                            OLS   Adj. R-squared:                  0.624
Method:                 Least Squares   F-statistic:                     62.32
Date:                Mon, 26 Jan 2026   Prob (F-statistic):           1.83e-23
Time:                        19:52:59   Log-Likelihood:                -112.18
No. Observations:                 112   AIC:                             232.4
Df Residuals:                     108   BIC:                             243.2
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               -1.7314      1.002     -1.729      0.087      -3.717       0.254
log_GDP              0.0122      0.040      0.307      0.760      -0.066       0.091
Corruption_Index     0.0180      0.005      3.804      0.000       0.009       0.027
Life_Expectancy      0.0848      0.013      6.446      0.000       0.059       0.111
==============================================================================
Omnibus:                       18.961   Durbin-Watson:                   2.080
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               22.722
Skew:                          -0.978   Prob(JB):                     1.16e-05
Kurtosis:                       4.020   Cond. No.                     1.43e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.43e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Comparison of the Two Candidate Multivariable Regression Models¶

By purposely dropping the CO2 variable, we performed a "Sensitivity Analysis" to see if it was actually helping us predict happiness. The results are super interesting when you look at the numbers side-by-side.

  1. The $R^2$ stayed almost exactly the same: it went from $R^2 = 0.647$ to $R^2 = 0.634$. What this tells us is that **deleting CO2 only changed our accuracy by 0.013**. This is strong evidence that CO2 was close to "dead weight": it wasn't telling the model much about happiness that GDP or Life Expectancy hadn't already covered.
  2. The F-statistic jumped up, from $F = 49.09$ to $F = 62.32$. In statistics this indicates a "tighter", more efficient model: by removing the noisier variable (CO2), the smaller group of variables left is actually statistically stronger per predictor than the bigger group we had before.
  3. The "P > |t|" (individual p-values) looking at the new table, Corruption_Index and Life_Expectancy are still the absolute "MVPs" with p-values of 0.000.
    • A critical observation in our refined model is the jump of the log_GDP p-value to 0.760 after removing the CO₂ emissions variable. This high p-value indicates that GDP lacks 'unique explanatory power' when Life Expectancy and Corruption are present. This supports the theory that wealth is a 'distal' cause of happiness. Money primarily matters because it allows a nation to invest in healthcare and institutional transparency.
  4. It is also worth noting that, despite all these changes, the Prob(F-statistic) remained essentially the same.

Final Comparison of the Models¶

In this final section, we contrast our initial Simple Linear Regression with the optimized Multivariate Model. This comparison allows us to evaluate whether the increased complexity of the model translates into a more accurate and meaningful understanding of global happiness.

| Metric | Simple Regression (GDP only) | Multivariate Model |
|---|---|---|
| $R^2$ | 0.222 | 0.647 (64.7%) |
| Typical error (Happiness points) | 0.97 | 0.67 |
| F-statistic | 39.58 | 49.09 |
| p-value | Significant ($p < 0.05$) | Extremely significant ($p < 0.01$) |
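If the "Happiness points" figures are read as typical prediction errors (an RMSE-style measure — an assumption on my part, since the metric is not named above), they can be computed with a small helper:

```python
import numpy as np

# Root-mean-squared error: the typical size of a prediction miss,
# expressed in the units of the target variable (Happiness points).
def rmse(y_true, y_pred):
    resid = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(resid ** 2)))

# Toy check: predictions off by exactly 0.5 everywhere give an RMSE of 0.5.
y = np.array([7.0, 6.0, 5.0])
print(rmse(y, y + 0.5))  # → 0.5
```

In practice `y_pred` would come from each fitted model's `.predict()`, giving one error figure per model for a like-for-like comparison.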

Were the Additional Variables Worth It?¶

The inclusion of additional variables significantly improved the model's capacity to explain national well-being. Moving from a single-variable to a multi-variable approach nearly tripled our $R^2$, jumping from 22% to roughly 64%. This confirms that happiness is a multifaceted phenomenon that cannot be captured by economic output alone.

What about GDP? In the univariate model, Log-GDP was a strong, significant predictor. However, in the multivariate model, its individual significance vanished ($p = 0.760$). This suggests that wealth acts as a "proxy": money matters primarily because it facilitates access to healthcare (Life Expectancy) and honest institutions (Corruption Index), both of which remained highly significant ($p = 0.000$).

Limitations¶

Despite the model's success, several constraints must be acknowledged:

  • Missing Data: To maintain statistical integrity, we had to drop observations like Hong Kong due to missing values in the environmental and corruption datasets. This slightly reduces the global representativeness of our findings.
  • Variable Selection: While we added three relevant variables, others such as Social Support, Personal Freedom, and Cultural Factors were not included. Their absence accounts for the remaining ~36% of unexplained variance.
  • Linearity Assumption: The model assumes a linear relationship. As seen in the initial GDP plots, some variables exhibit exponential or skewed behavior that might be better captured by non-linear models.
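As a quick illustration of the last point, a log transform can tame the right skew of a GDP-like variable, which is exactly why log_GDP behaves better than raw GDP in a linear model. The lognormal sample below is simulated, not the real GDP column:

```python
import numpy as np

rng = np.random.default_rng(2)
gdp = rng.lognormal(mean=24, sigma=1.5, size=141)  # heavily right-skewed

# Sample skewness: mean of standardized values cubed (0 = symmetric).
def skewness(x):
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

print(f"skew raw: {skewness(gdp):.2f}")            # large positive skew
print(f"skew log: {skewness(np.log(gdp)):.2f}")    # near zero after transform
```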

Future Work¶

To build upon this, future studies should consider:

  • Non-Linear Modeling: Implementing polynomial regression or logarithmic transformations for features like CO2 emissions to see if accuracy improves.
  • Regional Categorization: Adding "dummy variables" for geographic regions (e.g., Scandinavia vs. Sub-Saharan Africa) to account for regional happiness baselines.
  • Environmental factors: investigating which, if any, genuinely relate to happiness.
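A minimal sketch of the proposed regional dummy variables using `pd.get_dummies`; the `Region` labels here are hypothetical placeholders and would need to come from an external country-to-region mapping:

```python
import pandas as pd

# Illustrative frame: countries with an assumed region label.
df = pd.DataFrame({
    "Pais": ["Finland", "Denmark", "Nigeria", "Kenya"],
    "Region": ["Scandinavia", "Scandinavia",
               "Sub-Saharan Africa", "Sub-Saharan Africa"],
})

# drop_first=True avoids the dummy-variable trap (perfect collinearity
# between the dummies and the regression intercept).
dummies = pd.get_dummies(df["Region"], prefix="Region", drop_first=True)
encoded = pd.concat([df, dummies], axis=1)
print(encoded)
```

The resulting 0/1 columns can be appended to the feature matrix so each region gets its own happiness baseline.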

Conclusion¶

At the outset of this study, we hypothesized that "Happiness is complex and multifaceted" and that GDP alone would be insufficient. Our results have mathematically validated this expectation. While economic stability provides the necessary groundwork, physical health and institutional integrity are the true pillars of a happy society. By expanding our model, we have moved from a simple economic observation to a more profound socio-political inference: a nation's path to well-being lies not just in its production, but in the transparency of its government and the longevity of its people.

References¶

  1. https://www.britannica.com/topic/Why-Is-Finland-the-Happiest-Country-in-the-World
  2. https://www.co2meter.com/en-mx/blogs/news/dangers-of-co2-what-you-need-to-know?srsltid=AfmBOoprWZbBS3bifvuXEzEsmyweYOcJPPoyYJo-UxMt8zqMAxQuKOIO
  3. https://baselgovernance.org/blog/culture-and-corruption-complex-relationship
  4. https://www.sciencedirect.com/science/article/abs/pii/S0749379718316817
  5. https://ourworldindata.org/co2-emissions
  6. https://ourworldindata.org/corruption
  7. https://ourworldindata.org/life-expectancy