In [21]:
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [22]:
# path to read from
# Please adjust the path if your file is in a different subfolder within your Google Drive.
path = '/content/drive/MyDrive/Colab Notebooks/AI/classified_deforestation.csv'
# read csv
df = pd.read_csv(path)

Overview and Introduction¶

Deforestation remains one of the most critical environmental crises of our time, directly impacting global carbon emissions, biodiversity loss, and climate change. However, deforestation does not happen in a vacuum; it is driven by a complex web of socioeconomic, demographic, and geographic factors. The purpose of this project is to build a predictive analytical tool capable of classifying a country's deforestation risk based on its socioeconomic profile. By classifying countries into Low Risk (Class 0) and High Risk (Class 1) using probability thresholds established in prior literature, we can help policymakers better allocate resources, monitor vulnerable regions, and understand the hidden drivers behind forest cover loss.

The Dataset

This notebook builds upon a previously cleaned and engineered dataset comprising socioeconomic and environmental metrics for 100+ countries. The dataset comes from the earlier project "Deforestation Data Insights" (found in the repositories of this GitHub account -> rebeca-bc). Because the data has already undergone Exploratory Data Analysis (EDA), missing-value imputation, and feature engineering in a previous phase, this notebook focuses entirely on advanced predictive modeling and algorithmic evaluation.

New Classifiers

While previous iterations of this project established a strong baseline using simple Logistic Classifiers, LDAs, and Decision Trees, this phase scales up the complexity by introducing Ensemble Models, Support Vector Machines, and basic Neural Networks.

To begin, it's useful to run a quick check of the shape, overall structure, and null counts, just to get familiar with the data again.

In [23]:
# sanity check
print("Dataset Shape:", df.shape)
Dataset Shape: (103, 21)
In [24]:
# check for nulls (only the highest count; if the max is 0, no column has missing values)
df.isnull().sum().max()
Out[24]:
0

Data splitting¶

Now, as is critical for any project, the target variable and the features have to be defined and saved in their respective variables (X = features and y = target).

Our y (target variable) will be the Deforestation_Critical column, a binary flag based on a threshold of 0.501% deforestation of a country's land area. This column is dropped from the main df so that X (the features) can be set to the remaining columns.
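As a quick illustration (the target column already exists in the CSV, so this is only a hypothetical sketch with made-up rates, and the strict > comparison is an assumption about how the original flag was derived), a binary target like Deforestation_Critical can be built from a continuous rate like so:

```python
import pandas as pd

# Toy data: hypothetical annual deforestation as % of land area
demo = pd.DataFrame({'deforestation_rate_pct': [0.10, 0.75, 0.501, 1.20]})

# Flag countries strictly above the 0.501% literature threshold as critical
demo['Deforestation_Critical'] = (demo['deforestation_rate_pct'] > 0.501).astype(int)
print(demo['Deforestation_Critical'].tolist())  # [0, 1, 0, 1]
```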

In [25]:
# Separate the target column so the model can't see it among the features
y = df['Deforestation_Critical']
# Define features (X) as all columns except the target 'Deforestation_Critical'
X = df.drop('Deforestation_Critical', axis=1)
# check for data balance
y.value_counts()
Out[25]:
Deforestation_Critical
0    72
1    31
Name: count, dtype: int64

The value_counts() output shows a class imbalance we are already familiar with from the previous project: 72 countries are non-critical (0) and 31 are critical (1), roughly a 70/30 split. While a 70/30 train-test split is a common choice, an 80/20 ratio (test_size=0.2) is deliberately chosen for this analysis because of the small size of the dataset. In small-sample scenarios, particularly those with class imbalance, a larger training volume is critical to ensure the learning algorithm sees enough examples of the minority class (High Risk) to converge accurately. A purely random split could also inadvertently place most of the minority-class samples into the test set, leaving the model with few critical examples to learn from during training.

Because of the class distribution, the split is performed using stratification (stratify=y). This ensures both subsets preserve the exact proportion of Low/High Risk classes, preventing majority-class bias. We then display the shapes of the resulting sets to confirm the splits are correct.

In [26]:
from sklearn.model_selection import train_test_split
# Perform 80/20 train-test split with stratification and random state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Display the shapes of the resulting datasets to confirm the split
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")
Shape of X_train: (82, 20)
Shape of X_test: (21, 20)
Shape of y_train: (82,)
Shape of y_test: (21,)
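As a sanity check, a minimal sketch (using synthetic labels with the same 72/31 balance and a dummy feature; the real X and y behave identically) shows that stratify=y keeps the class ratio nearly identical in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y_demo = np.array([0] * 72 + [1] * 31)          # same 72/31 balance as the dataset
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature
_, _, y_tr_d, y_te_d = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
# Class counts stay proportional (~70/30) in both splits
print(np.bincount(y_tr_d), np.bincount(y_te_d))
```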

Ensemble Models¶

Random Forest¶

A Random Forest generates B bootstrap subsets, each considering m features out of the total p, and builds decision trees that are independent of one another. For this model the most important step is hyperparameter tuning: in the case of a random forest, hyperparameters include the number of decision trees in the forest and the number of features considered by each tree when splitting a node.

To find good values for these hyperparameters, the industry standard is GridSearchCV. This tool automatically tests dozens of hyperparameter combinations using k-fold cross-validation, where the training data is divided into k subsets and evaluated iteratively.

Roadmap

  1. Create a "grid": to begin, the param_grid dictionary will contain the hyperparameters and their candidate values so all combinations can be tested. There is no universal recommendation, but the literature does consistently agree on which parameters matter most and which direction to push them for small datasets.
    • Multiple grid search studies find n_estimators=50 to 300 as the practical range [2]
    • Commonly tested values in classification grid searches use max_depth: [3, 5, 7, 10], with shallow depths helping prevent memorization of small training sets; hence we will use 3, 5, and None (to check the default unlimited-depth behavior of RF) [3]
    • Larger min_samples_leaf values reduce overfitting by requiring more samples at each leaf, but because of the small size of our dataset the values will stay small; hence we test [2, 4, 6]
    • max_features will use two options: sqrt (the square root of the total number of features) is the standard recommendation for classification, with log2 as an alternative that further limits variance [3]
In [27]:
grid = {
    'n_estimators': [50, 100, 150, 200, 250, 300],
    'max_depth': [3, 5, None],
    'min_samples_leaf': [2, 4, 6],
    'max_features': ['sqrt', 'log2']
}
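As a quick sanity check on the search budget (not part of the original pipeline), the number of fits GridSearchCV will perform can be counted directly from the grid: every combination is refit once per fold.

```python
from itertools import product

grid = {
    'n_estimators': [50, 100, 150, 200, 250, 300],
    'max_depth': [3, 5, None],
    'min_samples_leaf': [2, 4, 6],
    'max_features': ['sqrt', 'log2'],
}
n_combos = len(list(product(*grid.values())))
print(n_combos)      # 6 * 3 * 3 * 2 = 108 combinations
print(n_combos * 5)  # with cv=5 -> 540 model fits
```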
  2. Set up the Grid Search: here we use the GridSearchCV() function, which creates its own "validation" sets by splitting the training data into (in this case) 5 folds via the cv=5 parameter. Each hyperparameter combination is trained and graded with this cross-validation, and the search then returns the best-scoring combination. Grid search therefore tightens the model's architecture so it fits the unique curves of this specific Deforestation dataset.
    • Feed it a base model
    • For the parameters, pass the grid just created
    • cv = 5 means 5-Fold Cross-Validation
    • return_train_score=True also records the Training Score (the grade the model got on the folds it studied) alongside the validation score from the internal split. Comparing the two lets you detect overfitting during the validation phase
    • scoring='f1': the classes are imbalanced, and the F1 score balances precision and recall instead of rewarding majority-class guessing, so it's a more realistic measure
In [28]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# run gridsearch
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    grid,
    cv=5,
    scoring='f1',
    return_train_score=True
)
rf_grid.fit(X_train, y_train)
Out[28]:
GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42),
             param_grid={'max_depth': [3, 5, None],
                         'max_features': ['sqrt', 'log2'],
                         'min_samples_leaf': [2, 4, 6],
                         'n_estimators': [50, 100, 150, 200, 250, 300]},
             return_train_score=True, scoring='f1')
RandomForestClassifier(max_depth=3, min_samples_leaf=2, random_state=42)

To ensure our optimal model is truly learning the underlying patterns of deforestation rather than simply memorizing the training data, we must extract and analyze the full cv_results_ from our Grid Search. The following table aggregates the performance metrics of all tested hyperparameter combinations across the 5 cross-validation folds. By comparing the Training F1-Score (mean_train_score) against the Validation F1-Score (mean_test_score), we can monitor for variance and overfitting. Furthermore, the Standard Deviation (std_test_score) allows us to verify the stability of the model across different data distributions, ensuring that our final model generalizes to real data.

In [29]:
# see combinations and their scores; first transform to a df
results = pd.DataFrame(rf_grid.cv_results_)

table = results[[
    'param_n_estimators',
    'param_max_depth',
    'param_min_samples_leaf',
    'param_max_features',
    'mean_train_score',      # score on train portion
    'mean_test_score',       # score on validation portion
    'std_test_score'         # stable?
]].sort_values('mean_test_score', ascending=False)

print(table.to_string())
     param_n_estimators param_max_depth  param_min_samples_leaf param_max_features  mean_train_score  mean_test_score  std_test_score
1                   100               3                       2               sqrt          0.900585         0.626667        0.180308
19                  100               3                       2               log2          0.900585         0.626667        0.180308
20                  150               3                       2               log2          0.881164         0.583333        0.149071
2                   150               3                       2               sqrt          0.881164         0.583333        0.149071
38                  150               5                       2               sqrt          0.968399         0.583333        0.149071
37                  100               5                       2               sqrt          0.974359         0.583333        0.149071
91                  100            None                       2               log2          0.979487         0.583333        0.149071
92                  150            None                       2               log2          0.979217         0.583333        0.149071
73                  100            None                       2               sqrt          0.979487         0.583333        0.149071
74                  150            None                       2               sqrt          0.979217         0.583333        0.149071
55                  100               5                       2               log2          0.974359         0.583333        0.149071
56                  150               5                       2               log2          0.968399         0.583333        0.149071
0                    50               3                       2               sqrt          0.894552         0.580000        0.157198
18                   50               3                       2               log2          0.894552         0.580000        0.157198
41                  300               5                       2               sqrt          0.984615         0.566667        0.133333
48                   50               5                       6               sqrt          0.814335         0.566667        0.133333
95                  300            None                       2               log2          0.994872         0.566667        0.133333
102                  50            None                       6               log2          0.814335         0.566667        0.133333
84                   50            None                       6               sqrt          0.814335         0.566667        0.133333
77                  300            None                       2               sqrt          0.994872         0.566667        0.133333
66                   50               5                       6               log2          0.814335         0.566667        0.133333
24                   50               3                       4               log2          0.842151         0.566667        0.133333
30                   50               3                       6               log2          0.814335         0.566667        0.133333
59                  300               5                       2               log2          0.984615         0.566667        0.133333
12                   50               3                       6               sqrt          0.814335         0.566667        0.133333
6                    50               3                       4               sqrt          0.842151         0.566667        0.133333
96                   50            None                       4               log2          0.875006         0.553333        0.125786
42                   50               5                       4               sqrt          0.875006         0.553333        0.125786
60                   50               5                       4               log2          0.875006         0.553333        0.125786
78                   50            None                       4               sqrt          0.875006         0.553333        0.125786
9                   200               3                       4               sqrt          0.838495         0.550000        0.145297
8                   150               3                       4               sqrt          0.837315         0.550000        0.145297
7                   100               3                       4               sqrt          0.851860         0.550000        0.145297
5                   300               3                       2               sqrt          0.881164         0.550000        0.145297
4                   250               3                       2               sqrt          0.887513         0.550000        0.145297
3                   200               3                       2               sqrt          0.887886         0.550000        0.145297
11                  300               3                       4               sqrt          0.833390         0.550000        0.145297
10                  250               3                       4               sqrt          0.838195         0.550000        0.145297
22                  250               3                       2               log2          0.887513         0.550000        0.145297
16                  250               3                       6               sqrt          0.792695         0.550000        0.145297
23                  300               3                       2               log2          0.881164         0.550000        0.145297
15                  200               3                       6               sqrt          0.797457         0.550000        0.145297
17                  300               3                       6               sqrt          0.800650         0.550000        0.145297
14                  150               3                       6               sqrt          0.807316         0.550000        0.145297
106                 250            None                       6               log2          0.799362         0.550000        0.145297
105                 200            None                       6               log2          0.803806         0.550000        0.145297
107                 300            None                       6               log2          0.800650         0.550000        0.145297
104                 150            None                       6               log2          0.807316         0.550000        0.145297
99                  200            None                       4               log2          0.864195         0.550000        0.145297
34                  250               3                       6               log2          0.792695         0.550000        0.145297
35                  300               3                       6               log2          0.800650         0.550000        0.145297
33                  200               3                       6               log2          0.797457         0.550000        0.145297
32                  150               3                       6               log2          0.807316         0.550000        0.145297
29                  300               3                       4               log2          0.833390         0.550000        0.145297
28                  250               3                       4               log2          0.838195         0.550000        0.145297
21                  200               3                       2               log2          0.887886         0.550000        0.145297
26                  150               3                       4               log2          0.837315         0.550000        0.145297
25                  100               3                       4               log2          0.851860         0.550000        0.145297
27                  200               3                       4               log2          0.838495         0.550000        0.145297
43                  100               5                       4               sqrt          0.875838         0.550000        0.145297
71                  300               5                       6               log2          0.800650         0.550000        0.145297
70                  250               5                       6               log2          0.799362         0.550000        0.145297
68                  150               5                       6               log2          0.807316         0.550000        0.145297
69                  200               5                       6               log2          0.803806         0.550000        0.145297
63                  200               5                       4               log2          0.864195         0.550000        0.145297
64                  250               5                       4               log2          0.864195         0.550000        0.145297
61                  100               5                       4               log2          0.875838         0.550000        0.145297
89                  300            None                       6               sqrt          0.800650         0.550000        0.145297
53                  300               5                       6               sqrt          0.800650         0.550000        0.145297
52                  250               5                       6               sqrt          0.799362         0.550000        0.145297
50                  150               5                       6               sqrt          0.807316         0.550000        0.145297
65                  300               5                       4               log2          0.864195         0.550000        0.145297
51                  200               5                       6               sqrt          0.803806         0.550000        0.145297
45                  200               5                       4               sqrt          0.864195         0.550000        0.145297
47                  300               5                       4               sqrt          0.864195         0.550000        0.145297
46                  250               5                       4               sqrt          0.864195         0.550000        0.145297
83                  300            None                       4               sqrt          0.864195         0.550000        0.145297
88                  250            None                       6               sqrt          0.799362         0.550000        0.145297
87                  200            None                       6               sqrt          0.803806         0.550000        0.145297
86                  150            None                       6               sqrt          0.807316         0.550000        0.145297
79                  100            None                       4               sqrt          0.875838         0.550000        0.145297
101                 300            None                       4               log2          0.864195         0.550000        0.145297
100                 250            None                       4               log2          0.864195         0.550000        0.145297
97                  100            None                       4               log2          0.875838         0.550000        0.145297
81                  200            None                       4               sqrt          0.864195         0.550000        0.145297
82                  250            None                       4               sqrt          0.864195         0.550000        0.145297
98                  150            None                       4               log2          0.857528         0.533333        0.124722
31                  100               3                       6               log2          0.821862         0.533333        0.124722
13                  100               3                       6               sqrt          0.821862         0.533333        0.124722
40                  250               5                       2               sqrt          0.979487         0.533333        0.124722
39                  200               5                       2               sqrt          0.973527         0.533333        0.124722
44                  150               5                       4               sqrt          0.857528         0.533333        0.124722
93                  200            None                       2               log2          0.989744         0.533333        0.124722
85                  100            None                       6               sqrt          0.839279         0.533333        0.124722
62                  150               5                       4               log2          0.857528         0.533333        0.124722
75                  200            None                       2               sqrt          0.989744         0.533333        0.124722
49                  100               5                       6               sqrt          0.839279         0.533333        0.124722
57                  200               5                       2               log2          0.973527         0.533333        0.124722
58                  250               5                       2               log2          0.979487         0.533333        0.124722
67                  100               5                       6               log2          0.839279         0.533333        0.124722
103                 100            None                       6               log2          0.839279         0.533333        0.124722
94                  250            None                       2               log2          0.989744         0.533333        0.124722
76                  250            None                       2               sqrt          0.989744         0.533333        0.124722
80                  150            None                       4               sqrt          0.857528         0.533333        0.124722
36                   50               5                       2               sqrt          0.989744         0.529091        0.112597
54                   50               5                       2               log2          0.989744         0.529091        0.112597
72                   50            None                       2               sqrt          0.989744         0.529091        0.112597
90                   50            None                       2               log2          0.989744         0.529091        0.112597

Analysis

  1. The validation scores across the top configurations remain tightly clustered (ranging from roughly 0.53 to 0.63). While the standard deviations (~0.11 to ~0.18) highlight the normal volatility of evaluating on extremely small validation folds, the consistency of the top-ranking parameters shows that the model is systematically finding the same underlying decision boundaries, supporting the reliability of our cross-validation strategy.
  2. As expected with a small dataset (103 total instances), highly complex models aggressively overfit the training data. For example, configurations allowing unlimited tree depth (max_depth=None) achieved almost perfect Training F1-Scores (close to 0.99) but suffered poor Validation F1-Scores (close to 0.56). This indicates that deep trees are memorizing noise rather than learning patterns that can generalize.
  3. Restricting depth helps (regularization): the best configuration combats this overfitting by constraining tree growth. The highest-performing model restricted the trees to a shallow depth (max_depth=3). By forcing the model to remain simple, the Training F1-Score decreased to 0.90, but the Validation F1-Score improved to its maximum of approximately 0.63, proving that shallower trees generalize much better for our deforestation thresholds.
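The natural next step after a search like this is to score the refit best_estimator_ on the untouched test set. Below is a minimal self-contained sketch of that pattern on synthetic data shaped like ours (make_classification, its weights, and the deliberately tiny grid are stand-ins, not the real features or search):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in: 103 rows, 20 features, ~70/30 class balance
X_d, y_d = make_classification(n_samples=103, n_features=20,
                               weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d, y_d, test_size=0.2, random_state=42, stratify=y_d
)
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      {'max_depth': [3, None]}, cv=5, scoring='f1')
search.fit(X_tr, y_tr)
# best_estimator_ has already been refit on the full training split
test_f1 = f1_score(y_te, search.best_estimator_.predict(X_te))
print(round(test_f1, 3))
```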

Boosting¶

So, a Random Forest (bagging) model builds deep trees completely independently of one another, and they all vote at the end. Boosting is a sequential approach where each tree learns from the mistakes of the previous one: instead of building independent trees in parallel, it builds them one by one. To explore this, XGBoost will be used next. Boosting helps model complex interacting socioeconomic factors rather than just averaged results. Because our dataset exhibits class imbalance, this sequential focus on "hard-to-predict" observations naturally forces the model to pay closer attention to the minority class (High-Risk countries).
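To make the sequential idea concrete, a small sketch on synthetic data (using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost) inspects the ensemble after each added tree via staged_predict, showing later trees correcting the earlier trees' training mistakes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X_d, y_d = make_classification(n_samples=200, random_state=0)
gb = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_d, y_d)

# Training error after each boosting stage (1 tree, 2 trees, ..., 50 trees)
errors = [1 - accuracy_score(y_d, pred) for pred in gb.staged_predict(X_d)]
print(errors[0], errors[-1])  # the full ensemble fits the training data at least as well
```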

XGBoost will follow the exact same GridSearchCV procedure, utilizing 5-Fold Stratified Cross-Validation and its own specific parameter tuning. Hence, it's imperative to first define the grid to feed the model.

  • n_estimators will use a similar range to avoid overfitting and follow literature standards: [50, 100, 150, 200]
  • learning_rate controls how aggressively each new tree corrects past mistakes; low rates are safer but slower, so a mix of low and moderate values is a good start: [0.01, 0.05, 0.1]
  • max_depth: earlier results showed that deeper trees cause more overfitting, and letting trees grow without limit goes wrong on small datasets; because boosting builds trees sequentially rather than in parallel, the values will be [3, 4, 5] (dropping None to avoid slow, overfit-prone fits)
  • colsample_bytree is the fraction of features sampled per tree; to test both a Random-Forest-style model and an unrestricted one, we feed 0.5 (half the features) and 1.0 (all of them)
  • reg_lambda is the L2 regularization strength; we compare the default penalty (1) against a slightly stronger one (2) to see the difference
In [30]:
boosting_grid = {
    'n_estimators':      [50, 100, 150, 200],
    'learning_rate':     [0.01, 0.05, 0.1],
    'max_depth':         [3, 4, 5],
    'colsample_bytree':  [0.5, 1],
    'reg_lambda':        [1, 2],
}

We configure GridSearchCV with the following parameters to ensure a rigorous and fair evaluation:

  • cv=5 (Internal Validation): Uses 5-Fold Cross-Validation to evaluate performance safely without leaking the Test Set.

  • eval_metric='logloss': Strictly penalizes confident but incorrect predictions, forcing the model to calibrate its probabilities.

  • scoring='f1': Prioritizes the balance of Precision and Recall to properly handle our class imbalance.

  • return_train_score=True: Saves the training fold scores so we can manually check for overfitting later.

  • n_jobs=-1: Utilizes all available CPU cores to maximize training speed.

In [31]:
from xgboost import XGBClassifier

# Note: use_label_encoder was dropped; recent XGBoost versions ignore it and warn
xgb_grid = GridSearchCV(
    XGBClassifier(random_state=42, eval_metric='logloss'),
    param_grid=boosting_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    return_train_score=True
)

xgb_grid.fit(X_train, y_train)
Out[31]:
GridSearchCV(cv=5,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, device=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False,
                                     eval_metric='logloss', feature_types=None,
                                     feature_weights=None, gamma=None,
                                     grow_policy=None, importance_type=None,
                                     interaction_constraint...
                                     max_delta_step=None, max_depth=None,
                                     max_leaves=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     multi_strategy=None, n_estimators=None,
                                     n_jobs=None, num_parallel_tree=None, ...),
             n_jobs=-1,
             param_grid={'colsample_bytree': [0.5, 1],
                         'learning_rate': [0.01, 0.05, 0.1],
                         'max_depth': [3, 4, 5],
                         'n_estimators': [50, 100, 150, 200],
                         'reg_lambda': [1, 2]},
             return_train_score=True, scoring='f1')
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.5, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, feature_weights=None, gamma=None,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.05, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=3, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=50, n_jobs=None,
              num_parallel_tree=None, ...)
In [32]:
results = pd.DataFrame(xgb_grid.cv_results_)

table = results[[
    'param_n_estimators',
    'param_learning_rate',
    'param_max_depth',
    'param_colsample_bytree',
    'param_reg_lambda',
    'mean_train_score',
    'mean_test_score',
    'std_test_score',
    'rank_test_score'      # rank 1 = best combo
]].sort_values('mean_test_score', ascending=False)

print(table.head(15).to_string())
    param_n_estimators  param_learning_rate  param_max_depth  param_colsample_bytree  param_reg_lambda  mean_train_score  mean_test_score  std_test_score  rank_test_score
32                  50                 0.05                4                     0.5                 1          0.989744         0.526234        0.194987                1
49                  50                 0.10                3                     0.5                 2          0.989744         0.526234        0.194987                1
57                  50                 0.10                4                     0.5                 2          0.994872         0.526234        0.194987                1
64                  50                 0.10                5                     0.5                 1          0.994872         0.526234        0.194987                1
65                  50                 0.10                5                     0.5                 2          0.994872         0.526234        0.194987                1
56                  50                 0.10                4                     0.5                 1          0.994872         0.526234        0.194987                1
43                 100                 0.05                5                     0.5                 2          0.994872         0.526234        0.194987                1
45                 150                 0.05                5                     0.5                 2          0.994872         0.526234        0.194987                1
42                 100                 0.05                5                     0.5                 1          0.994872         0.526234        0.194987                1
33                  50                 0.05                4                     0.5                 2          0.968129         0.526234        0.194987                1
35                 100                 0.05                4                     0.5                 2          0.994872         0.526234        0.194987                1
37                 150                 0.05                4                     0.5                 2          0.994872         0.526234        0.194987                1
27                 100                 0.05                3                     0.5                 2          0.989744         0.526234        0.194987                1
24                  50                 0.05                3                     0.5                 1          0.978656         0.526234        0.194987                1
25                  50                 0.05                3                     0.5                 2          0.968129         0.526234        0.194987                1

Analysis

  1. Severe Overfitting on Training Data. Across all top configurations, the mean_train_score approaches 1.0 (96% - 99%). Because Gradient Boosting sequentially hunts down and corrects errors, the algorithm aggressively memorized the 82 training instances. Despite strict regularization parameters (max_depth=3, reg_lambda=2), the model was too complex for the limited data and failed to generalize.
  2. The mean_test_score capped at 0.526, below our Random Forest baseline (0.626). Additionally, dozens of different hyperparameter configurations tied for Rank 1 with the exact same validation score. With small validation folds (~16 instances each), this suggests the overfitted models all collapsed into predicting the exact same subset of countries.
  3. The std_test_score hovers around ~0.19. This large variance indicates that the model's success depends heavily on how the cross-validation folds happened to be split.

While XGBoost is a state-of-the-art model for large tabular datasets, its aggressive sequential learning makes it highly prone to overfitting on extremely small, noisy datasets. In this context, between the two ensemble models, the independent averaging mechanism of the Random Forest proved much more robust and reliable.
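The diagnosis above can be reproduced generically: comparing mean_train_score against mean_test_score in cv_results_ makes the overfitting gap explicit. A minimal sketch on synthetic data, using a plain decision tree as a stand-in for XGBoost so it runs without the notebook's variables:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for our small tabular dataset (~80 training rows)
X, y = make_classification(n_samples=80, n_features=20, random_state=42)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={'max_depth': [2, 4, None]},
    cv=5,
    scoring='f1',
    return_train_score=True,
)
grid.fit(X, y)

results = pd.DataFrame(grid.cv_results_)
# The train-validation gap is the overfitting signal: near zero is healthy,
# large positive values mean the model memorized the training folds.
results['overfit_gap'] = results['mean_train_score'] - results['mean_test_score']
print(results[['param_max_depth', 'mean_train_score',
               'mean_test_score', 'overfit_gap']])
```

The unrestricted tree (max_depth=None) shows the same pathology as our XGBoost runs: a perfect training score paired with a clearly lower validation score.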

Support Vector Machines¶

Unlike the ensemble models, which build branching decision rules, an SVM plots every country in a multi-dimensional space and attempts to slide a smooth, continuous mathematical boundary (a hyperplane) between the Low-Risk and High-Risk classes.

Because SVMs optimize over the geometric distances between data points, they are highly sensitive to unscaled variables. If we do not scale our features, variables with massive absolute values (like 'Population') will mathematically overpower and drown out smaller percentage metrics (like 'Unemployment rate'). To prepare the data for the SVM, we standardize our features using Scikit-Learn's StandardScaler. Crucially, we call .fit_transform() on the training data so the scaler learns its baseline statistics there, but strictly .transform() on the test data. This ensures the scaler never sees the test set's averages, preventing data leakage.

In [33]:
from sklearn.preprocessing import StandardScaler

# define the scaler
scaler = StandardScaler()

# fit the scaler to the training data
X_train_scaled = scaler.fit_transform(X_train)

# only transform the test to avoid data leakage
X_test_scaled = scaler.transform(X_test)
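As a quick sanity check of the no-leakage contract described above (a sketch with synthetic stand-in arrays, since the real X_train and X_test come from the earlier split), only the training split should come out exactly standardized:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_tr = rng.normal(loc=50, scale=10, size=(80, 3))   # stand-in training features
X_te = rng.normal(loc=55, scale=10, size=(20, 3))   # stand-in test features

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)  # learns mean/std from training data only
X_te_s = scaler.transform(X_te)      # reuses training statistics: no leakage

# The training data comes out exactly centered and unit-scaled...
print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
# ...but the test data generally does not, because its distribution differs
print(X_te_s.mean(axis=0).round(2))
```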

Parameter definition and tuning decisions

  • The C parameter controls the trade-off between classifying training points correctly and maintaining a generalized decision boundary. A higher C creates a strict, narrow margin that penalizes misclassifications heavily (which can lead to overfitting). A lower C encourages a wider, "softer" margin, creating a simpler decision function at the cost of some training accuracy [4]. Because our dataset is small and prone to noise, smaller values of C are recommended to maximize generalization. Therefore, our search grid will test powers of 10: [0.1, 1, 10] [5].
  • gamma defines how far the mathematical influence of a single training example reaches [4]. This directly dictates the non-linear behavior of our decision boundary, which is critical for our hypothesis that deforestation risk is defined by complex, continuous interactions. A high gamma means data points only exert influence at a very close range, leading to a highly complex, "wiggly" boundary. A low gamma means points exert influence farther away, resulting in a smoother boundary. To prevent overfitting on our small dataset, we will avoid excessively large values and test a small localized range, [0.1, 1], following literature recommendations [6].
  • The kernel function transforms the input space into a higher-dimensional feature space where the data may become linearly separable. Common kernels include linear, polynomial, RBF, and sigmoid (gamma has no effect on the linear kernel). We test linear first because, if the data is linearly separable, it is the simpler option and less prone to overfitting; we also test rbf because it is the most universally effective kernel for non-linear patterns.
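The effect of C on margin softness can be seen directly in the number of support vectors: a softer margin (small C) lets more points sit inside the margin, so more of them become support vectors. A minimal sketch on synthetic data (not the deforestation features):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic 2-class data as a stand-in for the scaled country features
X, y = make_classification(n_samples=100, n_features=5,
                           class_sep=0.8, random_state=42)

counts = {}
for C in [0.1, 1, 10]:
    clf = SVC(kernel='rbf', gamma=0.1, C=C).fit(X, y)
    # n_support_ holds the per-class support vector counts
    counts[C] = int(clf.n_support_.sum())
    print(f"C={C:>4}: {counts[C]} support vectors")
```

As C grows, the margin narrows and the support vector count typically shrinks, which is exactly the strict-versus-soft trade-off described above.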
In [34]:
svm_grid = {
    'C': [0.1, 1, 10],
    'gamma': [0.1, 1],
    'kernel': ['linear', 'rbf'],
}

With our data properly scaled, we will now construct the GridSearchCV for Support Vector Machine. We will maintain the exact same rigorous cross-validation strategy used for our ensemble models to ensure a fair, one-to-one comparison.

  • We continue to use 5-Fold Stratified Cross-Validation (cv=5) to dynamically create validation sets within the training data, ensuring we find the best hyperparameters without ever exposing the model to the locked X_test.
  • return_train_score=True. We will retain the training fold scores to analyze the mathematical behavior and actively monitor for overfitting.
  • We will set n_jobs=-1 to use all available CPU cores to drastically reduce the training time.
In [35]:
from sklearn.svm import SVC

svm_grid = GridSearchCV(
    SVC(probability=True, random_state=42),
    param_grid=svm_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    return_train_score=True
)

svm_grid.fit(X_train_scaled, y_train)
Out[35]:
GridSearchCV(cv=5, estimator=SVC(probability=True, random_state=42), n_jobs=-1,
             param_grid={'C': [0.1, 1, 10], 'gamma': [0.1, 1],
                         'kernel': ['linear', 'rbf']},
             return_train_score=True, scoring='f1')
SVC(C=0.1, gamma=0.1, kernel='linear', probability=True, random_state=42)
In [36]:
results = pd.DataFrame(svm_grid.cv_results_)

table = results[[
    'param_kernel',
    'param_C',
    'param_gamma',
    'mean_train_score',
    'mean_test_score',
    'std_test_score',
    'rank_test_score'
]].sort_values('mean_test_score', ascending=False)

print(table.head(15).to_string())
   param_kernel  param_C  param_gamma  mean_train_score  mean_test_score  std_test_score  rank_test_score
0        linear      0.1          0.1          0.818194         0.619814        0.174714                1
2        linear      0.1          1.0          0.818194         0.619814        0.174714                1
9           rbf     10.0          0.1          1.000000         0.619402        0.192128                3
5           rbf      1.0          0.1          0.922267         0.605556        0.187906                4
6        linear      1.0          1.0          0.870112         0.556190        0.157140                5
4        linear      1.0          0.1          0.870112         0.556190        0.157140                5
10       linear     10.0          1.0          0.935311         0.532900        0.175319                7
8        linear     10.0          0.1          0.935311         0.532900        0.175319                7
3           rbf      0.1          1.0          0.000000         0.000000        0.000000                9
1           rbf      0.1          0.1          0.000000         0.000000        0.000000                9
7           rbf      1.0          1.0          1.000000         0.000000        0.000000                9
11          rbf     10.0          1.0          1.000000         0.000000        0.000000                9

Analysis

Evaluating the SVM grid search reveals a fascinating insight about the underlying geometric structure of our deforestation dataset. Our initial hypothesis assumed a highly complex, non-linear boundary, but cross-validation proved the opposite.

  1. The top-performing configurations used a purely linear kernel with strong regularization (a soft margin of C=0.1). This setup achieved a Validation F1-Score of 0.619 while keeping the Training F1-Score at 0.818 (not badly overfitted). This comparatively small train-validation gap indicates stable generalization.
  2. When the algorithm was allowed to use the RBF kernel (e.g., C=10.0, gamma=0.1), it overfit heavily. It achieved a perfect 1.0 Training F1-Score by drawing boundaries tailored to the training points, but its validation score failed to surpass the simple linear model.
  3. A high gamma severely restricts the radius of influence of each support vector. The model memorized the training data perfectly (Training Score = 1.0) by drawing tiny, isolated islands around the High-Risk instances, rendering it completely incapable of generalizing to unseen validation data.
  4. Across our top linear models, the std_test_score sits around 0.174. This moderate variance indicates that the Validation F1-Score fluctuates noticeably depending on how the cross-validation folds were partitioned, which is expected given the dataset's size, with only ~16 instances per validation fold.

Neural Networks¶

Finally, for the Neural Network model, hyperparameter tuning via GridSearchCV is not used. Unlike tree-based models, neural networks require defining the architecture itself (the number of layers, neurons, and the training configuration) rather than searching over a fixed set of algorithmic parameters. GridSearchCV is also particularly inefficient for neural networks because each parameter combination requires a full training run across many epochs, making the computational cost prohibitive. Neural networks excel at capturing complex non-linear relationships between features, but that same flexibility makes them highly prone to overfitting on small datasets; to reduce this, regularization is applied through early stopping. Additionally, because neural networks rely on gradient-based optimization, they are highly sensitive to feature scale, so the same StandardScaler preprocessing used for the SVM model is applied.

The architecture is defined as follows, based on literature for similar dataset sizes:

  • Activation Function: ReLU (Rectified Linear Unit) is used for all hidden layers. Compared to sigmoid in hidden layers, ReLU effectively avoids the vanishing gradient problem and has become the standard default for hidden layer activations in modern neural networks.

  • Output Layer: A single neuron with sigmoid activation is used in the output layer, which maps the network's output to a probability between 0 and 1 for binary classification. The predicted class is assigned as High Risk if the output probability exceeds the threshold (0.5).

  • Optimizer and learning rate: Adam is selected as the optimizer with the default learning rate of lr=0.001. Adam is computationally efficient, has modest memory requirements, and its default hyperparameters require little to no tuning in practice [7]. There is also literature demonstrating that the combination of ReLU and Adam consistently outperforms other pairings.

  • Layers: The neuron count per layer follows three rules of thumb [9]:

    • The number of hidden neurons should be between the size of the input and output layers
    • The number of hidden neurons should be 2/3 of the input layer size plus the output layer size
    • The number of hidden neurons should be less than twice the input layer size

    Applying Rule 2 to our model (20 input features, 1 output neuron):

    1. Layer 1: (2/3 × 20) + 1 ≈ 14 → rounded up to 16 (a clean value, within the Rule 3 bound of < 40)
    2. Layer 2: (2/3 × 14) + 1 ≈ 10 → rounded down to 8 (tapering toward the output)
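Spelled out as a throwaway calculation (not part of the model code), the rule-of-thumb arithmetic is:

```python
n_inputs, n_outputs = 20, 1

# Rule 2: hidden neurons ≈ (2/3 × previous layer size) + output layer size
layer1 = (2 / 3) * n_inputs + n_outputs   # ≈ 14.3, rounded up to 16 in the model
layer2 = (2 / 3) * 14 + n_outputs         # ≈ 10.3, rounded down to 8 in the model

# Rule 3 upper bound: fewer than twice the input layer size (< 40 neurons)
print(round(layer1, 1), round(layer2, 1), 2 * n_inputs)
```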

The coding steps:

  1. Initialize Environment: Set the random seed to guarantee completely reproducible results.
  2. Construct Architecture: Instantiate the Sequential model and stack the three defined layers (16 -> 8 -> 1).
  3. Compile the Model: Configure the network to use the Adam optimizer, track accuracy, and optimize for binary_crossentropy loss.
  4. Configure Regularization: Establish EarlyStopping (patience = 10, restore_best_weights=True) to automatically halt training and revert to the optimal mathematical state before overfitting occurs.
  5. Training: Train the network using a computationally efficient batch size of 16 [9] and an automatic 80/20 internal validation split (validation_split=0.2) to monitor performance dynamically.
In [37]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

# Set the random seed for reproducible results
tf.random.set_seed(42)

nn_model = Sequential([
    Dense(16, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dense(8, activation='relu'),
    Dense(1,  activation='sigmoid')
])

nn_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy']
)

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

history = nn_model.fit(
    X_train_scaled, y_train,
    epochs=100,
    batch_size=16,
    validation_split=0.2,
    callbacks=[early_stop],
    verbose=1
)

nn_model.summary()
Epoch 1/100
/usr/local/lib/python3.12/dist-packages/keras/src/layers/core/dense.py:106: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(activity_regularizer=activity_regularizer, **kwargs)
5/5 ━━━━━━━━━━━━━━━━━━━━ 2s 107ms/step - accuracy: 0.5231 - loss: 0.6993 - val_accuracy: 0.5882 - val_loss: 0.7116
Epoch 2/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 24ms/step - accuracy: 0.5231 - loss: 0.6749 - val_accuracy: 0.5882 - val_loss: 0.6933
Epoch 3/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step - accuracy: 0.6154 - loss: 0.6557 - val_accuracy: 0.6471 - val_loss: 0.6771
Epoch 4/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.6769 - loss: 0.6385 - val_accuracy: 0.6471 - val_loss: 0.6621
Epoch 5/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.7231 - loss: 0.6224 - val_accuracy: 0.7059 - val_loss: 0.6477
Epoch 6/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.7231 - loss: 0.6073 - val_accuracy: 0.8235 - val_loss: 0.6340
Epoch 7/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.7385 - loss: 0.5931 - val_accuracy: 0.8824 - val_loss: 0.6209
Epoch 8/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.7692 - loss: 0.5792 - val_accuracy: 0.8824 - val_loss: 0.6086
Epoch 9/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.7846 - loss: 0.5658 - val_accuracy: 0.8824 - val_loss: 0.5972
Epoch 10/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step - accuracy: 0.7846 - loss: 0.5527 - val_accuracy: 0.8824 - val_loss: 0.5864
Epoch 11/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.7692 - loss: 0.5401 - val_accuracy: 0.8824 - val_loss: 0.5766
Epoch 12/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.7846 - loss: 0.5282 - val_accuracy: 0.8824 - val_loss: 0.5675
Epoch 13/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.7846 - loss: 0.5165 - val_accuracy: 0.8824 - val_loss: 0.5588
Epoch 14/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.7846 - loss: 0.5051 - val_accuracy: 0.8824 - val_loss: 0.5503
Epoch 15/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.8154 - loss: 0.4942 - val_accuracy: 0.8824 - val_loss: 0.5411
Epoch 16/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.8308 - loss: 0.4838 - val_accuracy: 0.8824 - val_loss: 0.5325
Epoch 17/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8462 - loss: 0.4737 - val_accuracy: 0.8824 - val_loss: 0.5248
Epoch 18/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8462 - loss: 0.4639 - val_accuracy: 0.8824 - val_loss: 0.5178
Epoch 19/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step - accuracy: 0.8769 - loss: 0.4543 - val_accuracy: 0.8824 - val_loss: 0.5114
Epoch 20/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.8769 - loss: 0.4449 - val_accuracy: 0.8824 - val_loss: 0.5057
Epoch 21/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.8769 - loss: 0.4358 - val_accuracy: 0.8824 - val_loss: 0.5006
Epoch 22/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.8769 - loss: 0.4268 - val_accuracy: 0.8824 - val_loss: 0.4961
Epoch 23/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.8769 - loss: 0.4180 - val_accuracy: 0.8824 - val_loss: 0.4916
Epoch 24/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.8769 - loss: 0.4092 - val_accuracy: 0.8824 - val_loss: 0.4875
Epoch 25/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.8769 - loss: 0.4003 - val_accuracy: 0.8824 - val_loss: 0.4839
Epoch 26/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8923 - loss: 0.3917 - val_accuracy: 0.8824 - val_loss: 0.4807
Epoch 27/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8923 - loss: 0.3836 - val_accuracy: 0.8824 - val_loss: 0.4775
Epoch 28/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step - accuracy: 0.8923 - loss: 0.3757 - val_accuracy: 0.8824 - val_loss: 0.4735
Epoch 29/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8923 - loss: 0.3680 - val_accuracy: 0.8824 - val_loss: 0.4702
Epoch 30/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8769 - loss: 0.3607 - val_accuracy: 0.8824 - val_loss: 0.4675
Epoch 31/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8769 - loss: 0.3535 - val_accuracy: 0.8824 - val_loss: 0.4652
Epoch 32/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8923 - loss: 0.3464 - val_accuracy: 0.8824 - val_loss: 0.4633
Epoch 33/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8923 - loss: 0.3396 - val_accuracy: 0.8824 - val_loss: 0.4616
Epoch 34/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8923 - loss: 0.3332 - val_accuracy: 0.8824 - val_loss: 0.4602
Epoch 35/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.8923 - loss: 0.3269 - val_accuracy: 0.8824 - val_loss: 0.4599
Epoch 36/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8923 - loss: 0.3209 - val_accuracy: 0.8824 - val_loss: 0.4602
Epoch 37/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step - accuracy: 0.8923 - loss: 0.3150 - val_accuracy: 0.8235 - val_loss: 0.4606
Epoch 38/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.8923 - loss: 0.3094 - val_accuracy: 0.8235 - val_loss: 0.4613
Epoch 39/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.8923 - loss: 0.3041 - val_accuracy: 0.8235 - val_loss: 0.4621
Epoch 40/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step - accuracy: 0.8923 - loss: 0.2989 - val_accuracy: 0.8235 - val_loss: 0.4630
Epoch 41/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.8769 - loss: 0.2939 - val_accuracy: 0.8235 - val_loss: 0.4638
Epoch 42/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8923 - loss: 0.2891 - val_accuracy: 0.8235 - val_loss: 0.4646
Epoch 43/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.8923 - loss: 0.2843 - val_accuracy: 0.8235 - val_loss: 0.4656
Epoch 44/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.8923 - loss: 0.2797 - val_accuracy: 0.8235 - val_loss: 0.4666
Epoch 45/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8923 - loss: 0.2752 - val_accuracy: 0.8235 - val_loss: 0.4676
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_3 (Dense)                 │ (None, 16)             │           336 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_4 (Dense)                 │ (None, 8)              │           136 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_5 (Dense)                 │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 1,445 (5.65 KB)
 Trainable params: 481 (1.88 KB)
 Non-trainable params: 0 (0.00 B)
 Optimizer params: 964 (3.77 KB)

Because we tracked accuracy during training (Keras does not report F1 here), we will use the metrics from sklearn to get the F1 score of the model, so it can be compared with our other models, especially on the final test data.

In [38]:
from sklearn.metrics import f1_score

# 1. Ask the model to predict probabilities for the training data
nn_train_probs = nn_model.predict(X_train_scaled, verbose=0)

# 2. Convert probabilities (> 0.5) into hard 1s and 0s
nn_train_preds = (nn_train_probs > 0.5).astype(int)

# 3. Calculate the F1 Score
nn_train_f1 = f1_score(y_train, nn_train_preds)
print(f"Neural Network Training F1-Score: {nn_train_f1:.4f}")
Neural Network Training F1-Score: 0.8163

An F1-Score of 0.8163 on the training data shows healthy behaviour. It suggests the network learned the underlying patterns of the data without aggressively memorizing the noise (avoiding the perfect 1.000 training scores seen with XGBoost and the RBF SVM); the EarlyStopping regularization helped here.

However, a strong training score is only half the story. The ultimate test of this model's architecture is how well it generalizes to unseen data.

Final Testing¶

This final step uses the held-out test set to evaluate all four optimized models on completely unseen data. To ensure a fair comparison, each algorithm is evaluated on its appropriate data format and scored using the exact same target metric (F1-Score).

  • The Random Forest and XGBoost receive the raw X_test, as tree-based models don't require feature scaling. Conversely, the SVM and the Neural Network are fed X_test_scaled to prevent large-magnitude features from dominating their distance and gradient calculations.
  • Because the Keras Neural Network is designed to output a continuous probability (via the sigmoid activation function), a strict > 0.5 boolean threshold is applied. This maps the mathematical probabilities back into hard binary classifications (0 or 1) so it can be evaluated fairly against the other models.
  • The final F1-scores are calculated and aggregated into a structured Pandas DataFrame. Sorting this table in descending order strips away the training biases and reveals a clear, definitive ranking of which architecture truly generalizes best to real-world deforestation data.
In [39]:
# Random Forest Predictions (Uses original X_test)
rf_preds = rf_grid.predict(X_test)
rf_f1 = f1_score(y_test, rf_preds)

# XGBoost Predictions
xgb_preds = xgb_grid.predict(X_test)
xgb_f1 = f1_score(y_test, xgb_preds)

# SVM Predictions (Uses X_test_scaled)
svm_preds = svm_grid.predict(X_test_scaled)
svm_f1 = f1_score(y_test, svm_preds)

# Neural Network Predictions (Uses X_test_scaled)
nn_probs = nn_model.predict(X_test_scaled, verbose=0)
# convert probability to binary
nn_preds = (nn_probs > 0.5).astype(int)
nn_f1 = f1_score(y_test, nn_preds)

# Build the final comparison
leaderboard = pd.DataFrame({
    'Model': ['Random Forest', 'XGBoost', 'SVM', 'Neural Network'],
    'Test F1-Score': [rf_f1, xgb_f1, svm_f1, nn_f1]
})

# Sort the leaderboard from highest score to lowest score
leaderboard = leaderboard.sort_values(by='Test F1-Score', ascending=False).reset_index(drop=True)

print(leaderboard.to_string(index=False))
         Model  Test F1-Score
  SVM (Linear)       0.727273
Neural Network       0.666667
 Random Forest       0.500000
       XGBoost       0.285714

Conclusion¶

The final evaluation on the isolated test set reveals new dynamics regarding algorithm selection for small datasets, a subject comparatively understudied in formal academic research. The Linear Support Vector Machine (SVM) achieved the highest Test F1-Score (approx. 0.73), outperforming the Artificial Neural Network (0.667). A critical reflection on complexity, overfitting, and interpretability shows that an increase in mathematical complexity does not inherently yield better results.

  1. Performance and Consistency (The Triumph of Simplicity) The Linear SVM demonstrated exceptional predictive capability and consistency. By relying on a flat, mathematical hyperplane with a wide margin, the SVM avoided memorizing localized noise and successfully generalized to unseen data. While the Neural Network performed adequately, it experienced a noticeable drop from its training score (~0.82 to 0.66), indicating that the SVM’s rigid mathematical boundary is significantly better for this specific deforestation context.

  2. Did an increase in model complexity translate to clear improvements? The answer is no. The simplest algorithm secured first place. Conversely, the highly complex, tree-boosting XGBoost model suffered a severe predictive collapse (Test F1-Score: 0.2857), and the Neural Network likewise did not deliver better real test predictions. This shows that introducing excessive algorithmic complexity on a small dataset can be outright detrimental while also consuming more compute. A simple, straight geometric boundary proved superior to the complex, non-linear boundaries attempted by XGBoost and Deep Learning when data is scarce.

  3. Risks of Overfitting The severe limitations of the dataset size (~100 total instances) inherently introduce massive overfitting risks. The poor performance of XGBoost serves as a textbook example of an algorithm memorizing the training data but failing in the real world. Furthermore, the variance observed in the Neural Network illustrates that relying on Deep Learning for such small datasets requires regularization (and a degree of luck); otherwise, it falls into the same overfitting trap that critically hindered the ensemble models.

  4. Interpretability and Practical Application The final decision on model deployment strongly favors the Linear SVM across all metrics.

    • The "Black Box": While the Neural Network is powerful, it remains a "black box" that cannot easily explain the socioeconomic reasons why a region is at risk.
    • The "Understander": If the objective of this model is to advise global governments on environmental policy, interpretability is essential. Because the Linear SVM showed the most stable and the best performance so far, there is no need to sacrifice transparency for accuracy. The SVM can explicitly reveal which specific socioeconomic factors drive deforestation via its interpretable feature weights, making it the strongest candidate for both predictive performance and practical policy-making. Random Forests offer a similar window through their feature importances.
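The interpretability claim can be sketched concretely: a linear SVM exposes one learned weight per standardized feature via coef_, and ranking those weights by magnitude shows which inputs push a country toward the High-Risk side. A minimal sketch with synthetic data and illustrative (made-up) feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in; the feature names are hypothetical, not the real columns
X, y = make_classification(n_samples=100, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
feature_names = ['gdp_per_capita', 'population_growth',
                 'agricultural_land_pct', 'unemployment_rate']

X_scaled = StandardScaler().fit_transform(X)
svm = SVC(kernel='linear', C=0.1).fit(X_scaled, y)

# One weight per feature: the sign gives the direction of the push,
# the magnitude gives the influence on the decision boundary
weights = pd.Series(svm.coef_[0], index=feature_names)
print(weights.reindex(weights.abs().sort_values(ascending=False).index))
```

Note that coef_ is only available for the linear kernel; RBF models have no such per-feature weights, which is part of why the linear winner is also the interpretable one.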

Further Improvements

  1. The Neural Network could undergo a more exhaustive hyperparameter and architectural optimization process. Future iterations might explore deeper topologies or different neuron configurations, provided they continue to adhere to established architectural heuristics to prevent vanishing gradients.
  2. Subsequent analyses could expand the GridSearchCV parameter grids for the classical models. By incorporating new hyperparameter combinations, the algorithms could be fine-tuned to address the overfitting patterns and geometric-boundary findings identified in this study, especially XGBoost's regularization parameters such as alpha, lambda, and max_depth.
  3. The strict limitations of the dataset size could be mitigated by introducing a time dimension (e.g., tracking historical deforestation metrics year-over-year) to greatly increase the number of training samples. Additionally, advanced imputation techniques or synthetic data generation could be leveraged to include countries currently missing from the dataset (the full dataset could ultimately cover 200+ countries).
  4. A dedicated feature selection phase could be introduced prior to training. Isolating only the most mathematically predictive variables would reduce noise and dimensionality, which would be particularly beneficial for improving the generalization capabilities of the Random Forest and XGBoost models.
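Points 2 and 4 above can be combined in a single cross-validated search: folding a feature-selection step into the pipeline keeps the selection inside each CV fold (avoiding leakage), while the expanded grid tunes a regularizing depth parameter at the same time. The sketch below uses synthetic stand-in data and a Random Forest; the grid values shown are illustrative, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for the country-level feature matrix.
X, y = make_classification(n_samples=100, n_features=12, n_informative=4,
                           random_state=0)

# SelectKBest runs inside each CV fold, so the filter never sees test folds.
pipe = Pipeline([("select", SelectKBest(f_classif)),
                 ("rf", RandomForestClassifier(random_state=0))])
grid = {"select__k": [4, 6, 8],
        "rf__max_depth": [2, 4, None]}  # shallower trees act as regularizers
search = GridSearchCV(pipe, grid, cv=5, scoring="f1").fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```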

References¶

[1] Teo, H. C., Sarira, T. V., Tan, A. R. P., Cheng, Y., & Koh, L. P. (2024). Charting the future of high forest low deforestation jurisdictions. Proceedings of the National Academy of Sciences of the United States of America, 121(37), e2306496121. https://doi.org/10.1073/pnas.2306496121

[2] https://link.springer.com/article/10.1186/s12911-021-01688-3

[3] https://andrewpwheeler.com/2022/10/10/hyperparameter-tuning-for-random-forests/

[4] https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

[5] https://arxiv.org/pdf/1802.09596

[6] https://www.geeksforgeeks.org/machine-learning/gamma-parameter-in-svm/

[7] https://arxiv.org/pdf/1803.08375

[8] https://www.researchgate.net/publication/263889761_Introduction_to_Neural_Networks_for_Java

[9] https://arxiv.org/pdf/2003.12843