Measuring deforestation and accurately classifying its risks is fundamentally about protecting the planet’s future, transforming raw environmental data into actionable insights for global conservation. Tackling this ecological crisis through data science brings a deep sense of personal purpose to the work, shifting the narrative from abstract climate anxiety to concrete, problem-solving momentum. However, because the drivers of deforestation—ranging from macro-economic policies to localized agricultural practices—are incredibly diverse, no single algorithm can perfectly capture reality. Testing a wide array of machine learning approaches, from rigid linear boundaries to complex deep learning architectures, is absolutely essential to uncover the true underlying geometry of these threats, ensuring that future environmental policies are guided by the most mathematically sound, transparent, and robust evidence possible.
Can country-level socioeconomic indicators reliably identify which nations are at critical deforestation risk?
In this phase of the project, deforestation risk is framed as a binary classification task and compared across four modeling families with different assumptions: Random Forest, XGBoost, SVM, and a Neural Network. The goal is not only to maximize a score, but to stress-test whether different model types converge on the same risk signal under a leakage-aware workflow. This notebook is built as a portfolio piece and as a practical ecological screening experiment: if higher-risk countries can be detected early, policy attention can be prioritized before irreversible forest loss accelerates. The earlier notebooks in this series, covering where the data comes from and how it was cleaned, can be found in the ML and AI portfolio.
The modeling dataset (classified_deforestation.csv) contains country-level records with engineered socioeconomic and demographic predictors and a binary target: Deforestation_Critical = 1 for high-risk deforestation jurisdictions and 0 otherwise. The data are split into 82 training and 21 test records, and the final modeling matrix uses ~20 predictors (after the cleaning and preparation done in prior phases). The dataset was explored and cleaned in earlier notebooks; keeping that step separate avoids over-cleaning and makes it possible to test different approaches on the same dataset as new concepts and models are learned.
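The split described above can be sketched as follows. This is a minimal illustration with synthetic stand-in data (the real notebook would read classified_deforestation.csv and use its actual predictor columns, which are assumptions here):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for classified_deforestation.csv; in the real
# notebook this would be pd.read_csv("classified_deforestation.csv").
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(103, 20)),
                  columns=[f"feature_{i}" for i in range(20)])
df["Deforestation_Critical"] = rng.integers(0, 2, size=103)

X = df.drop(columns="Deforestation_Critical")
y = df["Deforestation_Critical"]

# A stratified 80/20 split preserves the class balance in both halves and
# yields the 82-train / 21-test division on 103 country records.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(len(X_train), len(X_test))  # 82 21
```

Stratifying on the target matters here: with only ~100 rows, an unstratified split can easily skew the proportion of critical-risk countries between train and test.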
The workflow follows a leakage-aware pattern:

- Preprocessing: feature scaling (StandardScaler fit on train only).
- Hyperparameter tuning: GridSearchCV (F1 scoring) for Random Forest, XGBoost, and SVM.
- Neural Network: early stopping (monitor='val_loss', restore_best_weights=True).

| Model | Test F1-Score |
|---|---|
| SVM | 0.7273 |
| Neural Network | 0.6667 |
| Random Forest | 0.5000 |
| XGBoost | 0.2857 |
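The leakage-aware tuning step can be sketched like this. The example uses the SVM branch with an illustrative parameter grid (the C and kernel values are assumptions, not the notebook's actual grid) and synthetic data in place of the real predictors:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in matching the dataset's rough shape (~103 x 20).
X, y = make_classification(n_samples=103, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Putting the scaler inside the pipeline means GridSearchCV refits it on
# each CV training fold only, so validation folds never leak into scaling.
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
param_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```

Tuning on F1 rather than accuracy is the sensible choice here, since the critical-risk class is the one policy attention should not miss.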
These are the direct outputs from the current notebook execution (DeforestationClassifier_SVMsANNsEnsemble.ipynb).
In strict evaluation practice, the held-out test set should be used once, for final reporting, after model selection has been finalized through cross-validation or a separate validation set. In this notebook, the test-set comparison serves as a practical benchmark table. For formal reporting, state this caveat clearly and avoid claiming the test-selected top model as an unbiased final winner.
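The stricter protocol described above can be sketched as follows: pick the winner by cross-validated F1 on the training data only, then touch the test set exactly once. The candidate models and synthetic data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=103, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=1),
}

# Model selection uses cross-validated F1 over the TRAINING data only...
cv_f1 = {name: cross_val_score(m, X_train, y_train, scoring="f1", cv=5).mean()
         for name, m in candidates.items()}
winner = max(cv_f1, key=cv_f1.get)

# ...and the held-out test set is used exactly once, for final reporting.
final = candidates[winner].fit(X_train, y_train)
test_f1 = f1_score(y_test, final.predict(X_test))
print(winner, round(test_f1, 3))
```

The benchmark-table shortcut in the notebook is fine for a portfolio comparison, but this pattern is what an unbiased "final winner" claim would require.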
Repository contents:

- DeforestationClassifier_SVMsANNsEnsemble.ipynb: Main notebook with preprocessing, tuning, training, and model comparison.
- classified_deforestation.csv: Classification-ready dataset.
- DeforestationClassifier_SVMsANNsEnsemble.html: Readable HTML export of the main notebook with all the processing, models, etc.

Stack: Python · pandas · numpy · scikit-learn · xgboost · tensorflow/keras · matplotlib
I built this project as part of my ongoing deforestation modeling series because I wanted to push beyond a single algorithm and test how robust the risk signal is across very different model types. For me, this is where data science gets meaningful: not just fitting models, but asking whether the conclusions still hold when the math changes.