Measuring deforestation and accurately classifying its risks is fundamentally about protecting the planet’s future, transforming raw environmental data into actionable insights for global conservation. Tackling this ecological crisis through data science brings a deep sense of personal purpose to the work, shifting the narrative from abstract climate anxiety to concrete, problem-solving momentum. However, because the drivers of deforestation—ranging from macro-economic policies to localized agricultural practices—are incredibly diverse, no single algorithm can perfectly capture reality. Testing a wide array of machine learning approaches, from rigid linear boundaries to complex deep learning architectures, is absolutely essential to uncover the true underlying geometry of these threats, ensuring that future environmental policies are guided by the most mathematically sound, transparent, and robust evidence possible.
Can country-level socioeconomic indicators reliably identify which nations are at critical deforestation risk?
In this phase of the project, deforestation risk is framed as a binary classification task and compared across four modeling families with different assumptions: Random Forest, XGBoost, SVM, and a Neural Network. The goal is not only to maximize a score, but to stress-test whether different model types converge on the same risk signal under a leakage-aware workflow. This notebook is built as a portfolio piece and as a practical ecological screening experiment: if higher-risk countries can be detected early, policy attention can be prioritized before irreversible forest loss accelerates. The earlier notebooks in this series, covering where the data comes from and how it was cleaned, can be found in the ML and AI portfolio.
The modeling dataset (classified_deforestation.csv) contains country-level records with engineered socioeconomic and demographic predictors and a binary target: Deforestation_Critical = 1 for high-risk deforestation jurisdictions and 0 otherwise. The data are split into 82 training and 21 test records, and the final modeling matrix uses ~20 predictors (after the cleaning and preparation done in prior phases). The dataset was explored and cleaned in earlier notebooks; keeping that step separate avoids over-cleaning and makes it possible to test different approaches on the same dataset as new concepts and models are learned.
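The split described above can be sketched as follows. This is a minimal illustration with synthetic stand-in data (the real notebook would read classified_deforestation.csv and use its actual predictor columns, which are assumptions here):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for classified_deforestation.csv; in the real
# notebook this would be pd.read_csv("classified_deforestation.csv").
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(103, 20)),
                  columns=[f"feature_{i}" for i in range(20)])
df["Deforestation_Critical"] = rng.integers(0, 2, size=103)

X = df.drop(columns="Deforestation_Critical")
y = df["Deforestation_Critical"]

# A stratified 80/20 split preserves the class balance in both halves and
# yields the 82-train / 21-test division on 103 country records.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(len(X_train), len(X_test))  # 82 21
```

Stratifying on the target matters here: with only ~100 rows, an unstratified split can easily skew the proportion of critical-risk countries between train and test.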
The workflow follows a leakage-aware pattern:

- Preprocessing: feature scaling (StandardScaler fit on train only).
- Hyperparameter tuning: GridSearchCV (F1 scoring) for Random Forest, XGBoost, and SVM.
- Neural Network: early stopping (monitor='val_loss', restore_best_weights=True).

| Model | Test F1-Score |
|---|---|
| SVM | 0.7273 |
| Neural Network | 0.6667 |
| Random Forest | 0.5000 |
| XGBoost | 0.2857 |
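The leakage-aware tuning step can be sketched like this. The example uses the SVM branch with an illustrative parameter grid (the C and kernel values are assumptions, not the notebook's actual grid) and synthetic data in place of the real predictors:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in matching the dataset's rough shape (~103 x 20).
X, y = make_classification(n_samples=103, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Putting the scaler inside the pipeline means GridSearchCV refits it on
# each CV training fold only, so validation folds never leak into scaling.
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
param_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```

Tuning on F1 rather than accuracy is the sensible choice here, since the critical-risk class is the one policy attention should not miss.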
These are the direct outputs from the current notebook execution (DeforestationClassifier_SVMsANNsEnsemble.ipynb).
In strict evaluation practice, the held-out test set should be used once, for final reporting, after model selection has been finalized through cross-validation or a separate validation set. In this notebook, the test-set comparison serves as a practical benchmark table. For formal reporting, state this caveat clearly and avoid claiming the test-selected top model as an unbiased final winner.
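The stricter protocol described above can be sketched as follows: pick the winner by cross-validated F1 on the training data only, then touch the test set exactly once. The candidate models and synthetic data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=103, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=1),
}

# Model selection uses cross-validated F1 over the TRAINING data only...
cv_f1 = {name: cross_val_score(m, X_train, y_train, scoring="f1", cv=5).mean()
         for name, m in candidates.items()}
winner = max(cv_f1, key=cv_f1.get)

# ...and the held-out test set is used exactly once, for final reporting.
final = candidates[winner].fit(X_train, y_train)
test_f1 = f1_score(y_test, final.predict(X_test))
print(winner, round(test_f1, 3))
```

The benchmark-table shortcut in the notebook is fine for a portfolio comparison, but this pattern is what an unbiased "final winner" claim would require.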
Repository contents:

- DeforestationClassifier_SVMsANNsEnsemble.ipynb: Main notebook with preprocessing, tuning, training, and model comparison.
- classified_deforestation.csv: Classification-ready dataset.
- DeforestationClassifier_SVMsANNsEnsemble.html: Readable HTML export of the main notebook with all the processing, models, etc.

Stack: Python · pandas · numpy · scikit-learn · xgboost · tensorflow/keras · matplotlib
I built this project as part of my ongoing deforestation modeling series because I wanted to push beyond a single algorithm and test how robust the risk signal is across very different model types. For me, this is where data science gets meaningful: not just fitting models, but asking whether the conclusions still hold when the math changes.