Deforestation Risk Classification Project
Why This Matters
Deforestation is one of the most pressing environmental challenges of our time. Every year, 10 million hectares of forest (an area roughly the size of Iceland) are lost to logging, agriculture, and urban expansion. This destruction doesn't just harm trees; it:
- Accelerates climate change by releasing stored carbon
- Destroys biodiversity and threatens countless species
- Disrupts water cycles and increases flooding risks
- Displaces indigenous communities who depend on forests
The problem? By the time we detect deforestation through satellite imagery, the damage is already done.
Our solution? Use machine learning to predict deforestation risk BEFORE it happens, allowing conservationists to intervene proactively rather than reactively.
This project demonstrates how data science can transform environmental protection from a reactive practice into a predictive, preventative strategy.
The Data
Dataset Overview
- Source: Global country-level environmental and socioeconomic indicators, plus annual deforestation data (Our World in Data)
- Samples: 138 observations (countries/regions)
- Features: 29 environmental, economic, and social variables
- Target: Binary classification (High Risk vs. Low Risk)
- Class Distribution: ~30% High Risk, ~70% Low Risk (imbalanced)
Key Features Used
After feature engineering and selection, we focused on:
- Density (Population/km²) - Human pressure on land
- Agricultural Land (%) - Expansion pressure
- Absolute Latitude - Distance from equator (climate proxy)
- Economic indicators - GDP, minimum wage, CPI
- Education levels - Primary education enrollment
- Social indicators - Infant mortality, life expectancy
- ... and additional features
Data Challenges
- Small dataset (n=138) - Risk of overfitting
- Class imbalance (70/30 split) - Required careful handling
- Mixed feature scales - Required standardization
- High dimensionality - Feature selection was critical
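One common way to handle a 70/30 imbalance is inverse-frequency class weighting. The sketch below applies the formula scikit-learn documents for `class_weight="balanced"`, `n_samples / (n_classes * class_count)`, to hypothetical labels. The label counts and the helper name `balanced_class_weights` are illustrative assumptions, not values or code from the project.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical labels with a 70/30 split (assumed coding: 1 = High Risk, 0 = Low Risk).
example_labels = [0] * 70 + [1] * 30
weights = balanced_class_weights(example_labels)
print(weights)  # the minority class receives the larger weight
```

With these weights, an error on a minority-class sample costs more than an error on a majority-class sample, which pushes the model toward higher recall on the rare class.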
Methodology
1. Data Preprocessing
✓ Train/Test Split (80/20) with stratification
✓ Feature engineering (Latitude → Abs_Latitude)
✓ Feature selection (dropped 3 low-importance features)
✓ Standardization (for distance-based models)
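The preprocessing steps above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic stand-in data, not the project's actual code: the column choices, random seeds, and variable names are assumptions, and the feature-selection step is omitted.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the real table: 138 rows, three illustrative features.
n = 138
latitude = rng.uniform(-60, 70, size=n)
density = rng.lognormal(mean=4, sigma=1, size=n)
agri_pct = rng.uniform(0, 80, size=n)
y = (rng.random(n) < 0.3).astype(int)  # roughly 30% positives

# Feature engineering: distance from the equator instead of signed latitude.
abs_latitude = np.abs(latitude)
X = np.column_stack([abs_latitude, density, agri_pct])

# 80/20 split that preserves the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Keeping the scaler inside a pipeline means it is fit on training data only.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_train, y_train)
print(X_train.shape, X_test.shape)
```

Fitting the scaler on the full dataset before splitting would leak test-set statistics into training, which is why the scaler lives inside the pipeline here.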
2. Models Compared
We implemented and rigorously tested 5 different algorithms:
| Model | Type | Complexity | Best For |
|---|---|---|---|
| Logistic Regression | Linear | Low | Baseline, interpretability |
| Linear Discriminant Analysis (LDA) | Linear | Low | Small datasets, assumes normality |
| Random Forest | Ensemble (Trees) | High | Non-linear patterns |
| Support Vector Machine (SVM) | Kernel-based | Medium | Small datasets, clear margins |
| Neural Networks | Deep Learning | High | Complex patterns (2 architectures tested) |
3. Rigorous Evaluation Protocol
To ensure fair comparison and prevent data leakage, we followed academic best practices:
Phase 1: Model Comparison (Cross-Validation Only)
- ✅ 5-fold stratified cross-validation on training data
- ✅ Hyperparameter tuning via GridSearchCV
- ✅ Class weight optimization (handling imbalance)
- ✅ Metrics: Recall, Precision, F1-Score, Accuracy
- ❌ NO test set evaluation (held out for final model only)
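A Phase 1 run along these lines might look like the sketch below. It uses synthetic data from `make_classification`, and the parameter grid, seeds, and step names (`scale`, `svm`) are assumptions chosen for illustration; the project's actual grid is not given here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 138 samples with a roughly 70/30 class split.
X, y = make_classification(
    n_samples=138, n_features=10, n_informative=5,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling happens inside each CV fold because it is part of the pipeline.
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="linear"))])

# Hypothetical grid: regularization strength and class weighting.
param_grid = {
    "svm__C": [0.01, 0.1, 1, 10],
    "svm__class_weight": [None, "balanced", {0: 1, 1: 2}],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    pipe, param_grid, cv=cv,
    scoring=["recall", "precision", "f1", "accuracy"],
    refit="f1",  # the best configuration is chosen by mean F1
)
search.fit(X_train, y_train)  # the test split is never touched here

print(search.best_params_)
print(round(search.best_score_, 3))
```

`X_test` and `y_test` stay unused until a single model has been selected, which mirrors the held-out rule in the last bullet.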
Phase 2: Final Model Selection
- Selected SVM based on highest F1-score (0.723) and best recall-precision balance
- Justification: Superior generalization for small datasets, mathematically stable decision boundary
Phase 3: Final Evaluation
- ✅ Test set evaluation (only AFTER model selection)
- ✅ Real-world performance estimation
- ✅ Interpretable results for stakeholders
4. Special Considerations for Small Datasets
Given our limited data (n=138), we took special precautions:
- LDA: Used shrinkage regularization (0.9) to stabilize covariance matrix
- Random Forest: Aggressive depth limiting (max_depth=5) to prevent overfitting
- Neural Networks: Minimalist architectures (16→8→1), early stopping, class weighting
- SVM: Linear kernel (avoids overfitting from polynomial/RBF kernels)
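In scikit-learn terms, the non-neural settings listed above could be expressed roughly as follows. The shrinkage value, depth limit, and linear kernel come from the list; the `lsqr` solver (shrinkage needs `lsqr` or `eigen`), the `class_weight="balanced"` choice, the seeds, and the synthetic data are assumptions made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data with the same size and rough class balance.
X, y = make_classification(
    n_samples=138, n_features=10, n_informative=5,
    weights=[0.7, 0.3], random_state=0,
)

models = {
    # Shrinkage regularizes the covariance estimate; not available with the default svd solver.
    "LDA (shrinkage=0.9)": LinearDiscriminantAnalysis(solver="lsqr", shrinkage=0.9),
    # Shallow trees limit how closely the forest can fit a small sample.
    "Random Forest (max_depth=5)": RandomForestClassifier(
        max_depth=5, class_weight="balanced", random_state=0
    ),
    # A linear kernel keeps the decision boundary simple.
    "Linear SVM": SVC(kernel="linear", class_weight="balanced"),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="f1").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: mean F1 = {score:.3f}")
```

Scores on this synthetic data say nothing about the project's results; the point is only how each precaution maps to a constructor argument.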
Cross-Validation Results (Training Data)
| Model | Recall | Precision | F1-Score | Accuracy |
|---|---|---|---|---|
| SVM ⭐ | 80.0% | 68.5% | 72.3% | 81.9% |
| LDA | 80.0% | 65.4% | 70.8% | 79.1% |
| Logistic Regression | 80.0% | 46.5% | 58.8% | 62.8% |
| Random Forest | 56.0% | 67.3% | 57.7% | 76.3% |
| Neural Network (v1) | 56.0% | 65.5% | 54.8% | 74.0% |
| Neural Network (v2) | 56.0% | 51.7% | 52.0% | 67.6% |
⭐ Winner: Linear SVM - Best overall F1-score and the most balanced performance. This ranking is context-dependent: every metric above is reported for the target class only (class 1).
Key Insights
✅ What Worked:
- Linear models (LR, LDA, SVM) outperformed complex models (RF, NN) on recall and F1-score
- Class weighting successfully addressed the 70/30 imbalance
- Shrinkage regularization prevented overfitting in LDA
- Simple architectures worked best for limited data
❌ What Didn't Work:
- Deep learning struggled with the small dataset (overfitting risk)
- Random Forest's complexity led to lower recall
- Aggressive class weights (balanced) caused precision collapse
Domain Context:
- Recall prioritized: Missing a high-risk area (false negative) = ecological disaster
- Precision matters too: False alarms waste conservation resources
- F1-Score balance: SVM's 72.3% represents the best tradeoff among the models compared
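To make the recall/precision tradeoff concrete, the sketch below computes the three metrics from confusion-matrix counts for the positive class. The counts are hypothetical and chosen only to illustrate the arithmetic; they are not the project's confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical outcome: 8 high-risk areas caught, 2 missed, 4 false alarms.
precision, recall, f1 = precision_recall_f1(tp=8, fp=4, fn=2)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

One caveat when reading cross-validated tables: if each metric is averaged across folds, the reported F1 will generally differ from the F1 computed from the averaged precision and recall.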
Technologies Used
Core Libraries
- Python 3.8+ - Programming language
- NumPy & Pandas - Data manipulation and analysis
- Scikit-Learn - Machine learning models and pipelines
- TensorFlow/Keras - Neural network implementations
- Matplotlib & Seaborn - Static visualizations
Interactive Components
- Streamlit - Web app framework for Model Arena
- Plotly - Interactive visualizations and dashboards
- Jupyter Notebook - Analysis and experimentation
- Git - Version control
- Google Colab - Cloud computing (initial development)
Key Techniques
- Stratified K-Fold Cross-Validation
- Grid Search Hyperparameter Tuning
- Pipeline Architecture (prevent data leakage)
- Class Weight Balancing
- Regularization (L2, shrinkage)
- Early Stopping & Learning Rate Scheduling
Interactive App: Model Arena
Want to play with the models and predict deforestation risk yourself?
We built a fun, interactive Streamlit app with two modes:
Mode 1: Model Arena
"Pick your fighters and watch them battle!"
- Select any 2 models (e.g., SVM vs. Random Forest)
- Click "BATTLE!"
- See head-to-head metric comparison
- A winner is declared based on F1-score
- Visual battle statistics chart
Perfect for understanding model tradeoffs!
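The winner rule can be sketched without any UI code. The metric values below are copied from the cross-validation results table; the function name `battle`, the `CV_RESULTS` dictionary, and the tie handling are hypothetical and not taken from `model_arena_app.py`.

```python
# Cross-validation metrics for the target class, copied from the results table.
CV_RESULTS = {
    "SVM": {"recall": 0.800, "precision": 0.685, "f1": 0.723, "accuracy": 0.819},
    "LDA": {"recall": 0.800, "precision": 0.654, "f1": 0.708, "accuracy": 0.791},
    "Logistic Regression": {"recall": 0.800, "precision": 0.465, "f1": 0.588, "accuracy": 0.628},
    "Random Forest": {"recall": 0.560, "precision": 0.673, "f1": 0.577, "accuracy": 0.763},
    "Neural Network (v1)": {"recall": 0.560, "precision": 0.655, "f1": 0.548, "accuracy": 0.740},
    "Neural Network (v2)": {"recall": 0.560, "precision": 0.517, "f1": 0.520, "accuracy": 0.676},
}


def battle(model_a, model_b, results=CV_RESULTS):
    """Compare two models metric by metric; the higher F1-score wins."""
    a, b = results[model_a], results[model_b]
    # Positive delta means model_a is ahead on that metric.
    deltas = {metric: round(a[metric] - b[metric], 3) for metric in a}
    if a["f1"] > b["f1"]:
        winner = model_a
    elif b["f1"] > a["f1"]:
        winner = model_b
    else:
        winner = None  # assumed behavior: equal F1 is reported as a tie
    return {"winner": winner, "deltas": deltas}


outcome = battle("SVM", "Random Forest")
print(outcome["winner"], outcome["deltas"])
```

Deciding on F1 alone keeps the rule simple, but the per-metric deltas show why a model can win overall while trailing on a single metric.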
Mode 2: Risk Predictor
"Predict deforestation risk for any region!"
- Adjust sliders for all environmental/economic features
- Get instant HIGH RISK 🚨 or LOW RISK ✅ prediction
- See confidence scores
- Get actionable recommendations
- Compare predictions across all models
Perfect for testing hypothetical scenarios!
How to Launch
streamlit run model_arena_app.py
Or just double-click:
Opens in browser at localhost
Project Files:
**Built with love for the Planet**