DeforestationLogisticClassifier

Deforestation Logistic: Classifying Countries at Critical Forest Loss Risk

📖 Purpose of Study

Can socioeconomic and demographic data alone identify which countries are actively destroying their forests?

This project transitions from the continuous regression approach developed in the previous phase into a binary classification problem, using Logistic Regression to predict whether a country is at critical deforestation risk or not. The central challenge is not just building a model; it is justifying the construction of a meaningful target variable from a continuous rate, handling a naturally imbalanced class distribution, and choosing an evaluation framework that reflects real ecological stakes. This repo demonstrates the full classification workflow: from defining the problem with scientific grounding, to cross-validating rigorously, to interpreting what the model actually learned about the world.

📊 The Data

The dataset (simplified_df.csv) is a pre-processed combination of environmental and socioeconomic country-level records (n = 103 countries after filtering non-forested nations), built and cleaned during the linear regression phase of this project series. The target variable Deforestation_Critical is derived from the annual deforestation rate.
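As a rough illustration of how a binary target can be derived from a continuous rate, here is a minimal sketch. The column name `deforestation_rate`, the toy rows, and the cutoff value are all placeholders invented for this example; they are not the project's actual column name or its literature-based threshold.

```python
import pandas as pd

# Hypothetical names and values, for illustration only.
RATE_COL = "deforestation_rate"      # assumed column name, not taken from simplified_df.csv
PLACEHOLDER_THRESHOLD = 0.5          # placeholder cutoff, NOT the threshold used in the notebook

# Toy stand-in for the country-level table.
df = pd.DataFrame(
    {
        "Country": ["A", "B", "C", "D"],
        RATE_COL: [0.1, 0.8, 0.5, 1.2],
    }
)

# 1 = critical risk (rate strictly above the cutoff), 0 = otherwise.
df["Deforestation_Critical"] = (df[RATE_COL] > PLACEHOLDER_THRESHOLD).astype(int)

# Share of each class, useful for checking how imbalanced the target is.
class_share = df["Deforestation_Critical"].value_counts(normalize=True)
print(class_share)
```

Checking the class shares right after building the target is a cheap way to see whether the chosen cutoff produces an imbalanced split that needs handling downstream.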

Target Variable (constructed):

Predictor Features (selected):

🛠 Main Conceptual Applications

The core purpose of this notebook is to rigorously apply classification and model evaluation concepts to a real-world ecological problem. The key technical applications shown are:

  1. Target Variable Engineering: Constructing a binary class from a continuous variable using a literature-validated ecological threshold, with full justification for choosing it over statistical alternatives (median, mean, 75th percentile).
  2. Class Imbalance Handling: Using class_weight='balanced' in the Logistic Regression to counter the natural 70/30 split without resampling.
  3. Leakage-Free Evaluation via Pipeline: Wrapping StandardScaler and LogisticRegression in make_pipeline() so the scaler is re-fitted on the training portion of each CV fold rather than pre-applied to the entire training set.
  4. Stratified K-Fold Cross-Validation (k=5): Estimating model performance exclusively on training data, preserving class proportions in every fold.
  5. Decision Threshold Analysis: Simulating model behavior at thresholds of 0.3, 0.4, and 0.6 using out-of-fold CV probabilities to find the ecologically optimal cutoff.
  6. Diagnostic Metrics:
    • Recall, Precision, F1-Score: Evaluated for the minority class (High Risk).
    • Confusion Matrix: Visualized as heatmaps for both CV and test stages.
    • ROC Curve & AUC: Threshold-independent measure of discriminative quality.
  7. Coefficient Interpretation: Extracting and visualizing log-odds coefficients from the fitted pipeline to understand feature direction and relative importance.
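
The workflow in the list above can be sketched roughly as follows. This is a minimal, hedged example rather than the notebook's actual code: it uses synthetic data from `make_classification` as a stand-in for the training split, and the random seeds, `max_iter`, and variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption: roughly 70/30 classes, 6 features).
X, y = make_classification(
    n_samples=103, n_features=6, n_informative=4,
    weights=[0.7, 0.3], random_state=0,
)

# Scaler and classifier in one pipeline, so scaling is re-fitted inside each CV fold.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified 5-fold CV keeps the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probabilities for the positive (High Risk) class.
oof_proba = cross_val_predict(pipe, X, y, cv=cv, method="predict_proba")[:, 1]

# Threshold-independent summary of discriminative quality.
auc = roc_auc_score(y, oof_proba)
print(f"Out-of-fold ROC AUC: {auc:.3f}")

# Recall and precision for the minority class at several decision thresholds.
results = {}
for threshold in (0.3, 0.4, 0.6):
    pred = (oof_proba >= threshold).astype(int)
    results[threshold] = (
        recall_score(y, pred),
        precision_score(y, pred, zero_division=0),
    )
    print(f"threshold={threshold}: recall={results[threshold][0]:.2f}, "
          f"precision={results[threshold][1]:.2f}")

# Fit on all training data and read the log-odds coefficients (on standardized features).
pipe.fit(X, y)
coefs = pipe.named_steps["logisticregression"].coef_.ravel()
print(coefs)
```

Because the thresholds are applied to out-of-fold probabilities, the cutoff can be compared without touching the held-out test set. Lowering the threshold can only add predicted positives, so recall for the High Risk class never decreases as the cutoff drops, usually at some cost in precision.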

🚀 Key Findings

The full analysis, decisions, and narrative can be found in the notebook. The main findings are:

Project Files

References

[1] Teo, H. C., Sarira, T. V., Tan, A. R. P., Cheng, Y., & Koh, L. P. (2024). Charting the future of high forest low deforestation jurisdictions. Proceedings of the National Academy of Sciences, 121(37), e2306496121. https://doi.org/10.1073/pnas.2306496121