Can socioeconomic and demographic data alone identify which countries are actively destroying their forests?
This project transitions from the continuous regression approach developed in the previous phase into a binary classification problem, using Logistic Regression to predict whether a country is at critical deforestation risk or not. The central challenge is not just building a model; it is justifying the construction of a meaningful target variable from a continuous rate, handling a naturally imbalanced class distribution, and choosing an evaluation framework that reflects real ecological stakes. This repo demonstrates the full classification workflow: from defining the problem with scientific grounding, to cross-validating rigorously, to interpreting what the model actually learned about the world.
The dataset (simplified_df.csv) is a pre-processed combination of environmental and socioeconomic country-level records (n = 103 countries after filtering non-forested nations), built and cleaned during the linear regression phase of this project series. The target variable Deforestation_Critical is derived from the annual deforestation rate.
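As a rough illustration of how a binary label can be derived from a continuous rate, the sketch below thresholds a rate column at the 0.501% cutoff stated in the target definition. The column name `Deforestation_Rate`, the helper `add_critical_label`, and the toy values are assumptions made for this example; they are not taken from the notebook.

```python
import pandas as pd

# Cutoff from the README: annual deforestation rate > 0.501% of forested area.
CRITICAL_RATE_THRESHOLD = 0.501


def add_critical_label(df, rate_col="Deforestation_Rate"):
    """Return a copy of df with a binary Deforestation_Critical column.

    rate_col is a hypothetical column name holding the annual rate in percent.
    """
    out = df.copy()
    # Strictly greater than the threshold -> 1 (High Risk), otherwise 0 (Low Risk).
    out["Deforestation_Critical"] = (out[rate_col] > CRITICAL_RATE_THRESHOLD).astype(int)
    return out


# Toy rows for illustration only (not real country data).
toy = pd.DataFrame({"Deforestation_Rate": [0.10, 0.501, 0.75, 1.20]})
labelled = add_critical_label(toy)

print(labelled["Deforestation_Critical"].tolist())  # [0, 0, 1, 1]
# Class shares, useful for checking how imbalanced the resulting label is.
print(labelled["Deforestation_Critical"].value_counts(normalize=True))
```

Because the comparison is strict, a rate of exactly 0.501% falls in Class 0, matching the "> 0.501%" wording of the definition.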
Target Variable (constructed):
Deforestation_Critical: Binary classification label, 1 (High Risk: annual deforestation rate > 0.501% of forested area) or 0 (Low Risk). Threshold sourced from Teo et al., PNAS 2024 [1]. Approximately 70% Class 0 / 30% Class 1.
Predictor Features (selected):
Agricultural Land (%): Share of total land under agricultural use; a proxy for land conversion pressure.
Birth Rate: Crude birth rate per 1,000 people; reflects demographic pressure on natural resources.
Co2-Emissions: Annual CO₂ output; correlated with extractive and industrial activity.
GDP: Total economic output; higher-income nations tend to invest in conservation.
Life Expectancy: Average lifespan; a proxy for institutional quality and human development.
Gross Tertiary Education Enrollment (%): University enrollment rate; reflects educated, conservation-aware populations.
Infant Mortality: Deaths per 1,000 live births; a development indicator inversely correlated with environmental governance.
Physicians per Thousand: Healthcare access; a broader proxy for state institutional capacity.
Urban Population: Share of population in urban areas; urbanization patterns affect land-use pressure.
CPI, CPI Change (%), Tax Revenue (%), Unemployment Rate, and others: Macroeconomic context variables.
Latitude / Longitude: Geographic position; captures tropical-belt effects not explained by economics alone.
The core purpose of this notebook is to rigorously apply classification and model evaluation concepts to a real-world ecological problem. The key technical applications shown are:
Setting class_weight='balanced' in the Logistic Regression to counter the natural 70/30 split without resampling.
Combining StandardScaler and LogisticRegression into a make_pipeline() so the scaler is re-fitted inside each CV fold, not pre-applied to the entire training set.
The full analysis, decisions, narrative, and main findings can be found in the notebook.
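The pipeline and class-weighting setup described above can be sketched as follows. This is a minimal example under stated assumptions, not the notebook's code: the data is synthetic (generated to mimic 103 rows with a roughly 70/30 split), and the choice of 5 stratified folds and recall as the metric is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the country-level data (assumption): 103 rows,
# 10 numeric features, roughly 70% Class 0 / 30% Class 1.
X, y = make_classification(
    n_samples=103,
    n_features=10,
    n_informative=5,
    weights=[0.7, 0.3],
    random_state=0,
)

# Scaler and classifier form one estimator, so cross-validation clones and
# refits the scaler on each fold's training split only (no leakage from the
# held-out fold). class_weight="balanced" reweights classes inversely to
# their frequency instead of resampling.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the class ratio similar in every split.
# Fold count and scoring metric are illustrative assumptions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="recall")

print("Recall per fold:", scores)
print("Mean recall:", scores.mean())
```

Passing the whole pipeline to `cross_val_score`, instead of scaling `X` once up front, is what keeps the held-out fold's statistics out of the scaler.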
[1] Teo, H. C., Sarira, T. V., Tan, A. R. P., Cheng, Y., & Koh, L. P. (2024). Charting the future of high forest low deforestation jurisdictions. Proceedings of the National Academy of Sciences, 121(37), e2306496121. https://doi.org/10.1073/pnas.2306496121