deforestation-data-insights

🌳 Deforestation and Its Hidden Drivers

A regression-based analysis examining which country-level socio-economic indicators are associated with annual deforestation.

Each year, millions of hectares of forest are lost worldwide. While deforestation is often linked to agriculture or development, the broader socio-economic patterns behind it are less obvious. This project merges two public datasets (World Bank indicators + Our World in Data deforestation records) across 106 countries to model and predict annual forest loss.
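
The merge described above can be sketched with pandas. This is a minimal illustration and not the notebook's actual code: the column names (`Country`, `Entity`) and all values are hypothetical placeholders, and the real CSV files may be keyed differently.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the two CSV files. Column names and
# values are invented for illustration only.
indicators = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Urban_population": [1_000_000, 250_000, 40_000],
})
deforestation = pd.DataFrame({
    "Entity": ["A", "B", "D"],
    "Deforestation": [50_000.0, 8_000.0, 300.0],
})

# Inner join keeps only countries present in both sources. validate= raises
# an error if either side has duplicate country rows.
merged = indicators.merge(
    deforestation,
    left_on="Country",
    right_on="Entity",
    how="inner",
    validate="one_to_one",
).drop(columns="Entity")

print(merged)
print(f"{len(merged)} countries matched")
```

An inner join silently drops countries whose names are spelled differently in the two sources, so it is worth checking the unmatched rows before modeling.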

Two modeling approaches are used: a linear regression with LASSO feature selection and a Random Forest.

The goal is not only prediction, but interpretation: understanding which structural factors are consistently associated with deforestation across countries.


Data Dictionary

| Variable | Source | Description |
| --- | --- | --- |
| Deforestation | Our World in Data | Target: hectares of forest lost per year |
| Urban_population | World Bank | People living in urban areas |
| Physicians per thousand | World Bank | Doctors per 1,000 people (proxy for institutional quality) |
| Density (P/Km²) | World Bank | Population density |
| Total tax rate | World Bank | Tax burden on businesses (% of commercial profits) |
| Labor force participation (%) | World Bank | Share of the working-age population in the labor force (employed or seeking work) |
| Gasoline Price | World Bank | Average retail price (USD/liter) |
| Latitude / Longitude | World Bank | Geographic position of country centroid |
| CPI Change (%) | World Bank | Annual inflation rate |
| Infant mortality | World Bank | Deaths per 1,000 live births |

The full data dictionary, covering the 22 relevant variables (out of the 35 in the starting data), is included in the notebook.


Methodology

  1. Data cleaning and harmonization across sources
  2. Feature engineering and skewness handling
  3. Train / validation split
  4. Baseline linear regression
  5. LASSO feature selection
  6. Random Forest modeling
  7. Model comparison on validation metrics
  8. OLS inference on selected predictors
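
Steps 2 to 7 can be sketched roughly as follows with scikit-learn. This is an illustration under stated assumptions, not the notebook's code: the data is synthetic, the log transform of the target is one common way to handle skewness (the notebook may do something different), and the split ratio and hyperparameters are arbitrary. Step 8 (OLS inference) is not covered here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 106 rows and 6 made-up features, of which only
# the first two actually drive the target.
n_rows, n_features = 106, 6
X = rng.normal(size=(n_rows, n_features))
log_y = 8.0 + 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n_rows)
y = np.expm1(log_y)  # right-skewed, loosely mimicking hectares lost per year

# Step 2 (assumed approach): log1p compresses the long right tail.
y_log = np.log1p(y)

# Step 3: hold out a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y_log, test_size=0.25, random_state=0
)

# Step 4: baseline linear regression on all features.
baseline = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)

# Step 5: LASSO with a cross-validated penalty. Features whose coefficient
# is shrunk to exactly zero are treated as dropped.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0)).fit(X_train, y_train)
selected = np.flatnonzero(lasso.named_steps["lassocv"].coef_)

# Step 6: Random Forest on the same training data.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Step 7: compare all models on the held-out validation set (log scale).
for name, model in [("baseline", baseline), ("lasso", lasso), ("forest", forest)]:
    pred = model.predict(X_val)
    print(f"{name:>8}: R2={r2_score(y_val, pred):.3f}  "
          f"MAE={mean_absolute_error(y_val, pred):.3f}")
print("features kept by LASSO:", selected.tolist())
```

Scaling the features before LASSO matters because the penalty acts on coefficient size, which depends on each feature's units.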

Results Snapshot

| Metric | Linear (LASSO) | Random Forest |
| --- | --- | --- |
| Validation R² | 0.475 | ~0.545 |
| Validation RMSE | 351,456 ha | 71,279 ha |
| Validation MAE | 114,176 ha | 38,604 ha |
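
For readers unfamiliar with the three metrics, here is a small pure-Python sketch of how each is computed. The numbers are made up and are not taken from the project's data. The example is built so that one large miss dominates the errors, which shows why RMSE can sit well above MAE.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


# Made-up values in hectares: four small misses and one very large one.
y_true = [10_000, 20_000, 30_000, 40_000, 500_000]
y_pred = [12_000, 18_000, 33_000, 37_000, 300_000]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:,.3f}")
```

Because RMSE squares each error before averaging, a single country with a very large miss raises it far more than it raises MAE.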

Key Insights


Limitations


Files

| File | Description |
| --- | --- |
| deforestation-regression.ipynb | Full Jupyter notebook: analysis, models, inference, conclusions |
| deforestation-regression.html | Static HTML export, readable without running code |
| data/world-data/world-data-2023.csv | World Bank socio-economic indicators (via Kaggle) |
| data/deforestation/annual-deforestation.csv | Annual deforestation by country (Our World in Data) |
| data/deforestation/annual-deforestation.metadata.json | Metadata for the deforestation dataset |
| dashboard.py | Streamlit visualization app with sliders for exploring predictions from the linear and Random Forest models |

Interactive Dashboard (in progress, beta)

An optional Streamlit dashboard (dashboard.py) is included as a predictive visual explorer. It loads the raw data, retrains both models, and lets you adjust sliders for each of the 9 features to see how predicted deforestation changes in real time.

To run it:

pip install streamlit plotly scikit-learn pandas numpy
streamlit run dashboard.py

It opens in your browser at localhost:8501. Models train automatically on first load (~2 seconds) and stay cached while the app is running.

Important — error range: the predictions shown carry a large margin of error. The Random Forest validation MAE is ~38,600 ha and the linear model’s is ~114,000 ha. For context, most countries in the dataset deforest between 4,000 and 76,000 ha/year — so predictions for low-deforestation countries are rough estimates, not precise figures. The dashboard is best used to explore the direction of relationships (what happens when urban population increases, or physicians per thousand drops) rather than to produce exact forecasts.
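
To make the scale of that error concrete, the short calculation below compares each model's MAE with the two ends of the typical range quoted above. It uses only the rounded figures from this README. The helper name `mae_relative_to` is just for illustration.

```python
# Rounded figures quoted in this README, in hectares per year.
VALIDATION_MAE = {"Random Forest": 38_600, "Linear (LASSO)": 114_000}
TYPICAL_RANGE = (4_000, 76_000)  # where most countries in the dataset fall


def mae_relative_to(level, mae):
    """Express a mean absolute error as a multiple of a deforestation level."""
    return mae / level


for model, mae in VALIDATION_MAE.items():
    low_ratio = mae_relative_to(TYPICAL_RANGE[0], mae)
    high_ratio = mae_relative_to(TYPICAL_RANGE[1], mae)
    print(f"{model}: MAE is {low_ratio:.2f}x the low end "
          f"and {high_ratio:.2f}x the high end of the typical range")
```

On these figures the Random Forest's average miss is several times the annual loss of a country at the low end of the range and roughly half that of a country at the high end, which is why the dashboard is better suited to reading directions than magnitudes.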

Example: entering Mexico's values from the CSV yields a prediction of approximately 52,000 hectares per year. The Streamlit layout looks like this:

[Screenshot of the Streamlit dashboard]


Tools

Python · pandas · numpy · scikit-learn · statsmodels · matplotlib · seaborn · streamlit · plotly