A regression-based analysis examining which country-level socio-economic indicators are associated with annual deforestation.
Each year, millions of hectares of forest are lost worldwide. While deforestation is often linked to agriculture or development, the broader socio-economic patterns behind it are less obvious. This project merges two public datasets (World Bank indicators + Our World in Data deforestation records) across 106 countries to model and predict annual forest loss.
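The merge step can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the column names (`Country`, `Entity`, `Urban_population`, `Deforestation`) and the tiny inline frames are hypothetical stand-ins for the two CSV files, and every value is made up.

```python
import pandas as pd

# Toy stand-ins for the two sources; all values are made up for illustration.
indicators = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Urban_population": [1_000_000, 2_500_000, 400_000],
})
deforestation = pd.DataFrame({
    "Entity": ["A", "B", "D"],  # assumed name of the country column
    "Deforestation": [12_000.0, 55_000.0, 8_000.0],
})

# An inner join keeps only countries that appear in both sources.
merged = indicators.merge(
    deforestation, left_on="Country", right_on="Entity", how="inner"
).drop(columns="Entity")

print(merged)
```

With an inner join like this, any country missing from either file drops out of the merged sample, so the final country count can be smaller than either source.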
Two modeling approaches are used: a LASSO-regularized linear regression and a Random Forest.
The goal is not only prediction, but interpretation: understanding which structural factors are consistently associated with deforestation across countries.
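One way to read "interpretation" from a sparse linear model and a tree ensemble is to compare fitted coefficients with feature importances. The sketch below runs on synthetic data and assumes scikit-learn's `LassoCV` and `RandomForestRegressor`. The hyperparameters, the scaling step, and the data are illustrative choices, not the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends on features 0 and 1 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# LASSO on standardized features; the penalty strength is chosen by 5-fold CV.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
lasso.fit(X, y)
coefs = lasso.named_steps["lassocv"].coef_

# Random Forest fitted on the same data; importances sum to 1.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_

print("LASSO coefficients:", np.round(coefs, 2))
print("Forest importances:", np.round(importances, 2))
```

LASSO coefficients carry a sign and can shrink to exactly zero, while forest importances are non-negative and unsigned, so the two views complement each other and neither is a causal estimate.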
| Variable | Source | Description |
|---|---|---|
| Deforestation | Our World in Data | Target: hectares of forest lost per year |
| Urban_population | World Bank | People living in urban areas |
| Physicians per thousand | World Bank | Doctors per 1,000 people (proxy for institutional quality) |
| Density (P/Km²) | World Bank | Population density |
| Total tax rate | World Bank | Tax burden on businesses (% of commercial profits) |
| Labor force participation (%) | World Bank | Share of working-age population employed |
| Gasoline Price | World Bank | Average retail price USD/liter |
| Latitude / Longitude | World Bank | Geographic position of country centroid |
| CPI Change (%) | World Bank | Annual inflation rate |
| Infant mortality | World Bank | Deaths per 1,000 live births |
The full data dictionary, covering the 35 starting variables and the 22 retained as relevant, is included in the notebook.
| Metric | Linear (LASSO) | Random Forest |
|---|---|---|
| Validation R² | 0.475 | ~0.545 |
| Validation RMSE | 351,456 ha | 71,279 ha |
| Validation MAE | 114,176 ha | 38,604 ha |
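For reference, the three metrics in the table can be computed as in the sketch below. It uses the standard definitions of R², RMSE, and MAE with NumPy only. The function name `regression_metrics` and the sample values are hypothetical and unrelated to the project's data.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE using their standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }

# Made-up values, only to exercise the function.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE is never smaller than MAE, because squaring weights large errors more heavily. A wide gap between the two, as in the linear column, is consistent with a few very large misses dominating the error.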
| File | Description |
|---|---|
| deforestation-regression.ipynb | Full Jupyter notebook: analysis, models, inference, conclusions |
| deforestation-regression.html | Static HTML export, readable without running code |
| data/world-data/world-data-2023.csv | World Bank socio-economic indicators (via Kaggle) |
| data/deforestation/annual-deforestation.csv | Annual deforestation by country (Our World in Data) |
| data/deforestation/annual-deforestation.metadata.json | Metadata for the deforestation dataset |
| dashboard.py | Streamlit visualization app with sliders for exploring predictions from the linear and Random Forest models |
An optional Streamlit dashboard (dashboard.py) is included as a predictive visual explorer. It loads the raw data, retrains both models, and lets you adjust sliders for each of the 9 features to see how predicted deforestation changes in real time.
To run it:
pip install streamlit plotly scikit-learn pandas numpy
streamlit run dashboard.py
It opens in your browser at localhost:8501. Models train automatically on first load (~2 seconds) and stay cached while the app is running.
Important — error range: the predictions shown carry a large margin of error. The Random Forest validation MAE is ~38,600 ha and the linear model’s is ~114,000 ha. For context, most countries in the dataset deforest between 4,000 and 76,000 ha/year — so predictions for low-deforestation countries are rough estimates, not precise figures. The dashboard is best used to explore the direction of relationships (what happens when urban population increases, or physicians per thousand drops) rather than to produce exact forecasts.
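The arithmetic behind this caveat can be made explicit. The sketch below takes the approximate MAE figures and the typical range quoted above and expresses each MAE as a multiple of a country's actual deforestation. The helper `relative_error` is a hypothetical name introduced only for this illustration.

```python
# Approximate figures quoted in the paragraph above (hectares per year).
rf_mae = 38_600
linear_mae = 114_000
typical_low, typical_high = 4_000, 76_000

def relative_error(mae, actual):
    """Average absolute error expressed as a multiple of the actual value."""
    return mae / actual

for label, mae in [("Random Forest", rf_mae), ("Linear", linear_mae)]:
    low = relative_error(mae, typical_low)
    high = relative_error(mae, typical_high)
    print(f"{label}: {low:.2f}x at {typical_low:,} ha, {high:.2f}x at {typical_high:,} ha")
```

At the low end of the range, the Random Forest's average error is roughly 9.65 times the quantity being predicted. At the high end, it is still about half of it.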
Example: using Mexico's values from the CSV, the dashboard predicted approximately 52,000 hectares per year. Design of the Streamlit visual:
Python · pandas · numpy · scikit-learn · statsmodels · matplotlib · seaborn · streamlit · plotly