Health Indices and Geospatial Data

September 1, 2023

Maternal and Child Health Prediction Using Satellite Imagery

Project Overview

This data science project explores the use of machine learning with satellite imagery and publicly available geotagged data to estimate key maternal and child health (MCH) indicators in low- and middle-income countries (LMICs). The research addresses the challenge of limited, expensive, and outdated MCH data by leveraging satellite-derived features as predictors.

Problem Statement

Traditional MCH indicator collection faces several limitations:

- Death registers are unreliable in most LMICs
- Demographic and Health Surveys (DHS) are nationally representative but expensive, and they sample fewer than 2% of villages/neighborhoods
- Survey estimates are often outdated because collection cycles are infrequent
- Low spatial and temporal resolution limits granular insights

Data Sources

Ground Truth Data

USAID Demographic and Health Surveys (DHS): 114,113 surveys collected between 1986 and 2019, with up to 7 indicators per survey
Target MCH Indicators:
- Mean/Median BMI
- Under Five Mortality Rate
- Unmet Need Rate
- Skilled Birth Attendant Rate
- Stunted Growth Rate

Feature Datasets

1. Google Earth Engine Dataset

120,985 data points × 11,945 features
Satellite-derived geotagged variables collected through the Google Earth Engine API

2. MOSAIKS Dataset

100,960 data points × 4,000 features
Spatial encodings collected through the API of MOSAIKS, a UC Berkeley project

3. Landsat Image Dataset

Raw satellite imagery from NASA and U.S. Geological Survey (USGS)
75,000 examples × 11 features
8 spectral channels (bands) + 3 scalar features
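
The Results section reports a combined GEE + MOSAIKS dataset, and the two feature tables have different row counts (120,985 vs. 100,960), so some form of alignment is needed before the feature vectors can be concatenated. The write-up does not say how this was done. The sketch below assumes an inner join on a shared location key; the key values and the `combine_features` helper are hypothetical and shown only for illustration.

```python
from typing import Dict, List

# Hypothetical layout: location key -> feature vector.
FeatureTable = Dict[str, List[float]]


def combine_features(gee: FeatureTable, mosaiks: FeatureTable) -> FeatureTable:
    """Inner-join two feature tables on their keys and concatenate the vectors."""
    shared_keys = sorted(gee.keys() & mosaiks.keys())
    return {key: gee[key] + mosaiks[key] for key in shared_keys}


# Toy tables standing in for the real 11,945- and 4,000-feature rows.
gee = {"A": [0.1, 0.2, 0.3], "B": [0.4, 0.5, 0.6], "C": [0.7, 0.8, 0.9]}
mosaiks = {"A": [1.0, 2.0], "C": [3.0, 4.0], "D": [5.0, 6.0]}

combined = combine_features(gee, mosaiks)
print(combined)  # only keys present in both tables survive the join
```

An inner join drops locations missing from either source, which keeps every row complete at the cost of a smaller training set.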

Methodology

Extracted Feature Analysis

Regression Models:

Microsoft Azure AutoML for automated model selection and training
Stack and Voting Ensembles combining multiple algorithms:
- XGBoostRegressor
- LightGBM
- ExtremeRandomTrees
- RandomForest
- ElasticNet
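
To make the ensemble idea concrete, here is a minimal sketch of a voting ensemble for regression: the final prediction is a (optionally weighted) average of the base models' predictions. This is not the Azure AutoML implementation; the `voting_predict` helper and the stand-in models are assumptions for illustration. A stack ensemble differs in that it trains a meta-model on the base predictions instead of averaging them.

```python
from typing import Callable, Optional, Sequence

# A fitted regressor is modeled here as any callable from features to a prediction.
Model = Callable[[Sequence[float]], float]


def voting_predict(
    models: Sequence[Model],
    x: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Return the weighted average of the base models' predictions for x."""
    if weights is None:
        weights = [1.0] * len(models)
    if len(weights) != len(models):
        raise ValueError("weights and models must have the same length")
    total = sum(weights)
    return sum(w * m(x) for w, m in zip(weights, models)) / total


# Stand-ins for fitted base models (e.g. a tree ensemble and a linear model).
models = [lambda x: 0.5, lambda x: 0.75, lambda x: 0.25]

equal_vote = voting_predict(models, x=[0.0])
weighted_vote = voting_predict(models, x=[0.0], weights=[1.0, 2.0, 1.0])
print(equal_vote, weighted_vote)
```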

Classification Models:

K-means discretization (k=5 clusters) of continuous health indicators
AutoML with Deep Learning using Stack Ensemble:
- LightGBM
- XGBoostClassifier
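
The discretization step can be sketched as one-dimensional k-means: cluster the continuous indicator values into k=5 groups, then label each value with the index of its nearest cluster center. The code below is a plain-Python illustration under assumptions of our own (deterministic quantile initialization, toy data); the project's actual initialization and tooling are not described in this write-up.

```python
from typing import List, Sequence


def kmeans_1d(values: Sequence[float], k: int = 5, max_iter: int = 100) -> List[float]:
    """Run Lloyd's algorithm on 1-D data and return k sorted cluster centers."""
    ordered = sorted(values)
    n = len(ordered)
    if n < k:
        raise ValueError("need at least k values")
    # Deterministic initialization: the middle element of each of k equal slices.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        groups: List[List[float]] = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous center.
        updated = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def discretize(value: float, centers: Sequence[float]) -> int:
    """Map a continuous value to the index of its nearest cluster center."""
    return min(range(len(centers)), key=lambda j: abs(value - centers[j]))


# Toy indicator values (e.g. a rate between 0 and 1), not project data.
rates = [0.08, 0.10, 0.12, 0.28, 0.30, 0.32, 0.48, 0.50, 0.52,
         0.68, 0.70, 0.72, 0.88, 0.90, 0.92]
centers = kmeans_1d(rates, k=5)
labels = [discretize(r, centers) for r in rates]
print(centers, labels)
```

Because k-means places boundaries wherever the data clusters, the resulting classes need not be equally populated, which matters when interpreting accuracy.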

Raw Image Analysis

Custom machine learning models for Landsat imagery
Convolutional Neural Networks (CNN): Base model architecture
Vision Transformers (ViT): Advanced transformer-based approach

Results

Regression Performance (R² Scores)

Health Indicator	GEE Dataset	MOSAIKS Dataset	Combined Dataset
Median BMI	0.623	0.450	0.627
Unmet Need Rate	0.561	0.280	0.566
Under Five Mortality Rate	0.503	0.251	0.468
Skilled Birth Attendant Rate	0.657	0.310	0.668
Stunted Growth Rate	0.324	0.194	0.335
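
For readers unfamiliar with the metric, the sketch below computes R² under the usual coefficient-of-determination definition, 1 - SS_res / SS_tot. We assume this is the definition behind the table; the write-up does not state it explicitly. A score of 1.0 means perfect prediction, and 0.0 means the model does no better than always predicting the mean.

```python
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


# Toy values, not project data.
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(y_true, y_true)
mean_only = r2_score(y_true, [2.5, 2.5, 2.5, 2.5])
partial = r2_score(y_true, [1.5, 2.0, 3.0, 3.5])
print(perfect, mean_only, partial)
```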

Classification Accuracy

Health Indicator	GEE Dataset	MOSAIKS Dataset
Median BMI	52%	53%
Unmet Need Rate	50%	48%
Under Five Mortality Rate	48%	45%
Skilled Birth Attendant Rate	68%	55%
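
When reading these accuracies, it helps to compare them against a trivial baseline. With five classes, uniform guessing would score about 20% only if the classes were balanced; k-means bins need not be, so the majority-class rate is a fairer floor. The class balance is not reported here, so the labels below are toy data, and the helper names are our own.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true class label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true: Sequence[int]) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Toy labels over five classes (0-4), not project data.
y_true = [0, 0, 0, 0, 1, 1, 2, 3, 4, 4]
y_pred = [0, 0, 0, 1, 1, 1, 2, 3, 4, 0]

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline(y_true)
print(model_acc, baseline_acc)
```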

Raw Image vs. Extracted Features Comparison

Skilled Birth Attendant Rate:
- Raw Image Base Model: R² = 0.583
- Vision Transformer Model: R² = 0.414
- Extracted Feature Model: R² = 0.668

Key Findings

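Dataset Combination Benefits: Combining the Google Earth Engine and MOSAIKS datasets improved R² over either dataset alone for four of the five indicators; Under Five Mortality Rate was the exception (0.468 combined vs. 0.503 with GEE alone)
Indicator-Specific Performance: Skilled Birth Attendant Rate and Median BMI were the most predictable from satellite data (combined R² of 0.668 and 0.627), while Stunted Growth Rate was the least predictable (0.335)
Classification Limitations: Reformulating the problem as classification was a disadvantage unless the discretization is chosen carefully; accuracy over the five classes ranged from 45% to 68%
Feature Engineering Advantage: Extracted features outperformed raw image analysis on Skilled Birth Attendant Rate (R² of 0.668 vs. 0.583 for the CNN base model and 0.414 for the Vision Transformer), suggesting the importance of domain-specific feature engineering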
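
The effect of combining datasets can be checked directly from the regression table. The short script below takes the reported R² scores and computes, for each indicator, the combined score minus the best single-dataset score. The numbers are copied from the table above; the `combined_gain` helper is our own.

```python
from typing import Dict

# R² scores copied from the regression results table.
r2: Dict[str, Dict[str, float]] = {
    "Median BMI": {"GEE": 0.623, "MOSAIKS": 0.450, "Combined": 0.627},
    "Unmet Need Rate": {"GEE": 0.561, "MOSAIKS": 0.280, "Combined": 0.566},
    "Under Five Mortality Rate": {"GEE": 0.503, "MOSAIKS": 0.251, "Combined": 0.468},
    "Skilled Birth Attendant Rate": {"GEE": 0.657, "MOSAIKS": 0.310, "Combined": 0.668},
    "Stunted Growth Rate": {"GEE": 0.324, "MOSAIKS": 0.194, "Combined": 0.335},
}


def combined_gain(scores: Dict[str, float]) -> float:
    """Combined R² minus the better of the two single-dataset R² scores."""
    return round(scores["Combined"] - max(scores["GEE"], scores["MOSAIKS"]), 3)


gains = {name: combined_gain(scores) for name, scores in r2.items()}
improved = [name for name, gain in gains.items() if gain > 0]

for name, gain in gains.items():
    print(f"{name}: {gain:+.3f}")
```

The gains are small in absolute terms (about 0.01 or less), and Under Five Mortality Rate is the one indicator where the combined score is lower.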

Technical Validation

As a sanity check, the model was tested on urban/rural location classification, which indicated that it captures meaningful spatial patterns from satellite imagery.

Future Directions

Data Expansion: Incorporate Living Standards Measurement Study (LSMS) surveys for additional ground truth
Feature Engineering: Improve featurization and address missing data sources
Model Architecture: Explore deeper networks and time-variation modeling
Temporal Analysis: Consider temporal changes in satellite imagery for improved predictions

Impact and Applications

This research demonstrates the potential for satellite imagery to augment traditional survey data for MCH monitoring, offering:

Cost-effective large-scale health monitoring
Higher spatial resolution insights
More timely indicator updates
Support for targeted public health interventions in underserved areas

The project establishes a foundation for using remote sensing technology to address critical gaps in global health data collection, particularly benefiting resource-constrained settings where traditional survey methods are challenging to implement.

I was funded to work on this project alongside Emily Wiesel and Mac Ya through the Stanford Data Science for Social Good Fellowship. Our mentor was postdoctoral fellow Haojie Wang.