We have already had a glimpse of the household survey data when we looked at the GPS coordinates of the households. Before continuing, however, we need to take a few more steps: combine the GPS data with the poverty data, and then aggregate the result to the admin 4 level, which is the level at which we will estimate the SAE model.
5.1 Getting the survey data ready
The first step is to load both the household GPS data and the poverty data,[1] remove households whose coordinates are missing or equal to (0, 0), and then join the two household datasets together. Note that the filter in the code checks only the longitude column, ea_lon_mod.
# load the packages used here (if not already loaded)
library(tidyverse)
library(haven) # provides read_dta()

# load both datasets
df <- read_dta("data/ihshousehold/householdgeovariables_ihs5.dta")
pov <- read_dta("data/ihshousehold/ihs5_consumption_aggregate.dta")

# remove missing or 0 coordinates from df
df <- df |>
  filter(!is.na(ea_lon_mod), ea_lon_mod != 0)

# just keep the things we want
df <- df |>
  select(case_id, ea_lon_mod, ea_lat_mod)
pov <- pov |>
  select(case_id, hhsize, hh_wgt, poor)

# now join pov to df
df <- df |>
  left_join(pov, by = "case_id")
head(df)
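A left join keeps every row of the left-hand table and fills the joined columns with NA where no match exists, so it can be worth checking the key before relying on the result. The sketch below uses two small made-up tables (not the IHS5 data; all values are invented) to show one possible way to confirm that case_id is unique and to count households without a poverty record.

```r
library(dplyr)

# toy stand-ins for the GPS and consumption data (values are made up)
gps <- tibble(case_id = c("a", "b", "c"),
              ea_lon_mod = c(34.1, 34.2, 35.0),
              ea_lat_mod = c(-13.9, -14.0, -15.1))
cons <- tibble(case_id = c("a", "b"),
               hhsize = c(5, 3),
               hh_wgt = c(100, 120),
               poor = c(1, 0))

# the key should be unique in both tables, otherwise the join duplicates rows
n_dup <- sum(duplicated(gps$case_id)) + sum(duplicated(cons$case_id))

# left_join keeps all rows of gps; unmatched households get NA
joined <- gps |>
  left_join(cons, by = "case_id")

# households in gps that have no poverty record
unmatched <- gps |>
  anti_join(cons, by = "case_id")

n_dup
nrow(unmatched)
```

In this toy case household "c" has coordinates but no consumption record, so its poverty variables are NA after the join.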
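We used left_join() from dplyr (part of the tidyverse) to join the two datasets; it keeps every household in df and adds the poverty variables wherever case_id matches. We now need to turn our new object into a spatial object, which we can do using the terra package (as we did in Section 3.3). We will then use extract() to look up, for each household point, the attributes of the mw4 polygon it falls in.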
# load terra (if not already loaded)
library(terra)

# turn into spatial object
df <- vect(df, geom = c("ea_lon_mod", "ea_lat_mod"), crs = "EPSG:4326")

# load mw4
mw4 <- vect("data/mw4.shp")

# make sure they are in the same CRS
mw4 <- project(mw4, crs(df))

# extract information
extracted <- extract(mw4, df)

# add to df, except for first column
df <- cbind(df, extracted[, -1])
head(df)
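To see what the point-in-polygon lookup does, here is a minimal sketch on toy geometries, assuming a recent version of terra. The two squares and three points are invented for illustration and are not real enumeration areas or households.

```r
library(terra)

# two adjacent toy squares standing in for admin 4 polygons (not real EAs)
polys <- vect(c("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
                "POLYGON ((1 0, 2 0, 2 1, 1 1, 1 0))"))
crs(polys) <- "EPSG:4326"
polys$EA_CODE <- c("EA1", "EA2")

# three toy households; the last one lies outside both polygons
hh <- data.frame(case_id = c("h1", "h2", "h3"),
                 lon = c(0.5, 1.5, 5),
                 lat = c(0.5, 0.5, 5))
pts <- vect(hh, geom = c("lon", "lat"), crs = "EPSG:4326")

# the first column indexes the points; the rest are polygon attributes
extracted <- extract(polys, pts)

# households that did not land in any polygon
n_unmatched <- sum(is.na(extracted$EA_CODE))
n_unmatched
```

Counting the NA values this way is one option for spotting households whose coordinates fall outside the shapefile before aggregating.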
We now have our household data, which includes the household size, household weights, poverty indicator, TA code (admin 3 code), and EA code (admin 4 code). Now let’s aggregate to the admin 4 level:
df <- as_tibble(df) |> # this takes df out of a "spatial" object
  group_by(EA_CODE, TA_CODE) |> # this is the admin 4 and admin 3 identifier
  # summarize will aggregate up to the EA/TA (so the EA)
  summarize(poor = weighted.mean(poor, hhsize*hh_wgt),
            total_weights = sum(hhsize*hh_wgt)) |>
  ungroup()
head(df)
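Our new df object is now at the admin 4 level. It contains the weighted poverty rate (the share of poor people, because each household is weighted by household size times its survey weight), the sum of those weights, and identifiers for both the admin 3 and admin 4 levels. This will serve as the “sample” for our small area model.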
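The effect of weighting by hhsize*hh_wgt can be seen in a small worked example. The sketch below uses four invented households in two EAs and compares the unweighted mean of the poverty indicator with the weighted one.

```r
library(dplyr)

# four toy households in two EAs (all values are made up)
hh <- tibble(EA_CODE = c("EA1", "EA1", "EA2", "EA2"),
             poor = c(1, 0, 0, 1),
             hhsize = c(5, 3, 4, 2),
             hh_wgt = c(100, 100, 200, 100))

ea <- hh |>
  group_by(EA_CODE) |>
  summarize(poor_unweighted = mean(poor),
            poor = weighted.mean(poor, hhsize * hh_wgt),
            total_weights = sum(hhsize * hh_wgt)) |>
  ungroup()
ea
# EA1: weights 500 and 300, so poor = 500 / 800 = 0.625
# EA2: weights 800 and 200, so poor = 200 / 1000 = 0.2
```

Both EAs have an unweighted rate of 0.5, but the weighted rates differ because larger households and households with higher survey weights represent more people.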
What do we need? We need a dataset with:
- Admin identifier (here, EA_CODE)
- Outcome of interest (e.g. expenditures; here, the poverty rate)
We can then merge this with geospatial data, at the admin 4 level in this case, to estimate an SAE model.
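As a rough illustration of that merge, the sketch below joins a toy EA-level sample to a toy table of geospatial covariates. The covariate name ndvi and all values are hypothetical; the actual geospatial variables are not specified here.

```r
library(dplyr)

# EA-level sample with the same columns as the aggregated df (values made up)
ea_sample <- tibble(EA_CODE = c("EA1", "EA2"),
                    TA_CODE = c("TA1", "TA1"),
                    poor = c(0.625, 0.2),
                    total_weights = c(800, 1000))

# hypothetical geospatial covariates for every EA, sampled or not
ea_geo <- tibble(EA_CODE = c("EA1", "EA2", "EA3"),
                 TA_CODE = c("TA1", "TA1", "TA2"),
                 ndvi = c(0.41, 0.55, 0.48)) # hypothetical covariate

# keep every EA; the outcome is NA where the survey has no households
sae_data <- ea_geo |>
  left_join(ea_sample, by = c("EA_CODE", "TA_CODE"))

# EAs with an observed outcome versus EAs that only have covariates
in_sample <- filter(sae_data, !is.na(poor))
out_of_sample <- filter(sae_data, is.na(poor))
```

Starting from the covariate table keeps the EAs that the survey did not visit, which are the areas a small area model would later predict for.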
Footnotes
[1] The raw survey data also includes information on expenditures, which the Malawian NSO uses to calculate poverty indicators for each household.