Survey data

5 Survey data

We have already had a glimpse of the household survey data, when looking at the GPS coordinates of the households. However, we need to take a few more steps before continuing: We need to combine the GPS data with poverty data, and then aggregate it to the admin 4 level for estimating the SAE model.

5.1 Getting the survey data ready

The first step is to load both the household GPS data and the poverty data,¹ remove households with missing coordinates or that have coordinates of (0, 0), and then join the two household datasets together.

# load both datasets
df <- read_dta("data/ihshousehold/householdgeovariables_ihs5.dta")
pov <- read_dta("data/ihshousehold/ihs5_consumption_aggregate.dta")
# remove missing or 0 coordinates from df
df <- df |>
  filter(!is.na(ea_lon_mod), ea_lon_mod!=0)
# just keep the things we want
df <- df |>
  select(case_id, ea_lon_mod, ea_lat_mod)
pov <- pov |>
  select(case_id, hhsize, hh_wgt, poor)
# now join pov to df
df <- df |>
  left_join(pov, by = "case_id")
head(df)

# A tibble: 6 × 6
  case_id      ea_lon_mod ea_lat_mod hhsize hh_wgt  poor
  <chr>             <dbl>      <dbl>  <dbl>  <dbl> <dbl>
1 101011000014       33.2      -9.70      4   93.7     1
2 101011000023       33.2      -9.70      4   93.7     0
3 101011000040       33.2      -9.70      4   93.7     1
4 101011000071       33.2      -9.70      5   93.7     0
5 101011000095       33.2      -9.70      5   93.7     0
6 101011000115       33.2      -9.70      2   93.7     0

We have used left_join() from tidyverse to join the two datasets together. We now need to turn our new object into a spatial object, which we can do using the terra package (as we did in Section 3.3). We will then extract the information from the mw4 shapefile.

# turn into spatial object
df <- vect(df, geom = c("ea_lon_mod", "ea_lat_mod"), crs = "EPSG:4326")
# load mw4
mw4 <- vect("data/mw4.shp")
# make sure they are in the same CRS
mw4 <- project(mw4, crs(df))
# extract information
extracted <- extract(mw4, df)
# add to df, except for first column
df <- cbind(df, extracted[,-1])
head(df)

       case_id hhsize  hh_wgt poor DIST_CODE  EA_CODE TA_CODE
1 101011000014      4 93.7194    1       101 10101006   10101
2 101011000023      4 93.7194    0       101 10101006   10101
3 101011000040      4 93.7194    1       101 10101006   10101
4 101011000071      5 93.7194    0       101 10101006   10101
5 101011000095      5 93.7194    0       101 10101006   10101
6 101011000115      2 93.7194    0       101 10101006   10101

We now have our household data, which includes the household size, household weights, poverty indicator, TA code (admin 3 code), and EA code (admin 4 code). Now let’s aggregate to the admin 4 level:

df <- as_tibble(df) |> # this takes df out of a "spatial" object
  group_by(EA_CODE, TA_CODE) |> # this is the admin 4 and admin 3 identifier
  # summarize will aggregate up to the EA/TA (so the EA)
  summarize(poor = weighted.mean(poor, hhsize*hh_wgt),
    total_weights = sum(hhsize*hh_wgt)) |>
  ungroup()
head(df)

# A tibble: 6 × 4
  EA_CODE  TA_CODE   poor total_weights
  <chr>    <chr>    <dbl>         <dbl>
1 10101006 10101   0.243          6560.
2 10101011 10101   0.468          8953.
3 10101027 10101   0.0959        10582.
4 10101033 10101   0.384          8571.
5 10101039 10101   0.591         10612.
6 10101054 10101   0.494          6211.

Our new df object is now at the admin 4 level and has mean poverty rates, total household weights, and identifiers for both the admin 3 and admin 4 levels. This will serve as the “sample” for our small area model.

What do we need? We need a dataset with:

Admin identifier
Outcome of interest (e.g. expenditures)

We can then merge this with geospatial data, at the admin 4 level in this case, to estimate an SAE model.

Footnotes

The raw survey data also includes information on expenditures, which the Malawian NSO uses to calculate poverty indicators for each household.↩︎