
6 Creating and selecting features

With all of the geospatial features in hand, we now need to think about which features to use in our model and whether to combine the raw features into new indicators. Choosing features is particularly important with geospatial data because of the sheer number of features we can create. We have already created a number of features, but we can also create new features, transform existing features, and use lasso to select among them. We will discuss each of these in turn. Throughout this section, we use the final, cleaned version of the features in the data/geovarseas.csv file.

6.1 Creating new features (e.g. admin means)

We will be estimating models at the admin-4 level. However, this does not mean that all of our predictors must necessarily be at the admin-4 level. For example, we can create means or standard deviations at higher levels (e.g. admin-3). We can also create lagged means or standard deviations at the admin-4 level, as well as interactions between different features.

As a simple example, consider creating total population at the admin-3 level, in addition to the admin-4 level. Let’s first load the features into R and do a bit more cleaning:

# load features
features <- read_csv("data/geovarseas.csv")
# the population column is mwpop
head(features)
# A tibble: 6 × 526
   EA_CODE TA_CODE mwpop average_masked barecoverfraction urbancoverfraction
     <dbl>   <dbl> <dbl>          <dbl>             <dbl>              <dbl>
1 10507801   10507  700.          0.585               407                769
2 10507072   10507  940.          0                   904                114
3 10507010   10507 1077.          0                  1525               3101
4 10507001   10507  610.          0                   673               1919
5 10507009   10507  558.          0                  1383                957
6 10507033   10507  594.          0                   738               1802
# ℹ 520 more variables: cropscoverfraction <dbl>, grasscoverfraction <dbl>,
#   mosscoverfraction <dbl>, waterpermanentcoverfraction <dbl>,
#   waterseasonalcoverfraction <dbl>, shrubcoverfraction <dbl>,
#   snowcoverfraction <dbl>, treecoverfraction <dbl>, ndvi1 <dbl>, ndvi2 <dbl>,
#   ndvi3 <dbl>, ndvi4 <dbl>, ndvi5 <dbl>, ndvi6 <dbl>, ndvi7 <dbl>,
#   ndvi8 <dbl>, ndvi9 <dbl>, ndvi10 <dbl>, ndvi11 <dbl>, ndvi12 <dbl>,
#   mosaik1 <dbl>, mosaik2 <dbl>, mosaik3 <dbl>, mosaik4 <dbl>, …

Let’s now create one more variable, which is total population at the admin-3 level:

# create TOTAL pop at admin3 level:
features <- features |>
  group_by(TA_CODE) |>
  mutate(popTA = sum(mwpop, na.rm = TRUE)) |>
  ungroup()

Here, we take advantage of the tidyverse’s group_by() and mutate() functions to aggregate total population up to the admin-3 level, while keeping the dataset itself at the admin-4 level.
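To see why group_by() + mutate() (rather than summarise()) is the right tool here, consider a minimal sketch with made-up data (the TA_CODE, EA_CODE, and pop values below are invented for illustration):

```r
library(dplyr)

# toy data: two admin-3 areas (TA), each containing two admin-4 units (EA)
toy <- tibble(
  TA_CODE = c(1, 1, 2, 2),
  EA_CODE = c(11, 12, 21, 22),
  pop     = c(100, 200, 50, 150)
)

# mutate() after group_by() attaches the group total to every row,
# so the data stays at the EA level (summarise() would collapse it to TA level)
toy <- toy |>
  group_by(TA_CODE) |>
  mutate(popTA = sum(pop, na.rm = TRUE)) |>
  ungroup()

toy$popTA  # 300 300 200 200
```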

As another example, let’s consider NDVI, which is a measure of vegetation (“greenness”). In countries like Malawi, NDVI can be highly predictive of poverty due to its correlation with the agricultural harvest. However, at the same time, the exact timing of NDVI measures can be very important for capturing this relationship, due to the seasonality inherent in rain-fed agricultural production. In practice, we often pull historical NDVI and then include things like long-term mean, long-term max, long-term min, and even long-term standard deviation, in addition to current values.

In our simple example, we downloaded 12 separate NDVI files1 and we can use these to calculate things like min/mean/sd of NDVI throughout the year. In the features object, the NDVI columns are named ndviM, where M is an integer from 1 to 12, indicating the month of the year.

Let’s create some new values, such as the annual mean, max, min, and standard deviation:

# Let's first find the columns we want!
ndvicols <- grep("ndvi", names(features))
features$ndvimean <- apply(features[,ndvicols], 1, mean, na.rm = TRUE)
features$ndvistd <- apply(features[,ndvicols], 1, sd, na.rm = TRUE)      
features$ndvimax <- apply(features[,ndvicols], 1, max, na.rm = TRUE)      
features$ndvimin <- apply(features[,ndvicols], 1, min, na.rm = TRUE)   
head(features[,c("ndvimean", "ndvistd", "ndvimax", "ndvimin")])
# A tibble: 6 × 4
  ndvimean ndvistd ndvimax ndvimin
     <dbl>   <dbl>   <dbl>   <dbl>
1    9066.   2787.   13106    5681
2   23035.   8614.   36599   13413
3   24912.   8605.   37987   15263
4   31705    8246.   42600   20608
5   23144.   7180.   33499   14993
6   13229.   3927.   19218    8390

Here, we take advantage of a new function we have not used before: apply(). The apply() function is very useful: in this case, we “apply” a single function across each row of the NDVI columns in features. Let’s go through each line in the above code:

  1. This line does something slightly more advanced. We want to find the positions of the columns whose names contain the string “ndvi”. We can do this using grep(), which returns the indices of the matching columns. We can then use these indices to select the columns we want.
  2. ndvimean: We take the mean of the NDVI values for each row, which is what the 1 represents in the apply() call (if we wanted to apply the function down columns, we would use a 2 instead), excluding missing values. The next three lines do the same thing, just with a different function in place of mean().
  3. Here we look at the first few rows of features, but only for the new columns we just created.
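As an aside, for the mean specifically, base R’s rowMeans() is an equivalent (and typically faster) alternative to apply() with MARGIN = 1; a quick sketch on a made-up matrix:

```r
# a small made-up matrix standing in for the NDVI columns
m <- matrix(c(1, 2, 3,
              4, NA, 6), nrow = 2, byrow = TRUE)

# apply() across rows (MARGIN = 1) and rowMeans() agree
a <- apply(m, 1, mean, na.rm = TRUE)
b <- rowMeans(m, na.rm = TRUE)
all.equal(a, b)  # TRUE
```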

6.2 Thinking about transformations

In addition to creating new variables through aggregation to a higher level of geography or combining data over time, we can also transform both our (potential) predictor variables and our outcome variable, through, for example, log transformations. Why might we want to transform our variables? There are two primary reasons:

  1. Improve the predictive ability of the model. For example, a log transformation can sometimes make the relationship between a predictor and the outcome more linear, which is particularly important for linear models (which are the workhorse of SAE).
  2. Improve the properties of the residuals. In SAE, we often use a parametric bootstrap for point estimation and inference. In our case, we make two key assumptions: the residual is normally distributed and the random effects are normally distributed. Transforming the outcome variable, in particular, can sometimes make the residuals more normal, which can improve the properties of the bootstrap.

Let’s start with the outcome: poverty rates at the admin-4 level. Before diving into this, it is important to clarify that the assumptions we make in our SAE model are not about the distribution of the outcome itself, but rather about the residuals and random effects. Nonetheless, outcomes that are more “normally” distributed do tend to have better properties in terms of estimation.

Figure 1: Poverty rates in Northern Malawi

In the above figure, we show the density of poverty rates in Northern Malawi. The left panel shows the untransformed poverty rates, while the right panel shows the poverty rates after an arcsin (square root) transformation.2 We have found that this transformation performs particularly well when the outcome is a proportion, as it is in this case (it varies between 0 and 1). It is quite clear from the figure that the transformed outcome is more “normal” than the untransformed outcome. However, since we are really interested in the residuals, we discuss model diagnostics more below, where we also look at statistics for skewness and kurtosis of the residuals and random effects.
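For concreteness, the arcsin square root transformation and its inverse are one-liners in R; a minimal sketch (the function names here are our own):

```r
# arcsin square root transform for a proportion y in [0, 1]
asin_sqrt <- function(y) asin(sqrt(y))
# and its inverse, mapping back to the proportion scale
inv_asin_sqrt <- function(z) sin(z)^2

y <- c(0, 0.25, 0.5, 1)
z <- asin_sqrt(y)
# the round trip recovers the original proportions
all.equal(inv_asin_sqrt(z), y)  # TRUE
```

Note that the transformed values live on \([0, \pi/2]\), which is why predictions must be back-transformed before reporting poverty rates.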

Since we are ultimately interested in the residuals, let’s look at how a transformation might affect the predictive power of the model. To do this, we will use the feols() function from the fixest package.3

Before we do this, however, we need to recode some of the key predictors: land cover classifications. Right now, the land cover classification values are not true proportions! They are counts of pixels of different land classifications within each EA. We need to turn these into proportions. We can do this by dividing by the total number of pixels in each EA. Let’s do this:

# find columns we want
landcols <- grep("coverfraction", names(features))
# how many total pixels?
features$totalpixels <- apply(features[,landcols], 1, sum, na.rm = TRUE)
# go through each one and replace with proportion:
for (i in landcols){
  features[,i] <- features[,i]/features$totalpixels
}
# remove total pixels
features <- features |>
  select(-totalpixels)
summary(features[,landcols])
 barecoverfraction  urbancoverfraction cropscoverfraction grasscoverfraction
 Min.   :0.000000   Min.   :0.000000   Min.   :0.0000     Min.   :0.0000    
 1st Qu.:0.005634   1st Qu.:0.009176   1st Qu.:0.2756     1st Qu.:0.2449    
 Median :0.012542   Median :0.021913   Median :0.3707     Median :0.2660    
 Mean   :0.014802   Mean   :0.103282   Mean   :0.3428     Mean   :0.2533    
 3rd Qu.:0.021850   3rd Qu.:0.051555   3rd Qu.:0.4351     3rd Qu.:0.2904    
 Max.   :0.091131   Max.   :1.000000   Max.   :0.6276     Max.   :0.5254    
 mosscoverfraction waterpermanentcoverfraction waterseasonalcoverfraction
 Min.   :0         Min.   :0.0000000           Min.   :0.000000          
 1st Qu.:0         1st Qu.:0.0000000           1st Qu.:0.000000          
 Median :0         Median :0.0000000           Median :0.000000          
 Mean   :0         Mean   :0.0008774           Mean   :0.001152          
 3rd Qu.:0         3rd Qu.:0.0000000           3rd Qu.:0.000000          
 Max.   :0         Max.   :0.1184433           Max.   :0.161355          
 shrubcoverfraction snowcoverfraction treecoverfraction
 Min.   :0.0000     Min.   :0         Min.   :0.00000  
 1st Qu.:0.1160     1st Qu.:0         1st Qu.:0.08103  
 Median :0.1397     Median :0         Median :0.11777  
 Mean   :0.1307     Mean   :0         Mean   :0.15302  
 3rd Qu.:0.1594     3rd Qu.:0         3rd Qu.:0.19193  
 Max.   :0.2394     Max.   :0         Max.   :0.87709  
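As an aside, the pixel-to-proportion loop above can also be written with dplyr’s across(); a minimal sketch on made-up data using the same coverfraction naming convention:

```r
library(dplyr)

# toy EA-level pixel counts (values invented; names mimic the coverfraction columns)
toy <- tibble(
  cropscoverfraction = c(30, 10),
  treecoverfraction  = c(10, 30)
)

# divide every coverfraction column by the row total in one pipeline
toy <- toy |>
  mutate(total = rowSums(across(ends_with("coverfraction")))) |>
  mutate(across(ends_with("coverfraction"), ~ .x / total)) |>
  select(-total)

toy$cropscoverfraction  # 0.75 0.25
```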

The summary output above also provides some additional context for selecting the variables for our SAE models. In our SAE application below, we will be estimating a model using maximum likelihood estimation. In such cases, we can sometimes experience convergence issues, especially when some predictors have very little variation. In this example, several of the land cover classification variables have no variation at all – so they would never be selected in the lasso application we discuss below – while others show very little variation. Let’s remove these columns:

features <- features |>
  select(-c("mosscoverfraction", "waterpermanentcoverfraction", 
    "waterseasonalcoverfraction", "snowcoverfraction"))

Let’s consider three simple regressions, each with poverty on the left-hand side and two predictors, cropscoverfraction and mwpop, on the right-hand side, but with different transformations. In column one, all variables – including the outcome – are untransformed. In column two, we transform the outcome only, leaving both predictors untransformed. In column three, we transform both the outcome and the predictors, using the arcsin transformation for crop cover and the log transformation for population. The results are in Table 1.

# new data
pov <- read_csv("data/ihs5ea.csv")
# note: we join pov INTO features. This means we keep all admin-4 areas, with or without sample data
pov <- features |>
  left_join(pov, by = "EA_CODE") |>
  mutate(mwpop = mwpop/1000)

reg1 <- feols(poor ~ cropscoverfraction + mwpop, data = pov, weights = ~total_weights, vcov = "HC1")
reg2 <- feols(asin(sqrt(poor)) ~ cropscoverfraction + mwpop, data = pov, weights = ~total_weights, vcov = "HC1")
reg3 <- feols(asin(sqrt(poor)) ~ asin(sqrt(cropscoverfraction)) + log(mwpop), data = pov, weights = ~total_weights, vcov = "HC1")
Table 1: Variable transformations

Transforming just the outcome improves the fit, at least as measured by r-squared, by around 7.2 percent. Transforming the predictors as well increases the predictive power of the regression by 27.8 percent. In other words, transforming just these two variables, along with the outcome, can lead to a substantial improvement in the model.4

6.3 lasso and glmnet

In the above sections, we have shown how transforming our predictor and outcome variables can improve the fit of a model. However, we have not yet discussed how to choose which features to actually include in the model. Including all potential predictors has two primary problems:

  1. First, we often have more predictors than observations. In this case, it is impossible to estimate the model.
  2. Even if we have more observations than predictors, including all predictors can lead to overfitting. In essence, our model can end up predicting noise in the data, rather than true underlying relationships. This will lead to very poor performance, especially when predicting into out-of-sample data.

There are many ways to select predictors, but the most commonly used method is “lasso”.5 Lasso is a regression method that “penalizes” coefficients. To put it simply, it shrinks coefficients to zero if they do not meaningfully improve the performance of the model. In practice, this helps close the gap between in-sample and out-of-sample r-squared.

In its simplest form, lasso is a linear regression model with an additional penalty term. We minimize the following objective function with respect to \(\beta\):

\[ (y-X\beta)^2 + \lambda \sum_{j=1}^p |\beta_j|, \]

where \(\lambda\) is a tuning parameter that determines the size of the “penalty” and \((y-X\beta)^2\) is the usual ordinary least squares minimization problem. Importantly, \(\lambda\) can in principle take on any non-negative value. When \(\lambda=0\), lasso is just a linear regression. As \(\lambda\) increases, fewer and fewer variables are selected (i.e. have non-zero coefficients). So what is the “correct” value for \(\lambda\)? In practice, we often select \(\lambda\) using cross validation.
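To build intuition for how the penalty shrinks coefficients: in the special case of an orthonormal design, the lasso solution is a “soft-thresholded” version of the OLS coefficients (up to scaling conventions for \(\lambda\)). A small sketch of that operator:

```r
# soft-threshold operator: shrinks a coefficient toward zero and
# sets it exactly to zero once the penalty exceeds its magnitude
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)
}

soft_threshold(0.75, 0.25)   # shrunk toward zero: 0.5
soft_threshold(-0.75, 0.25)  # shrunk toward zero: -0.5
soft_threshold(0.20, 0.25)   # dropped from the model: 0
```

This is why lasso performs variable selection: small coefficients are set exactly to zero rather than merely shrunk, unlike, say, ridge regression.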

Figure 2: Cross validation set up

Cross validation is a process designed to mimic out-of-sample estimation. Consider the setup in Figure 2. We first take the sample data and randomly split it into \(N\) folds; the most common number of folds is 10, but the diagram shows five for simplicity. Since we are trying to mimic out-of-sample performance, we estimate a model on \(N-1\) folds and then predict on the remaining fold. We then calculate the mean squared error (MSE) of the prediction. We repeat this process \(N\) times, each time leaving out a different fold, and then average the MSE across all of the folds. We can repeat this whole procedure for different values of \(\lambda\); the final (“optimal”) value of \(\lambda\) is the one that minimizes the mean MSE across folds.
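The fold mechanics described above can be sketched by hand in a few lines of base R, here for a single fixed model on simulated data (no penalty involved, purely to illustrate the procedure):

```r
set.seed(1)
# simulated data: one predictor with a linear relationship plus noise
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# randomly assign each observation to one of 5 folds
k <- 5
fold <- sample(rep(1:k, length.out = n))

# fit on k-1 folds, predict on the held-out fold, record the MSE
mse <- numeric(k)
for (f in 1:k) {
  fit <- lm(y ~ x, data = dat[fold != f, ])
  pred <- predict(fit, newdata = dat[fold == f, ])
  mse[f] <- mean((dat$y[fold == f] - pred)^2)
}
mean(mse)  # cross-validated MSE estimate
```

In the lasso setting, this loop is simply repeated over a grid of \(\lambda\) values, which is exactly what cv.glmnet() automates.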

Thankfully, we do not need to do all of this by hand. Instead, we can use the cv.glmnet() function from the glmnet package, which automatically assigns observations to folds and calculates the MSE for different values of \(\lambda\). The package can be installed with install.packages("glmnet") and loaded with library(glmnet).

The following code illustrates this process using already cleaned data. The cleaned features are in the geovarseas.csv data and the outcome is in the ihs5ea.csv data. Both datasets are again from Northern Malawi and are already collapsed to the admin-4 (EA) level. First, load both datasets:

# load poverty
pov <- read_csv("data/ihs5ea.csv")
# load features
features <- read_csv("data/geovarseas.csv")
# add features to pov
pov <- pov |>
  left_join(features, by = "EA_CODE")
head(pov)
# A tibble: 6 × 530
   EA_CODE   poor total_weights total_obs  fold TA_CODE mwpop average_masked
     <dbl>  <dbl>         <dbl>     <dbl> <dbl>   <dbl> <dbl>          <dbl>
1 10101006 0.230          5690.        16     4   10101 1123.          0.519
2 10101011 0.444          7614.        16     2   10101 1224.          0    
3 10101027 0.0947         9441.        16     5   10101 1292.          4.27 
4 10101033 0.376          7486.        16     4   10101  672.          0    
5 10101039 0.600          9147.        16     1   10101 1056.          0    
6 10101054 0.497          5351.        16     5   10101  864.          0    
# ℹ 522 more variables: barecoverfraction <dbl>, urbancoverfraction <dbl>,
#   cropscoverfraction <dbl>, grasscoverfraction <dbl>,
#   mosscoverfraction <dbl>, waterpermanentcoverfraction <dbl>,
#   waterseasonalcoverfraction <dbl>, shrubcoverfraction <dbl>,
#   snowcoverfraction <dbl>, treecoverfraction <dbl>, ndvi1 <dbl>, ndvi2 <dbl>,
#   ndvi3 <dbl>, ndvi4 <dbl>, ndvi5 <dbl>, ndvi6 <dbl>, ndvi7 <dbl>,
#   ndvi8 <dbl>, ndvi9 <dbl>, ndvi10 <dbl>, ndvi11 <dbl>, ndvi12 <dbl>, …

We need to know which column contains the outcome of interest (in this case, poor) and which columns contain the predictors. The geovarseas.csv data has been cleaned such that it contains only two non-predictor columns: the admin-4 identifier (EA_CODE) and the admin-3 identifier (TA_CODE). In the joined pov data, then, all columns from mwpop (the seventh column) through ncol(pov) are predictors. We should be very specific about what Y and X are in this case before using cv.glmnet.6 In addition, recall that we discussed transforming the outcome variable, specifically. Let’s also do that here:

Y <- as.vector(asin(sqrt(pov$poor)))
# mwpop is column 7, so the predictors start there
X <- as.matrix(pov[,7:ncol(pov)])
# here is the cross validation to select lambda
set.seed(234056) # set seed for consistent results!
# five folds to keep with the simple example
cvresults <- cv.glmnet(x = X, y = Y, nfolds = 5)
cvresults

Call:  cv.glmnet(x = X, y = Y, nfolds = 5) 

Measure: Mean-Squared Error 

     Lambda Index Measure       SE Nonzero
min 0.01986    35 0.04466 0.002446      10
1se 0.04180    19 0.04697 0.002626       4

The cvresults object contains information about two candidate values of \(\lambda\): the value that minimizes the cross-validated MSE (lambda.min) and the largest value within one standard error of that minimum (lambda.1se), which is more conservative and will usually lead to fewer non-zero coefficients. In this example, the number of non-zero coefficients is either 10 or 4, depending on which \(\lambda\) we choose.7 We can use the plot() function to see the results, shown in Figure 3 (the dotted lines represent the two “optimal” values of \(\lambda\)).

plot(cvresults)
Figure 3: CV results

The next step is the most important: extracting the variables that have non-zero coefficients. While seeing the coefficients is straightforward, extracting the names of the variables is unfortunately more difficult than it probably should be.

To simply view the coefficients, we can use the coef() function, as shown below. Note that the code uses head() to keep the R output to a manageable size for this guide:8

head(coef(cvresults))
6 x 1 sparse Matrix of class "dgCMatrix"
                              s1
(Intercept)         5.752724e-01
mwpop               .           
average_masked     -1.427754e-02
barecoverfraction   3.923403e-05
urbancoverfraction  .           
cropscoverfraction  .           

Coefficients shown as a . are exactly zero. We can use this fact to extract the names of the variables with non-zero coefficients. We can see this if we turn the coef() result into a matrix:9

head(as.matrix(coef(cvresults)))
                              s1
(Intercept)         5.752724e-01
mwpop               0.000000e+00
average_masked     -1.427754e-02
barecoverfraction   3.923403e-05
urbancoverfraction  0.000000e+00
cropscoverfraction  0.000000e+00

After turning the results into a matrix, we can see that the . coefficients become zeros. With this in mind, we can finally extract the names of the non-zero coefficients, as follows:

# we can specify the lambda we want to use; "lambda.min" or "lambda.1se"
nonzero <- as.matrix(coef(cvresults, "lambda.min"))
nonzero <- rownames(nonzero)[nonzero[,1]!=0]
nonzero
 [1] "(Intercept)"        "average_masked"     "barecoverfraction" 
 [4] "urbancoverfraction" "mosaik1"            "mosaik39"          
 [7] "mosaik156"          "mosaik268"          "mosaik396"         
[10] "mosaik459"          "mosaik464"         
# now remove the intercept (since this is automatically added)
nonzero <- nonzero[-1]
nonzero
 [1] "average_masked"     "barecoverfraction"  "urbancoverfraction"
 [4] "mosaik1"            "mosaik39"           "mosaik156"         
 [7] "mosaik268"          "mosaik396"          "mosaik459"         
[10] "mosaik464"         

In this example, ten variables have non-zero coefficients, but we also want to remove the intercept, which is always the first element. We remove the intercept because it will automatically be added by the SAE function that we use later. This is done with the nonzero <- nonzero[-1] line of code above.

The final step is to turn this into a formula of the form \(\text{outcome} \sim x_1 + \cdots + x_n\), where \(x_1, \ldots, x_n\) are the variables with non-zero coefficients. We can do this with the paste() function and its collapse option, as follows:

ebpformula <- as.formula(paste("poor ~ ", paste(nonzero, collapse = " + ")))
ebpformula
poor ~ average_masked + barecoverfraction + urbancoverfraction + 
    mosaik1 + mosaik39 + mosaik156 + mosaik268 + mosaik396 + 
    mosaik459 + mosaik464

The collapse = " + " option tells paste() to separate the variables with a + sign. Finally, we have to explicitly tell R that this is a formula, which we do using as.formula(). This is the formula we will use in Section 7.
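As an aside, base R’s reformulate() wraps this paste-and-as.formula pattern in a single call; a minimal sketch with hypothetical variable names standing in for nonzero:

```r
# hypothetical selected variable names, standing in for `nonzero`
vars <- c("average_masked", "barecoverfraction")

# reformulate() builds "response ~ term1 + term2 + ..." directly
f <- reformulate(vars, response = "poor")
f  # poor ~ average_masked + barecoverfraction
```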

Footnotes

  1. This is the for loop at the end of the geospatialpull.ipynb notebook.↩︎

  2. The arcsin square root transformation is defined as \(y^*=\sin^{-1}(\sqrt{y})\).↩︎

  3. You can also use the lm() function from base R. However, we prefer the feols() function because it makes some things much easier, like adding weights.↩︎

  4. We note that we only show these results for illustrative purposes. In practice, we do not use r-squared as a measure of model fit, especially when only looking within the sample.↩︎

  5. Lasso stands for “least absolute shrinkage and selection operator,” but it is often simply referred to by its acronym.↩︎

  6. A small note: the outcome needs to be a vector and the predictors need to be in the form of a matrix. Both of these requirements are specified in the code.↩︎

  7. Note that we need to set a seed if we want consistent results when re-running the cross validation. This is because cross validation relies on some degree of randomness (in the allocation across folds).↩︎

  8. The two non-zero variables visible in the output are nightlights (average_masked) and the proportion of the area that is bare (barecoverfraction). Both coefficients are in the expected direction.↩︎

  9. Again, head() is used to limit the output to the first few variables for practical purposes.↩︎