16 Spatial Autocorrelation

Tobler’s First Law of Geography reminds us that everything is related to everything else, but near things are more related than distant things. Spatial autocorrelation describes how a variable is correlated with itself across geographic space. In other words, spatial autocorrelation can tell us how related near things are to each other in our dataset.

Point pattern analysis is focused on the location of events. In point patterns, the source of randomness is the location itself. In assessing autocorrelation, we are considering whether the value at locations is random.

There are two general scenarios in which we might want to analyze autocorrelation:

Areal units: Many sources of geographical data come aggregated into zones that may or may not be meaningful. For instance, census data is aggregated from the individual household level to statistical units (like census tracts). In these cases, the location of the units is pre-determined, so it is not meaningful to analyze location. Instead, it is useful to consider values at these fixed locations to determine if there are spatial patterns in the variable.
Locations: For some datasets, we might be interested both in the pattern of location AND the pattern in value. In these cases, both the location and value are meaningful– for instance, our lightning strike dataset. In this case, we might be interested in both where the lightning strikes are and something about the the strikes when they occur (for instance, are more powerful strikes located near each other).

Just like point pattern analysis, quantifying autocorrelation is useful because it helps us understand the spatial pattern/structure of our data, which helps us in generating hypotheses about the processes that generate the pattern.

In addition, spatial autocorrelation allows us to understand how spatially dependent our variable is. Spatial dependence can be a serious problem for applying traditional statistical methods because these methods generally have an assumption of independence. Measuring the strength of spatial dependence helps us evaluate how serious this issue is and decide whether we need to adjust our approach- for example, by modifying our sampling design, using methods that account for spatial structure, or interpreting results with greater caution.

In this chapter, we will explore autocorrelation in the Median Household Income variable from the 2023 American Community Survey for North Carolina census tracts and chronic absences at North Carolina public schools.

To follow along with this tutorial, make a new .Rmd document. As you move through the tutorial add chunks, headers, and relevant text to your document.

16.1 Reading in Data

library(tmap)
library(tidyverse)
library(spdep)
library(sf)

#Note that the crs 
#of the datasets is NAD83/ North Carolina, meaning that the units are feet
tracts_sf <- st_read("https://drive.google.com/uc?export=download&id=1_X3s6sTw5zeIuXa6wDQ61r8Px7_tI7Nr")

schools_sf <- st_read("https://drive.google.com/uc?export=download&id=1mUxk0R_lqcfuJppw_A-1vy-pcB9XdCnS")

16.2 Defining Neighbors

In order to quantify spatial autocorrelation, we need to define each observation’s “spatial neighborhood” so that we can see how related an observation is to the observations near it. There are many ways to define neighbors, and the decision about what definition to use should reflect the processes you hypothesize might be shaping the spatial pattern. For example, if you are studying the spread of a disease, it may be appropriate to define neighbors based on travel networks or adjacency of regions. If you are examining soil properties, a distance-based neighborhood might better reflect how conditions diffuse across space. Some neighborhood definitions can be used for both areal and point datasets and some are specific to the unit of analysis.

We can explore three different neighborhood definitions using our datasets:

16.2.1 Distance Threshold (Point and Areal)

One way to determine neighbors is to set a distance threshold. Any observations within a set distance would be considered a neighbor. For our census tracts, we could set a distance of 3 miles. The code below demonstrates the neighborhood for the census tract that campus is in.

#isolate ch tract
ch_tract <- tracts_sf |> filter(GEOID == 37135011400)

#get 10 mile radius
neighborhood_rad <- ch_tract |> st_buffer(15840)

#get neighbors
neighbors <- st_filter(tracts_sf, neighborhood_rad)
tm_shape(neighbors) + tm_polygons() + tm_shape(ch_tract) + tm_polygons(fill = "blue")

For our school dataset, we could say that any school within 20 miles is a neighbor

#isolate Chapel Hill High School
chhs <- schools_sf |> filter(school_nam == "Chapel Hill High")

#get 5 mile radius
chhs_neighborhood_rad <- chhs |> st_buffer(26400)

#get neighbors
chhs_neighbors <- st_filter(schools_sf, chhs_neighborhood_rad)


tm_shape(chhs_neighbors) + tm_dots() + tm_shape(chhs) + tm_dots(fill = "blue")

16.2.2 Queens Case (Areal)

The Queens Case neighborhood definition is extremely common in spatial statistics that consider contiguous aerial units (like census tracts). The Queens neighborhood selects neighbors based on any contact (including corners) with the feature of interest. We can’t use contiguity based neighbors for points since they do not touch. The Queens neighborhood for Chapel Hill would look this this:

queen_neighbors <- tracts_sf |> st_filter(ch_tract, .predicates = st_intersects) 

tm_shape(queen_neighbors) + tm_polygons() + tm_shape(ch_tract) + tm_polygons(fill = "blue")

16.3 K-Nearest Neighbors (Areal and Point)

The k-nearest neighbor definition selects the nearest neighbors based on a set number of neighbors. For instance, for our Chapel Hill High point, we could set number of neighbors to 3.

#calculate distances
calc_dist <- chhs_neighbors |> st_distance(chhs)

#assign distances as a variable
chhs_neighbors$distance <- calc_dist

#get 3 closest neighbors (we say 4 since CHHS will be the first closest (0ft))
neighbors <- chhs_neighbors |> arrange(distance) |> slice_head(n = 4)

tm_shape(neighbors) + tm_dots() + tm_shape(chhs) + tm_dots(fill = "blue")

16.4 Calculating Global Autocorrelation

To calculate global autocorrelation, we can use Moran’s I. Moran’s I gives an indication of how clustered, random, or dispersed a dataset value is across space. It goes from -1 (perfect dispersion) to 1 (perfect clustering). Let’s start by making a basic map of our variable of interest.

The main parameter for calculating Moran’s I is the neighborhood definition.

16.4.1 Global Autocorrelation in Median Household Income

We can start by making a basic map of our variable of interest.

tm_shape(tracts_sf) + tm_polygons(fill = "med_hh_inc", fill.scale = tm_scale_intervals(style = "fisher"), col_alpha = 0)

Q1: Just by looking at the values on the map, do you see evidence of a spatial pattern?

We can formalize this by calculating Moran’s I using Queens Case neighbors

#calculate neighborhoods
nb <- poly2nb(tracts_sf, queen = TRUE) # queen shares point or border

#set neighborhood weight matrix
nbw <- nb2listw(nb, style = "W")

#calculate Morans I
gmoran <- moran.test(tracts_sf$med_hh_inc, nbw)
gmoran


    Moran I test under randomisation

data:  tracts_sf$med_hh_inc  
weights: nbw    

Moran I statistic standard deviate = 51.74, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.5886766588     -0.0003806624      0.0001296167

Our Moran’s I indicates strong clustering (.59) at a highly statistically significant level (very small p-value). Therefore, we now know that, at the global scale, this dataset is clustered.

16.4.2 Global Autocorrelation in School Absence Rate

We can map chronic absence rate before Covid (which we’ve done before)

tm_shape(schools_sf) + tm_dots(fill = "pre_covid_pct", fill.scale = tm_scale_intervals(style = "fisher"), col_alpha = 0)

Q2: Just by looking at the values on the map, do you see evidence of a spatial pattern?

We can then calculate Moran’s I using a our 20-mile distance-based neighborhood definition

#calculate neighborhoods
nb_school <- dnearneigh(schools_sf, d1 = 0, d2 = 26400)

Warning in dnearneigh(schools_sf, d1 = 0, d2 = 26400): neighbour object has 247
sub-graphs

#set neighborhood weight matrix
nbw_school <- nb2listw(nb_school, style = "W", zero.policy = T)

#calculate Morans I
gmoran_school <- moran.test(schools_sf$pre_covid_pct, nbw_school)
gmoran_school


    Moran I test under randomisation

data:  schools_sf$pre_covid_pct  
weights: nbw_school  
n reduced by no-neighbour observations  

Moran I statistic standard deviate = 15.434, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.2181149958     -0.0004965243      0.0002006250

Q3: What does our Moran I value tell us about the spatial structure of our data?

16.5 Local Indicators of Spatial Autocorrelation (LISA)

We know that both of our datasets are clustered because of our Moran’s I values (with median household income strongly clustered and chronic absence rate weakly clustered). However, we don’t know where, specifically, these high and low clusters are. Local Indicators of Spatial Autocorrelation quantify where statistically significant clusters exist in a dataset.

For the median household income:

locali<-localmoran_perm(tracts_sf$med_hh_inc, nbw) %>%
  as_tibble() %>%
  set_names(c("local_i", "exp_i", "var_i", "z_i", "p_i",
              "p_i_sim", "pi_sim_folded", "skewness", "kurtosis"))

#bind LISA results
tracts_sf <- tracts_sf %>%
  bind_cols(locali)

#Divide into quadrants
tracts_sf <- tracts_sf %>%
  mutate(incz =  as.numeric(scale(med_hh_inc)),
         evratezlag = lag.listw(nbw, incz),
         lisa_cluster = case_when(
           p_i >= 0.05 ~ "Not significant",
           incz > 0 & local_i > 0 ~ "High-high",
           incz > 0 & local_i < 0 ~ "High-low",
           incz < 0 & local_i > 0 ~ "Low-low",
           incz < 0 & local_i < 0 ~ "Low-high"
         ))

#Map
tm_shape(tracts_sf) + tm_polygons("lisa_cluster", col_alpha = 0)

Q2: What is the spatial pattern of clustering in the median household income variable?

For the school chronic absences:

locali_school<-localmoran_perm(schools_sf$pre_covid_pct, nbw_school) %>%
  as_tibble() %>%
  set_names(c("local_i", "exp_i", "var_i", "z_i", "p_i",
              "p_i_sim", "pi_sim_folded", "skewness", "kurtosis"))

#bind LISA results
schools_sf <- schools_sf %>%
  bind_cols(locali_school)

#Divide into quadrants
schools_sf  <- schools_sf  %>%
  mutate(absz =  as.numeric(scale(pre_covid_pct)),
         evratezlag = lag.listw(nbw_school, absz),
         lisa_cluster = case_when(
           p_i >= 0.05 ~ "Not significant",
           absz > 0 & local_i > 0 ~ "High-high",
           absz > 0 & local_i < 0 ~ "High-low",
           absz < 0 & local_i > 0 ~ "Low-low",
           absz < 0 & local_i < 0 ~ "Low-high"
         ))

#Map
tm_shape(schools_sf ) + tm_dots("lisa_cluster", col_alpha = 0)

Q3: What is the spatial pattern of clustering in the school absence variable?

16.6 Mini Challenge

Calculate global and local indicators of spatial autocorrelation for the average commuting time variable and interpret the results
Calculate global and local indicators of spatial autocorrelation for the chronic school absences after covid and interpret the results