17 Bivariate Relationships and Regression

In this chapter, we will explore relationships between two variables. There are multiple ways to explore bivariate relationships, including visual and descriptive approaches such as scatterplots and bivariate maps that allow us to identify patterns without making causal claims. These methods help us understand how two variables co-vary across space.

Linear regression provides a statistical framework for modeling relationships between two variables and evaluating whether an observed association is unlikely to have occurred by chance. Ordinary least squares (OLS) regression allows us to test hypotheses about the strength and direction of a relationship. However, like hypothesis tests, standard regression models assume that observations are independent. When spatial dependence is present, OLS regression can produce biased estimates. When spatial dependence is present, we can use alternative, spatially-explicit regression approaches.

The models in this chapter (OLS, SLM, SEM) are based on frequentist inference. This means we treat our parameters (like the effect of a Bachelor’s degree) as fixed, unknown constants. We use Maximum Likelihood Estimation (MLE) to find the single most likely value for these parameters based on our data.

We will use a spatial dataset representing decrease in workplace mobility in the first months of the COVID-19 pandemic across North Carolina counties, as well as 2019 ACS estimates for Bachelor’s degree attainment.

To follow along with this tutorial, make a new .Rmd document. As you move through the tutorial add chunks, headers, and relevant text to your document.

library(tidyverse)
library(tmap)
library(sf)
library(spatialreg)
library(spdep)

#turn negative into positive, higher values mean MORE workplace change (i.e. less travel to workplaces)
county_covid <- st_read("https://drive.google.com/uc?export=download&id=15njmLhQMvg4KNpFthR0XyS1LD3reotus") |> mutate(workplace_change = workplace_change * -1)

17.1 Testing Correlation

Correlation provides a single summary measure of the strength and direction of association between two numeric variables. The Pearson correlation coefficient ranges from −1 to 1, where values closer to −1 or 1 indicate stronger linear relationships, and values near 0 indicate little to no linear association. Correlation is useful as an initial diagnostic

#test correlation
cor(county_covid$pct_bachelors, county_covid$workplace_change)

[1] 0.7981348

Q1: What does our correlation value tell us about the strength and direction of the relationship between changes in workplace mobility during covid and the percent of people with a Bachelor’s degree in each county?

17.2 Scatterplot

One of the simplest visualizations we can use to explore bivariate relationships is a scatterplot. For instance, if we want to explore the relationship between our no health insurance variable and our poverty variable, we could do the following

ggplot(county_covid, aes(x = pct_bachelors, y= workplace_change))  + geom_point()

17.3 Bivariate Map

While scatterplots ignore geography, bivariate maps allow us to explore how two variables co-vary across space. A bivariate map classifies each variable into categories (for instance, quantiles) and combines them into a single color scheme, making it possible to identify areas where both variables are simultaneously high or low.

#bivariate map based on quantiles
tm_shape(county_covid) + 
  tm_polygons(fill = tm_vars(c("pct_bachelors", "workplace_change"), multivariate = TRUE),
              fill.scale = tm_scale_bivariate(values ="bivario.folk_warmth", 
                scale1 = tm_scale_intervals(n = 4, style = "quantile", labels = c("Lo", "", "", "Hi")), 
                scale2 = tm_scale_intervals(n=4, style = "quantile", labels = c("Lo", "", "", "Hi"))))

Q2: What does our bivariate map tell us about where the relationships between these two variables are the strongest?

17.4 Regression

Linear regression provides a formal statistical framework for modeling the relationship between two variables. In an ordinary least squares (OLS) regression, we estimate the expected value of a dependent variable as a linear function of an independent variable. OLS allows us to quantify the strength and direction of the relationship and to assess whether the observed association is unlikely to have occurred by chance.

For the following model, assume that the assumptions (besides independence) have been met to use an OLS. To interpret the strength, direction, and significance of the model below, we will examine the coefficients, adjusted R-squared, and the p-values of the coefficient and the model overall. For example, this model tells us that the baseline percent change (with 0% bachelor degree) is 17%, that every .5% increase in Bachelor’s degree percentage is associated with a 1% increase in workplace change (decreased mobility). This coefficient is statistically significant. Our Adjusted R-squared value tells us that our model explains 63% of the variation in workplace change and our model p-value tells us that the model itself is statistically significant.

#traditional regression
ols_model <- lm(workplace_change ~ pct_bachelors, data = county_covid)

#see results
summary(ols_model)


Call:
lm(formula = workplace_change ~ pct_bachelors, data = county_covid)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.2312  -2.7965  -0.1164   2.1867  18.2634 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    17.7051     1.1202   15.80   <2e-16 ***
pct_bachelors   0.5452     0.0420   12.98   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.254 on 96 degrees of freedom
Multiple R-squared:  0.637, Adjusted R-squared:  0.6332 
F-statistic: 168.5 on 1 and 96 DF,  p-value: < 2.2e-16

However, OLS regression relies on several assumptions, including that observations are independent. One way to evaluate whether spatial dependence is present is to examine the residuals from an OLS regression. If the model has adequately captured the underlying process, residuals should be randomly distributed across space. Spatial clustering in residuals suggests that important spatial structure remains unmodeled. A statistically significant Moran’s I indicates that the independence assumption of OLS has been violated and that a spatial regression approach may be more appropriate.

#add residuals as a variable
county_covid <-county_covid |> mutate(olsresid = resid(ols_model)) 

#map residuals
tm_shape(county_covid) + tm_polygons(fill = "olsresid",fill.scale = tm_scale_intervals(values = "bu_pu", style = "quantile", n = 5))

Multiple palettes called "bu_pu" found: "brewer.bu_pu", "matplotlib.bu_pu". The first one, "brewer.bu_pu", is returned.

#calculate neighborhoods
nb <- poly2nb(county_covid, queen = TRUE)

#set neighborhood weight matrix
nbw <- nb2listw(nb, style = "W", zero.policy = T)

#calculate Morans I
gmoran <- moran.test(county_covid$olsresid, nbw)
gmoran


    Moran I test under randomisation

data:  county_covid$olsresid  
weights: nbw    

Moran I statistic standard deviate = 5.3026, p-value = 5.71e-08
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.330622034      -0.010309278       0.004133931

Q3: Do our results indicate that we should use a spatial model?

17.5 Spatial Regression

When spatial dependence is present, we can use spatial regression models that explicitly incorporate spatial structure. Two common approaches are the spatial lag model and the spatial error model. Traditional spatial regressions considers spatial autocorrelation as a property that should be accounted for or removed to obtain unbiased estimates. While frequentist models like SEM treat spatial autocorrelation as ‘noise to be removed,’ we can also think of it as a latent (hidden) process. In the next unit, we will see how Bayesian models treat this spatial structure as a ‘random effect’ that we can map and analyze in its own right.

There are two main models that can account for spatial dependence.

The spatial lag model includes a spatially lagged version of the dependent variable, allowing outcomes in one location to be influenced by outcomes in neighboring locations. This approach is appropriate when the process itself is spatially contagious or when values diffuse across space.

The spatial error model, in contrast, assumes that spatial dependence operates through the error term rather than through direct interaction between dependent variables. This approach is appropriate when unobserved spatial processes affect the outcome but are not explicitly included in the model.

#lag model
fit.lag<-lagsarlm(workplace_change ~ pct_bachelors,  
                  data = county_covid, 
                  listw = nbw, zero.policy = T) 
#see results
summary(fit.lag)


Call:lagsarlm(formula = workplace_change ~ pct_bachelors, data = county_covid, 
    listw = nbw, zero.policy = T)

Residuals:
        Min          1Q      Median          3Q         Max 
-10.0665446  -2.0402522  -0.0087854   1.4828038  16.3029135 

Type: lag 
Coefficients: (asymptotic standard errors) 
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    5.65338    2.40201  2.3536  0.01859
pct_bachelors  0.47263    0.03912 12.0814  < 2e-16

Rho: 0.44395, LR test value: 23.78, p-value: 1.0797e-06
Asymptotic standard error: 0.082584
    z-value: 5.3757, p-value: 7.6272e-08
Wald statistic: 28.898, p-value: 7.6272e-08

Log likelihood: -268.0403 for lag model
ML residual variance (sigma squared): 13.256, (sigma: 3.6409)
Number of observations: 98 
Number of parameters estimated: 4 
AIC: 544.08, (AIC for lm: 565.86)
LM test for residual autocorrelation
test value: 0.8593, p-value: 0.35393

#error model
fit.error<-errorsarlm(workplace_change ~ pct_bachelors,  
                  data = county_covid, 
                  listw = nbw, zero.policy = T)  
#see results
summary(fit.error)


Call:
errorsarlm(formula = workplace_change ~ pct_bachelors, data = county_covid, 
    listw = nbw, zero.policy = T)

Residuals:
     Min       1Q   Median       3Q      Max 
-8.82518 -2.08173 -0.37913  1.55315 15.64620 

Type: error 
Coefficients: (asymptotic standard errors) 
               Estimate Std. Error z value  Pr(>|z|)
(Intercept)   18.897359   1.298656  14.552 < 2.2e-16
pct_bachelors  0.501675   0.039237  12.786 < 2.2e-16

Lambda: 0.59034, LR test value: 22.774, p-value: 1.8218e-06
Asymptotic standard error: 0.096719
    z-value: 6.1037, p-value: 1.0366e-09
Wald statistic: 37.255, p-value: 1.0366e-09

Log likelihood: -268.5434 for error model
ML residual variance (sigma squared): 12.816, (sigma: 3.5799)
Number of observations: 98 
Number of parameters estimated: 4 
AIC: 545.09, (AIC for lm: 565.86)

The spatial lag model examines how workplace change in one county is influenced not only by the percent of the population with a bachelor’s degree in that county, but also by workplace change in neighboring counties. The model indicates that for every 1 percentage point increase in the population with a bachelor’s degree, workplace change increases by approximately 0.47 units. The spatial parameter (Rho = 0.44) is positive and significant, meaning counties tend to resemble their neighbors in terms of workplace change. Model fit is improved compared to standard OLS, as shown by a lower AIC (544.08 vs. 565.86). Residuals show no significant spatial autocorrelation after accounting for the lag, suggesting the model adequately captures the spatial dependence.

The spatial error model accounts for unobserved spatial autocorrelation in the error term rather than in the dependent variable. In this model, every 1 percentage point increase in the population with a bachelor’s degree is associated with an increase of roughly 0.50 units in workplace change. The spatial error parameter (Lambda = 0.59) is positive and significant, indicating that unexplained variation in workplace change is spatially structured. Like the lag model, the AIC is lower than OLS (545.09), showing an improvement in model fit. This model assumes that spatial dependence exists in unobserved factors rather than directly through neighboring counties’ values.

Both models show that higher educational attainment is strongly associated with greater workplace change. The choice between the models depends on the assumed mechanism of spatial dependence. The lag model suggests that counties are directly influenced by the workplace change of neighboring counties, while the error model implies that spatial correlation exists only in unmeasured factors. Based on the context of workplace mobility—where nearby counties are likely to respond to similar conditions, policies, or commuting patterns—the spatial error model is likely the more realistic representation of the process.