12  Creating Analytic Datasets

In this chapter, we will work on manipulating and combining data to create analytic datasets.

12.1 RQ1: What is the distribution of warm summer days in 2025 by North Carolina county?

In this research question, our goal is to create an analytic variable representing the percent of daily average summer temperatures that were above 70 degrees Fahrenheit, by NC county.

12.1.1 Data:

  • North Carolina county boundaries from US Census Bureau
  • 2025 summer season daily average temperature values for each weather station in NC, from the North Carolina State Climate Office

library(tidyverse)
library(sf)

nc_counties <- st_read("https://drive.google.com/uc?export=download&id=1g9sGIikgOEubqoj97fUVoCYAKlBDVX5a")

temp <- read_csv("https://drive.google.com/uc?export=download&id=16wJAatPKM0cF7VNy8hwNWOeJiiDNvsTt")

12.1.2 Processing Steps:

  • In the temp object, each row represents a single day in the 2025 summer season (June - August) at a single weather station in North Carolina. Aggregate the temp object to the station level, summarizing each station's values over the season. The aggregated dataset should include two calculated columns: total_obs (the total number of observations per station) and total_over_70 (the number of observations per station where the daily temperature value is over 70 degrees). The group_by() call must include site, latitude, and longitude.
  • Spatialize the station-level data using the st_as_sf() command. The CRS = 4326.
  • Reproject the nc_counties and spatialized temp objects into CRS = 2264 (North Carolina State Plane, ft), e.g. nc_counties <- nc_counties |> st_transform(crs = 2264).
  • Execute a spatial join between your reprojected temperature object and reprojected nc_counties object. Drop geometry and aggregate the temp object to the county level (take the sum of the total_obs and total_over_70 columns per county).
  • Calculate a pct_over_70 variable representing the percent of county observations that are over 70 degrees.
  • Use a table join to add your county-aggregated data to the nc_counties object.
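The steps above can be sketched end to end on toy data. This is a minimal illustration, not a reference solution: the two square "counties", the three stations, the temperature column name tavg, and the county key column are all made up for the example, so they will need to be swapped for the names actually present in the temp and nc_counties objects.

```r
library(dplyr)
library(sf)

# Toy daily temperatures (hypothetical values; `tavg` is an assumed column name)
temp_toy <- tibble::tibble(
  site      = c("A", "A", "A", "B", "B", "C"),
  latitude  = c(35.5, 35.5, 35.5, 35.5, 35.5, 35.25),
  longitude = c(-79.5, -79.5, -79.5, -78.5, -78.5, -79.25),
  tavg      = c(65, 72, 75, 68, 71, 80)
)

# Two toy square "counties" in lon/lat (hypothetical stand-in for nc_counties)
make_square <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}
counties_toy <- st_sf(
  county   = c("West", "East"),
  geometry = st_sfc(make_square(-80, 35, -79, 36),
                    make_square(-79, 35, -78, 36), crs = 4326)
)

# 1. Aggregate daily rows to one row per station
stations <- temp_toy |>
  group_by(site, latitude, longitude) |>
  summarise(
    total_obs     = n(),
    total_over_70 = sum(tavg > 70, na.rm = TRUE),
    .groups = "drop"
  )

# 2. Spatialize the stations, then 3. reproject both layers to EPSG:2264
stations_sf <- stations |>
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) |>
  st_transform(crs = 2264)
counties_toy <- counties_toy |> st_transform(crs = 2264)

# 4. Spatial join (points to polygons), drop geometry, aggregate per county,
# 5. then compute the percent of observations over 70 degrees
county_temp <- st_join(stations_sf, counties_toy) |>
  st_drop_geometry() |>
  group_by(county) |>
  summarise(
    total_obs     = sum(total_obs),
    total_over_70 = sum(total_over_70),
    .groups = "drop"
  ) |>
  mutate(pct_over_70 = 100 * total_over_70 / total_obs)

# 6. Table join back onto the county polygons
counties_final <- counties_toy |> left_join(county_temp, by = "county")
```

With these toy values, the West county has 3 of 4 observations over 70 degrees (75 percent) and the East county has 1 of 2 (50 percent). In real data, counties without any station keep NA in the joined columns after the left join, which is worth checking before mapping.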

12.2 RQ2: How does tree canopy cover vary within walking distance of bus stops in Chapel Hill, NC?

In this research question, our goal is to create an analytic variable representing the percent tree canopy cover within 0.5 miles of each bus stop in Chapel Hill, NC.

12.2.1 Data:

  • Bus stop locations from Chapel Hill Open Data
  • Tree canopy cover from NLCD. Each pixel value represents the percent of tree canopy cover in that pixel.

library(terra)
library(exactextractr)

bus_stops <- st_read("https://drive.google.com/uc?export=download&id=1jRINUl-5uBAsBcWnKTRmdUO7ZTs1EV4G")

tree_canopy <- rast("https://drive.google.com/uc?export=download&id=1_SqO-ocyLCg3g1Mv1ZbqOd5Pa1sTddps")

12.2.2 Processing Steps:

  • Transform the bus_stops object into EPSG:3857 to match the tree canopy projection. Note that the units in this projection are meters.
  • Buffer the bus_stops object by 804.672 meters (0.5 miles).
  • Use the exact_extract() function to add a field to the buffered bus stop object that represents the average canopy cover.
  • Create a simplified dataset that includes only the following fields: STOP_ID, average tree canopy variable
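The steps above can be sketched on a small synthetic raster. This is an illustration under stated assumptions rather than a reference solution: the raster values, the two stop locations, and the output column name avg_canopy are invented for the example, and only STOP_ID comes from the exercise text.

```r
library(dplyr)
library(sf)
library(terra)
library(exactextractr)

# Toy canopy raster in EPSG:3857: 40 x 40 cells of 100 m each.
# Left half = 20 percent canopy, right half = 60 percent (made-up values).
canopy_toy <- rast(nrows = 40, ncols = 40,
                   xmin = 0, xmax = 4000, ymin = 0, ymax = 4000,
                   crs = "EPSG:3857")
values(canopy_toy) <- rep(c(rep(20, 20), rep(60, 20)), times = 40)

# Toy stops in lon/lat (hypothetical locations), one over each half of the raster
stops_toy <- st_sf(
  STOP_ID  = c(101, 102),
  geometry = st_sfc(st_point(c(0.009, 0.018)),
                    st_point(c(0.027, 0.018)), crs = 4326)
)

# 1. Reproject to match the raster, then 2. buffer by 0.5 miles in meters
buffers <- stops_toy |>
  st_transform(crs = 3857) |>
  st_buffer(dist = 804.672)

# 3. Area-weighted mean of the pixel values under each buffer
buffers$avg_canopy <- exact_extract(canopy_toy, buffers, "mean", progress = FALSE)

# 4. Keep only the requested fields; geometry is dropped here as an assumption
#    about what "simplified" means (skip st_drop_geometry() to keep it for mapping)
canopy_summary <- buffers |>
  st_drop_geometry() |>
  select(STOP_ID, avg_canopy)
```

Because each toy buffer falls entirely within one half of the raster, the two averages come out as 20 and 60. One caveat worth keeping in mind with the real data: EPSG:3857 stretches distances away from the equator, so a buffer of 804.672 projected meters covers noticeably less than 0.5 miles on the ground at Chapel Hill's latitude.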

12.3 RQ3: How accessible are bus stops to Chapel Hill addresses?

In this research question, our goal is to create an analytic variable representing the distance of each address in Chapel Hill to the nearest bus stop.

12.3.1 Data:

ch_addresses <- st_read("https://drive.google.com/uc?export=download&id=1fFXfEbOWjwfsT_JLeLYnaVPkbJCGYA2_")

12.3.2 Processing Steps:

  • Transform the ch_addresses object into EPSG:3857 to match the bus stop object.
  • Calculate the distance (in meters) from each address to the nearest bus stop.
  • Create a simplified dataset that includes only the following fields: OBJECTID, LBCSDesc, distance variable
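One possible way to carry out these steps pairs st_nearest_feature() with an element-wise st_distance(). The sketch below uses invented points: the coordinates, the LBCSDesc values, and the output column name dist_to_stop_m are assumptions for illustration, and the toy stops stand in for the bus_stops object already transformed to EPSG:3857.

```r
library(dplyr)
library(sf)

# Toy addresses and bus stops in lon/lat (hypothetical locations and attributes)
addresses_toy <- st_sf(
  OBJECTID = 1:3,
  LBCSDesc = c("Residential", "Commercial", "Residential"),
  geometry = st_sfc(st_point(c(-79.055, 35.913)),
                    st_point(c(-79.040, 35.920)),
                    st_point(c(-79.070, 35.900)), crs = 4326)
)
stops_toy <- st_sf(
  STOP_ID  = c(101, 102),
  geometry = st_sfc(st_point(c(-79.050, 35.910)),
                    st_point(c(-79.045, 35.925)), crs = 4326)
)

# 1. Put both layers in EPSG:3857 so distances are computed in projected meters
addresses_3857 <- st_transform(addresses_toy, crs = 3857)
stops_3857     <- st_transform(stops_toy, crs = 3857)

# 2. Index of the nearest stop for each address, then the distance to that stop
nearest_idx <- st_nearest_feature(addresses_3857, stops_3857)
nearest_dist <- st_distance(addresses_3857, stops_3857[nearest_idx, ],
                            by_element = TRUE)

# 3. Keep only the requested fields (geometry dropped, units stripped to numeric)
address_access <- addresses_3857 |>
  mutate(dist_to_stop_m = as.numeric(nearest_dist)) |>
  st_drop_geometry() |>
  select(OBJECTID, LBCSDesc, dist_to_stop_m)
```

Computing the full address-by-stop distance matrix and taking the row minimum gives the same answer, but looking up the nearest stop first avoids building that matrix, which can matter when there are many addresses.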