library(tidyverse)
library(sf)
nc_counties <- st_read("https://drive.google.com/uc?export=download&id=1g9sGIikgOEubqoj97fUVoCYAKlBDVX5a")
aq <- read_csv("https://drive.google.com/uc?export=download&id=1sP4D1SdDorB6c_MZo7H4mpkHoKOaNqIT")12 Creating Analytic Datasets
In this chapter, we will work on manipulating and combining data to create analytic datasets.
12.1 RQ1: What is the distribution of moderate or worse air quality days by NC county in 2025?
In this research question, our goal is to create an analytic variable representing the percent of air quality observations per county in 2025 that were moderate or worse air quality.
12.1.1 Data:
- North Carolina county boundaries from US Census Bureau
- 2025 air quality data by station from US EPA
12.1.2 Processing Steps:
- In the
aqobject, each row represents a single day in 2025 at a single station. Aggregate theaqobject to the station level (this will summarize values at that station over the year). The aggregated dataset should include two calculated columnstotal_obs(total number of observations in the year)total_less(total observations less than good AQI (AQI > 50). Thegroup_bycommand must include latitude and longitude as well as station ID. - Spatialize the station data using the
st_as_sf()command. The CRS = 4326 - Reproject the
nc_countiesandaqobjects into CRS = 2264 (North Carolina State Plane, ft) - Execute a spatial join between
aqandnc_counties. Drop geometry and aggregate theaqobject to the county level (take the sum of thetotal_obsandtotal_lesscolumns per county). - Calculate a
perc_lessvariable representing the percent of county observations that are less than good AQI. - Use a table join to add your county-aggregated data to the
nc_countiesobject. - Create a simplified dataset that includes only the following fields:
GEOID,NAME,total_obs,total_less,perc_less.
12.2 RQ2: How does tree cover canopy vary within walking distance from bus stops in Chapel Hill, NC?
In this research question, our goal is to create an analytic variable representing the percent tree canopy cover within .5 miles of each bus stop in Chapel Hill, NC.
12.2.1 Data:
- Bus stop locations from Chapel Hill Open Data
- Tree canopy cover from NLCD. Each pixel value represents the percent of tree canopy cover in that pixel.
library(terra)
library(exactextractr)
bus_stops <- st_read("https://drive.google.com/uc?export=download&id=1jRINUl-5uBAsBcWnKTRmdUO7ZTs1EV4G")
tree_canopy <- rast("https://drive.google.com/uc?export=download&id=1_SqO-ocyLCg3g1Mv1ZbqOd5Pa1sTddps")12.2.2 Processing Steps:
- Transform
bus_stopsobject into EPSG:3857 to match the tree canopy projection. Note that the units in this projection are meters - Buffer the
bus_stops804.672 meters (.5 miles) - Use the
exact_extract()function to add a field to the buffered bus stop object that represents the average canopy cover - Create a simplified dataset that includes only the following fields:
STOP_ID, average tree canopy variable
12.3 RQ3: How accessible are bus stops to Chapel Hill addresses?
In this research question, our goal is to create an analytic variable representing the distance of each address in Chapel Hill to the nearest bus stop.
12.3.1 Data:
- Bus stop locations from Chapel Hill Open Data
- Chapel Hill addresses from Chapel Hill Open Data
ch_addresses <- st_read("https://drive.google.com/uc?export=download&id=1fFXfEbOWjwfsT_JLeLYnaVPkbJCGYA2_")12.3.2 Processing Steps:
- Transform the
ch_addressesobject into EPSG:3857 to match the bus stop object. - Calculate the distance (in meters) from each address to the nearest bus stop.
- Create a simplified dataset that includes only the following fields:
OBJECTID,LBCSDesc, distance variable