Goals of this Notebook

This notebook walks through the main tidypollute functions using a functional example (pun intended).


This example retrieves daily measurements from the EPA AirData flat files for the analyte (pollutant) LEAD over a given time range (1999 to 2000), and links them to fake participant data.

library(tidypollute)
lead <- get_epa_airdata(
  analyte = "LEAD",
  start_year = 1999,
  end_year = 2000,
  freq = "daily"
)
## 
## Preparing to download:
## Analyte: LEAD 
## Years: 1999 to 2000 
## Number of files: 2 
## Freq of data: daily 
## Output directory: data/
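
The later steps in this notebook read four columns from the downloaded data: state_name, county_name, date_local, and arithmetic_mean. A quick check along the following lines can catch a schema mismatch early. This is a minimal sketch: `lead_toy` is a hypothetical one-row stand-in so the block runs without a download, and you would substitute the real `lead` object.

```r
# Hypothetical stand-in for the downloaded data; swap in `lead` in practice.
lead_toy <- data.frame(
  state_name = "California",
  county_name = "Kern",
  date_local = as.Date("1999-06-10"),
  arithmetic_mean = 0.005
)

# Columns that the rest of this notebook relies on.
required_cols <- c("state_name", "county_name", "date_local", "arithmetic_mean")

# Stop early with a readable message if any required column is absent.
missing_cols <- setdiff(required_cols, names(lead_toy))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```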

Load participant data

This is sample participant data. Each row has a de-identified participant_id, covariates (age, smoking_status, and dementia status), a location (county_name, state_name), and the date window over which to compute exposure (start_date, end_date). If you need several windows for one participant, give participants_df one row per window, so that it has as many rows as there are exposure lookups.

library(dplyr)
participants_df <- tibble::tibble(
  participant_id = 1:5,
  start_date = as.Date(c("1999-06-01", "1999-01-01", "1999-03-15", "1999-07-10", "1999-09-20")),
  end_date = as.Date(c("2000-12-31", "2000-12-31", "2000-09-30", "2000-05-20", "2000-06-15")),
  age = c(65, 72, 50, 60, 58),
  smoking_status = c("Never", "Former", "Current", "Never", "Former"),
  county_name = c("Kern", "Miami-Dade", "Broward", "Miami-Dade", "Broward"),
  state_name = c("California", "Florida", "Florida", "Florida", "Florida"),
  dementia = c(0, 1, 1, 0, 1)
)
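
As a sketch of the one-row-per-window layout described above, the tibble below repeats participant 1 on two rows, one for each calendar year. The `window` label column and the dates are hypothetical. Whether `summarise_exposure()` keeps the windows separate when `window` is added to `group_vars` is an assumption that this notebook does not confirm.

```r
# Hypothetical layout: participant 1 has two exposure windows, participant 2 has one.
windows_df <- tibble::tibble(
  participant_id = c(1, 1, 2),
  window = c("1999", "2000", "1999"),  # hypothetical label to tell windows apart
  start_date = as.Date(c("1999-01-01", "2000-01-01", "1999-01-01")),
  end_date = as.Date(c("1999-12-31", "2000-12-31", "1999-12-31")),
  county_name = c("Kern", "Kern", "Miami-Dade"),
  state_name = c("California", "California", "Florida")
)

# Count the exposure lookups (rows) per participant.
lookups_per_participant <- table(windows_df$participant_id)
```

Each row is then one exposure lookup, so the number of rows per participant equals the number of windows requested for that participant.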
# Manually compute the summary for participant 1 (Kern County) as a check
p1_test <- lead %>%
  dplyr::filter(county_name == "Kern") %>%
  dplyr::filter(date_local >= "1999-06-01" & date_local <= "2000-12-31") %>%
  dplyr::summarise(
    mean = mean(arithmetic_mean, na.rm = TRUE),
    median = median(arithmetic_mean, na.rm = TRUE),
    sd = sd(arithmetic_mean, na.rm = TRUE),
    n = n()
  )
knitr::kable(p1_test) %>% kableExtra::kable_paper()
mean       median  sd         n
0.0062301  0.005   0.0047977  113
exposure_df <- summarise_exposure(
  participants_df = participants_df,
  air_quality_df = lead,
  date_col = "date_local",
  pollutant_col = "arithmetic_mean",
  start_col = "start_date",
  end_col = "end_date",
  county_name = "county_name",
  state_name = "state_name",
  group_vars = c("participant_id", "age", "smoking_status", "dementia")
)
knitr::kable(exposure_df) %>% kableExtra::kable_paper()
participant_id  age  smoking_status  dementia  mean_exposure  median_exposure  sd_exposure  n_exposure_records
1               65   Never           0         0.0062301      0.005            0.0047977    113
3               50   Current         1         0.0245161      0.010            0.0455451    62
5               58   Former          1         0.0284848      0.010            0.0608852    33
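
To make the linkage logic concrete, here is a minimal dplyr sketch of one way such a summary could be computed: join participants to measurements on county and state, keep the records inside each participant's window, then summarise. It is not the package's implementation of `summarise_exposure()`, and all toy values are made up. With an inner join, a participant with no matching records drops out of the result, which would be consistent with participants 2 and 4 being absent from the table above, although that explanation is an assumption.

```r
library(dplyr)

# Toy participants (made-up values for illustration only).
toy_participants <- tibble::tibble(
  participant_id = c(1, 2),
  start_date = as.Date(c("1999-06-01", "1999-01-01")),
  end_date = as.Date(c("1999-06-30", "1999-12-31")),
  county_name = c("Kern", "Miami-Dade"),
  state_name = c("California", "Florida")
)

# Toy measurements: only Kern County has records, one of them before the window.
toy_air <- tibble::tibble(
  county_name = "Kern",
  state_name = "California",
  date_local = as.Date(c("1999-05-31", "1999-06-10", "1999-06-20")),
  arithmetic_mean = c(0.010, 0.004, 0.006)
)

toy_exposure <- toy_participants %>%
  # Pair each participant with every measurement from the same county and state.
  inner_join(toy_air, by = c("county_name", "state_name")) %>%
  # Keep only measurements inside the participant's exposure window.
  filter(date_local >= start_date, date_local <= end_date) %>%
  group_by(participant_id) %>%
  summarise(
    mean_exposure = mean(arithmetic_mean, na.rm = TRUE),
    median_exposure = median(arithmetic_mean, na.rm = TRUE),
    sd_exposure = sd(arithmetic_mean, na.rm = TRUE),
    n_exposure_records = n(),
    .groups = "drop"
  )
```

In this toy data, participant 1 keeps the two June records and participant 2 has no matching measurements, so the result has a single row.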

Summary

This notebook provides an orientation to how you can use tidypollute data and merge it with participant or health outcome data.

For more details, check out tidypollute documentation.