03: Linking to Health Outcomes/Patient Data

Goals of this Notebook

This notebook introduces tidypollute functions through a worked, functional example (pun intended).

The example downloads EPA AirData flat files for one analyte (pollutant), LEAD, over a given time range (1999-2000), and links the measurements to fake participant data.

library(tidypollute)
lead <- get_epa_airdata(
  analyte = "LEAD",
  start_year = 1999,
  end_year = 2000,
  freq = "daily"
)

## 
## Preparing to download:
## Analyte: LEAD 
## Years: 1999 to 2000 
## Number of files: 2 
## Freq of data: daily 
## Output directory: data/
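
Before linking, it can help to confirm that the downloaded data has the columns the rest of this notebook refers to (county_name, state_name, date_local, arithmetic_mean). The sketch below uses a made-up one-row stand-in for lead and a hypothetical helper, missing_columns(), which is not part of tidypollute. The column names are taken from the calls later in this notebook.

```r
# Columns the later steps in this notebook refer to
required_cols <- c("county_name", "state_name", "date_local", "arithmetic_mean")

# Hypothetical helper (not part of tidypollute):
# return the names in `cols` that are absent from `df`
missing_columns <- function(df, cols) {
  setdiff(cols, names(df))
}

# Stand-in for the downloaded data (made-up row); state_name is left out on purpose
toy_lead <- data.frame(
  county_name = "Kern",
  date_local = as.Date("1999-06-10"),
  arithmetic_mean = 0.005,
  stringsAsFactors = FALSE
)

missing_columns(toy_lead, required_cols)
```

With the real data you would pass lead instead of toy_lead; an empty result means every expected column is present.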

Load participant data

This is sample participant data. Each row has a de-identified participant_id, covariates (age, smoking_status, and dementia status), a location (county_name, state_name), and the dates that bound the exposure window (start_date, end_date). If you need several windows for one participant, give participants_df one row per window, that is, one row for each exposure lookup.

library(dplyr)
participants_df <- tibble::tibble(
  participant_id = 1:5,
  start_date = as.Date(c("1999-06-01", "1999-01-01", "1999-03-15", "1999-07-10", "1999-09-20")),
  end_date = as.Date(c("2000-12-31", "2000-12-31", "2000-09-30", "2000-05-20", "2000-06-15")),
  age = c(65, 72, 50, 60, 58),
  smoking_status = c("Never", "Former", "Current", "Never", "Former"),
  county_name = c("Kern", "Miami-Dade", "Broward", "Miami-Dade", "Broward"),
  state_name = c("California", "Florida", "Florida", "Florida", "Florida"),
  dementia = c(0, 1, 1, 0, 1)
)
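
As a sketch of the one-row-per-window layout described above, the data frame below repeats participant 1 for two separate windows. The window_label column and the dates are made up for illustration; tidypollute does not require that column. If summarise_exposure() groups by the columns named in group_vars, as its call in this notebook suggests, adding window_label there should keep the windows separate, but that is an assumption rather than documented behavior.

```r
# Hypothetical layout: participant 1 appears twice, once per exposure window.
# `window_label` is an illustrative column, not something tidypollute requires.
multi_window_df <- data.frame(
  participant_id = c(1, 1, 2),
  window_label   = c("first_year", "second_year", "full_period"),
  start_date     = as.Date(c("1999-01-01", "2000-01-01", "1999-01-01")),
  end_date       = as.Date(c("1999-12-31", "2000-12-31", "2000-12-31")),
  county_name    = c("Kern", "Kern", "Broward"),
  state_name     = c("California", "California", "Florida"),
  stringsAsFactors = FALSE
)

# Window length in days, counting both endpoints
multi_window_df$window_days <- as.integer(
  multi_window_df$end_date - multi_window_df$start_date
) + 1L

multi_window_df
```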

# Manual check for participant 1: summarise Kern County lead records
# that fall inside that participant's window (1999-06-01 to 2000-12-31)
p1_test <- lead %>%
  dplyr::filter(
    county_name == "Kern",
    date_local >= "1999-06-01" & date_local <= "2000-12-31"
  ) %>%
  dplyr::summarise(
    mean = mean(arithmetic_mean, na.rm = TRUE),
    median = median(arithmetic_mean, na.rm = TRUE),
    sd = sd(arithmetic_mean, na.rm = TRUE),
    n = n()
  )

knitr::kable(p1_test) %>% kableExtra::kable_paper()

mean	median	sd	n
0.0062301	0.005	0.0047977	113
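
The manual check above handles one participant. The sketch below repeats the same idea for every row of a participant table: match on county and state, keep records inside the window, then summarise. It uses base R and made-up toy values, and summarise_window() is a hypothetical helper. It illustrates the kind of computation involved and is not tidypollute's implementation of summarise_exposure().

```r
# Toy monitoring data (made-up values, not EPA data);
# column names mirror the ones used in this notebook
air <- data.frame(
  county_name     = c("Kern", "Kern", "Kern", "Broward", "Broward"),
  state_name      = c("California", "California", "California", "Florida", "Florida"),
  date_local      = as.Date(c("1999-05-15", "1999-06-10", "2000-01-20",
                              "1999-04-01", "2001-02-01")),
  arithmetic_mean = c(0.010, 0.004, 0.006, 0.020, 0.030),
  stringsAsFactors = FALSE
)

# Toy participants, one row per exposure window
people <- data.frame(
  participant_id = c(1, 3),
  county_name    = c("Kern", "Broward"),
  state_name     = c("California", "Florida"),
  start_date     = as.Date(c("1999-06-01", "1999-03-15")),
  end_date       = as.Date(c("2000-12-31", "2000-09-30")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarise one participant's window.
# Keeps records in the same county and state whose date lies in
# [start_date, end_date], then computes summary statistics.
summarise_window <- function(person, air_df) {
  in_window <- air_df$county_name == person$county_name &
    air_df$state_name == person$state_name &
    air_df$date_local >= person$start_date &
    air_df$date_local <= person$end_date
  values <- air_df$arithmetic_mean[in_window]
  data.frame(
    participant_id     = person$participant_id,
    mean_exposure      = mean(values, na.rm = TRUE),
    median_exposure    = median(values, na.rm = TRUE),
    sd_exposure        = sd(values, na.rm = TRUE),
    n_exposure_records = length(values)
  )
}

# Apply the helper to each participant row and stack the results
toy_exposure <- do.call(
  rbind,
  lapply(seq_len(nrow(people)), function(i) summarise_window(people[i, ], air))
)

toy_exposure
```

In this toy data, participant 1 keeps two Kern records and participant 3 keeps one Broward record, so sd_exposure for participant 3 is NA (a standard deviation needs at least two values).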

exposure_df <- summarise_exposure(
  participants_df = participants_df,
  air_quality_df = lead,
  date_col = "date_local",
  pollutant_col = "arithmetic_mean",
  start_col = "start_date",
  end_col = "end_date",
  county_name = "county_name",
  state_name = "state_name",
  group_vars = c("participant_id", "age", "smoking_status", "dementia")
)

knitr::kable(exposure_df) %>% kableExtra::kable_paper()

participant_id	age	smoking_status	dementia	mean_exposure	median_exposure	sd_exposure	n_exposure_records
1	65	Never	0	0.0062301	0.005	0.0047977	113
3	50	Current	1	0.0245161	0.010	0.0455451	62
5	58	Former	1	0.0284848	0.010	0.0608852	33
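
Two things are worth noting in this output. Participant 1's row matches the manual Kern County check (mean 0.0062301, n = 113). Also, only participants 1, 3, and 5 appear; participants 2 and 4, both in Miami-Dade, are absent, presumably because no lead records matched their county and window (an inference from the output, not something it states). A quick way to list dropped participants, sketched here with the IDs copied from this notebook:

```r
# IDs supplied in participants_df and IDs present in the exposure_df output
supplied_ids <- 1:5
returned_ids <- c(1, 3, 5)

# Participants with no row in the output
dropped_ids <- setdiff(supplied_ids, returned_ids)
dropped_ids
```

With the real objects, the same comparison would use participants_df$participant_id and exposure_df$participant_id.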

Summary

This notebook shows how you can take tidypollute data and merge it with participant/health outcome data.

For more details, check out the tidypollute documentation.

Dr. Nelson Roque
