
03: Linking to Health Outcomes/Patient Data
Dr. Nelson Roque
03-link-health-outcomes.Rmd
Goals of this Notebook
This notebook orients you to tidypollute functions through a functional example (pun intended). It downloads data from EPA AirData flat files for one analyte (pollutant), LEAD, over a given time range (1999-2000), and links it to fake participant data.
library(tidypollute)
lead <- get_epa_airdata(
  analyte = "LEAD",
  start_year = 1999,
  end_year = 2000,
  freq = "daily"
)
##
## Preparing to download:
## Analyte: LEAD
## Years: 1999 to 2000
## Number of files: 2
## Freq of data: daily
## Output directory: data/
Load participant data
This is sample participant data. Each row has a de-identified participant_id, other variables (age, smoking_status, and dementia status), a location (county_name, state_name), and the dates that bound the window over which to compute exposures (start_date, end_date). If you need to compute many windows for a given participant, give participants_df one row per window, that is, as many rows as you need data lookups for computing exposures.
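As a minimal sketch of the one-row-per-window layout, the block below expands a baseline table into several look-back windows per participant. The names baseline, windows, visit_date, window_label, and days_back are hypothetical and not part of tidypollute, and dplyr::cross_join() assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical baseline table: one row per participant
baseline <- tibble::tibble(
  participant_id = 1:2,
  county_name = c("Kern", "Broward"),
  state_name = c("California", "Florida"),
  visit_date = as.Date(c("2000-06-30", "2000-09-30"))
)

# Hypothetical look-back windows, in days before the visit
windows <- tibble::tibble(
  window_label = c("30d", "365d"),
  days_back = c(30, 365)
)

# One row per participant per window, with explicit start and end dates
participants_long <- baseline %>%
  cross_join(windows) %>%
  mutate(
    start_date = visit_date - days_back,
    end_date = visit_date
  )

participants_long
```

The result has two rows per participant, each carrying its own start_date and end_date, which is the shape described above.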
library(dplyr)
participants_df <- tibble::tibble(
  participant_id = 1:5,
  start_date = as.Date(c("1999-06-01", "1999-01-01", "1999-03-15", "1999-07-10", "1999-09-20")),
  end_date = as.Date(c("2000-12-31", "2000-12-31", "2000-09-30", "2000-05-20", "2000-06-15")),
  age = c(65, 72, 50, 60, 58),
  smoking_status = c("Never", "Former", "Current", "Never", "Former"),
  county_name = c("Kern", "Miami-Dade", "Broward", "Miami-Dade", "Broward"),
  state_name = c("California", "Florida", "Florida", "Florida", "Florida"),
  dementia = c(0, 1, 1, 0, 1)
)
# Manually compute exposure for participant 1 (Kern County)
p1_test <- lead %>%
  dplyr::filter(county_name == "Kern") %>%
  dplyr::filter(date_local >= "1999-06-01" & date_local <= "2000-12-31") %>%
  dplyr::summarise(
    mean = mean(arithmetic_mean, na.rm = TRUE),
    median = median(arithmetic_mean, na.rm = TRUE),
    sd = sd(arithmetic_mean, na.rm = TRUE),
    n = n()
  )
knitr::kable(p1_test) %>% kableExtra::kable_paper()
mean | median | sd | n |
---|---|---|---|
0.0062301 | 0.005 | 0.0047977 | 113 |
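The manual check for one participant extends to every participant with a join on location followed by a filter on each participant's own window. The block below is a minimal sketch on made-up toy data (toy_participants, toy_air, and their values are hypothetical). It illustrates the general technique only and is not a description of how tidypollute implements its own summary function.

```r
library(dplyr)

# Toy stand-ins for the participant and air quality tables (values are made up)
toy_participants <- tibble::tibble(
  participant_id = 1:2,
  county_name = c("Kern", "Broward"),
  state_name = c("California", "Florida"),
  start_date = as.Date(c("1999-06-01", "1999-03-15")),
  end_date = as.Date(c("1999-06-30", "1999-03-31"))
)

toy_air <- tibble::tibble(
  county_name = c("Kern", "Kern", "Kern", "Broward"),
  state_name = c("California", "California", "California", "Florida"),
  date_local = as.Date(c("1999-05-31", "1999-06-01", "1999-06-15", "1999-03-20")),
  arithmetic_mean = c(0.9, 0.01, 0.03, 0.02)
)

toy_exposure <- toy_participants %>%
  # Attach every monitor record from the participant's county and state
  inner_join(toy_air, by = c("county_name", "state_name")) %>%
  # Keep only records inside that participant's exposure window
  filter(date_local >= start_date, date_local <= end_date) %>%
  group_by(participant_id) %>%
  summarise(
    mean_exposure = mean(arithmetic_mean, na.rm = TRUE),
    n_exposure_records = n(),
    .groups = "drop"
  )

toy_exposure
```

The Kern record dated 1999-05-31 falls one day before participant 1's window, so it is excluded from that participant's mean.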
exposure_df <- summarise_exposure(
  participants_df = participants_df,
  air_quality_df = lead,
  date_col = "date_local",
  pollutant_col = "arithmetic_mean",
  start_col = "start_date",
  end_col = "end_date",
  county_name = "county_name",
  state_name = "state_name",
  group_vars = c("participant_id", "age", "smoking_status", "dementia")
)
knitr::kable(exposure_df) %>% kableExtra::kable_paper()
participant_id | age | smoking_status | dementia | mean_exposure | median_exposure | sd_exposure | n_exposure_records |
---|---|---|---|---|---|---|---|
1 | 65 | Never | 0 | 0.0062301 | 0.005 | 0.0047977 | 113 |
3 | 50 | Current | 1 | 0.0245161 | 0.010 | 0.0455451 | 62 |
5 | 58 | Former | 1 | 0.0284848 | 0.010 | 0.0608852 | 33 |
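The table above has rows for participants 1, 3, and 5 only. Participants 2 and 4, both in Miami-Dade, are absent, presumably because no LEAD records matched their county and window; that reading is an assumption, since the notebook does not say. A minimal sketch of how such gaps could be listed with an anti-join, using toy tables that mirror only the id columns shown above (toy_participants and toy_exposure are hypothetical names):

```r
library(dplyr)

# Toy stand-ins mirroring the ids in the tables above
toy_participants <- tibble::tibble(
  participant_id = 1:5,
  county_name = c("Kern", "Miami-Dade", "Broward", "Miami-Dade", "Broward")
)
toy_exposure <- tibble::tibble(participant_id = c(1L, 3L, 5L))

# Participants with no row in the exposure summary
missing_exposure <- anti_join(toy_participants, toy_exposure, by = "participant_id")

missing_exposure
```

Running the same anti-join on participants_df and exposure_df before modelling makes it explicit which participants would otherwise drop out silently.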
Summary
This notebook showed how to merge tidypollute data with participant and health outcome data. For more details, check out the tidypollute documentation.