Ch. 8 Data Wrangling and Visualization of Big Data

8.1 Overview

For this example, we are working with the Google Community Mobility Reports dataset.

“These Community Mobility Reports aim to provide insights into what has changed in response to policies aimed at combating COVID-19. The reports chart movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.”

Learn More

8.2 Load libs/packages

8.3 Read in data

# read in the Google Mobility dataset
google_mobility <- read_csv("https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv")
## Rows: 10392488 Columns: 15
## ── Column specification ─────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): country_region_code, country_region, sub_region_1, sub_region_2, m...
## dbl  (6): retail_and_recreation_percent_change_from_baseline, grocery_and_ph...
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

8.4 Inspect dataset features

nrow(google_mobility)
## [1] 10392488
ncol(google_mobility)
## [1] 15

8.5 Create lookup tables

R, as well as some packages, have built in datasets. They can be called with the data() function (e.g., data(state)). We will leverage a built in dataset with state metadata.

data(state)
state_lookup = tibble(sub_region_1 = state.name,
                      us_division = state.division,
                      us_region = state.region,
                      state_land_area = state.area)

# for more state data, look at the following data object (data from 1970s)
state.x77
##                Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama              3615   3624        2.1    69.05   15.1    41.3    20
## Alaska                365   6315        1.5    69.31   11.3    66.7   152
## Arizona              2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas             2110   3378        1.9    70.66   10.1    39.9    65
## California          21198   5114        1.1    71.71   10.3    62.6    20
## Colorado             2541   4884        0.7    72.06    6.8    63.9   166
## Connecticut          3100   5348        1.1    72.48    3.1    56.0   139
## Delaware              579   4809        0.9    70.06    6.2    54.6   103
## Florida              8277   4815        1.3    70.66   10.7    52.6    11
## Georgia              4931   4091        2.0    68.54   13.9    40.6    60
## Hawaii                868   4963        1.9    73.60    6.2    61.9     0
## Idaho                 813   4119        0.6    71.87    5.3    59.5   126
## Illinois            11197   5107        0.9    70.14   10.3    52.6   127
## Indiana              5313   4458        0.7    70.88    7.1    52.9   122
## Iowa                 2861   4628        0.5    72.56    2.3    59.0   140
## Kansas               2280   4669        0.6    72.58    4.5    59.9   114
## Kentucky             3387   3712        1.6    70.10   10.6    38.5    95
## Louisiana            3806   3545        2.8    68.76   13.2    42.2    12
## Maine                1058   3694        0.7    70.39    2.7    54.7   161
## Maryland             4122   5299        0.9    70.22    8.5    52.3   101
## Massachusetts        5814   4755        1.1    71.83    3.3    58.5   103
## Michigan             9111   4751        0.9    70.63   11.1    52.8   125
## Minnesota            3921   4675        0.6    72.96    2.3    57.6   160
## Mississippi          2341   3098        2.4    68.09   12.5    41.0    50
## Missouri             4767   4254        0.8    70.69    9.3    48.8   108
## Montana               746   4347        0.6    70.56    5.0    59.2   155
## Nebraska             1544   4508        0.6    72.60    2.9    59.3   139
## Nevada                590   5149        0.5    69.03   11.5    65.2   188
## New Hampshire         812   4281        0.7    71.23    3.3    57.6   174
## New Jersey           7333   5237        1.1    70.93    5.2    52.5   115
## New Mexico           1144   3601        2.2    70.32    9.7    55.2   120
## New York            18076   4903        1.4    70.55   10.9    52.7    82
## North Carolina       5441   3875        1.8    69.21   11.1    38.5    80
## North Dakota          637   5087        0.8    72.78    1.4    50.3   186
## Ohio                10735   4561        0.8    70.82    7.4    53.2   124
## Oklahoma             2715   3983        1.1    71.42    6.4    51.6    82
## Oregon               2284   4660        0.6    72.13    4.2    60.0    44
## Pennsylvania        11860   4449        1.0    70.43    6.1    50.2   126
## Rhode Island          931   4558        1.3    71.90    2.4    46.4   127
## South Carolina       2816   3635        2.3    67.96   11.6    37.8    65
## South Dakota          681   4167        0.5    72.08    1.7    53.3   172
## Tennessee            4173   3821        1.7    70.11   11.0    41.8    70
## Texas               12237   4188        2.2    70.90   12.2    47.4    35
## Utah                 1203   4022        0.6    72.90    4.5    67.3   137
## Vermont               472   3907        0.6    71.64    5.5    57.1   168
## Virginia             4981   4701        1.4    70.08    9.5    47.8    85
## Washington           3559   4864        0.6    71.72    4.3    63.5    32
## West Virginia        1799   3617        1.4    69.48    6.7    41.6   100
## Wisconsin            4589   4468        0.7    72.48    3.0    54.5   149
## Wyoming               376   4566        0.6    70.29    6.9    62.9   173
##                  Area
## Alabama         50708
## Alaska         566432
## Arizona        113417
## Arkansas        51945
## California     156361
## Colorado       103766
## Connecticut      4862
## Delaware         1982
## Florida         54090
## Georgia         58073
## Hawaii           6425
## Idaho           82677
## Illinois        55748
## Indiana         36097
## Iowa            55941
## Kansas          81787
## Kentucky        39650
## Louisiana       44930
## Maine           30920
## Maryland         9891
## Massachusetts    7826
## Michigan        56817
## Minnesota       79289
## Mississippi     47296
## Missouri        68995
## Montana        145587
## Nebraska        76483
## Nevada         109889
## New Hampshire    9027
## New Jersey       7521
## New Mexico     121412
## New York        47831
## North Carolina  48798
## North Dakota    69273
## Ohio            40975
## Oklahoma        68782
## Oregon          96184
## Pennsylvania    44966
## Rhode Island     1049
## South Carolina  30225
## South Dakota    75955
## Tennessee       41328
## Texas          262134
## Utah            82096
## Vermont          9267
## Virginia        39780
## Washington      66570
## West Virginia   24070
## Wisconsin       54464
## Wyoming         97203

8.6 Pre-process data

  • filter data
  • add meta-date features (month, year, day, etc)
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

# filter for US
google_mobility_US <- google_mobility %>%
  filter(country_region == "United States") %>%
  mutate(dt_date = date) %>%
  mutate(dt_month = lubridate::month(dt_date),
         dt_month_ = lubridate::month(dt_date, label=T),
         dt_year= lubridate::year(dt_date),
         dt_day = lubridate::day(dt_date),
         dt_wkday_l = lubridate::wday(dt_date, label=T),
         dt_wkday_n = lubridate::wday(dt_date)) %>%
  mutate(dt_weekend = ifelse(dt_wkday_l %in% c("Sat", "Sun"), TRUE, FALSE))

8.7 Joining datasets

  • Merge Google Mobility dataset with R data for US states
google_mobility_US_w_lu <- google_mobility_US %>%
  inner_join(state_lookup)
## Joining, by = "sub_region_1"

8.8 Quick descriptives

  • single line of code descriptives across a dataset
#skimr::skim(google_mobility_US)

8.9 Filter data

  • filter with specific entries
  • filter for a collection of entries
# filter with specific entries
google_mobility_US_NY <- google_mobility_US %>%
  filter(sub_region_1 == "New York")

google_mobility_US_FL <- google_mobility_US %>%
  filter(sub_region_1 == "Florida")

# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

# filter to a 'vector' of entries
google_mobility_US_targeted <- google_mobility_US %>%
  filter(sub_region_1 %in% c("California", "New York", "Florida"))

8.10 Compute aggregates

# base R approach
table(google_mobility_US$sub_region_1)
## 
##              Alabama               Alaska              Arizona 
##                54866                 8699                13531 
##             Arkansas           California             Colorado 
##                53846                48318                38559 
##          Connecticut             Delaware District of Columbia 
##                 7767                 3452                  863 
##              Florida              Georgia               Hawaii 
##                56781               116246                 4315 
##                Idaho             Illinois              Indiana 
##                28010                76376                76459 
##                 Iowa               Kansas             Kentucky 
##                71643                46353                80123 
##            Louisiana                Maine             Maryland 
##                49323                14345                21416 
##        Massachusetts             Michigan            Minnesota 
##                12505                65941                63831 
##          Mississippi             Missouri              Montana 
##                58235                80247                22623 
##             Nebraska               Nevada        New Hampshire 
##                38138                10732                 9468 
##           New Jersey           New Mexico             New York 
##                18978                24916                53244 
##       North Carolina         North Dakota                 Ohio 
##                82033                16312                75182 
##             Oklahoma               Oregon         Pennsylvania 
##                53005                28646                56170 
##         Rhode Island       South Carolina         South Dakota 
##                 5169                39446                21719 
##            Tennessee                Texas                 Utah 
##                74908               163937                19814 
##              Vermont             Virginia           Washington 
##                11754               103423                30440 
##        West Virginia            Wisconsin              Wyoming 
##                36130                57815                16418
# number of records by state
records_per_state = google_mobility_US %>%
  group_by(sub_region_1) %>%
  summarise(n_records = n()) %>%
  arrange(sub_region_1)
records_per_state
## # A tibble: 52 × 2
##    sub_region_1         n_records
##    <chr>                    <int>
##  1 Alabama                  54866
##  2 Alaska                    8699
##  3 Arizona                  13531
##  4 Arkansas                 53846
##  5 California               48318
##  6 Colorado                 38559
##  7 Connecticut               7767
##  8 Delaware                  3452
##  9 District of Columbia       863
## 10 Florida                  56781
## # … with 42 more rows
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

# group data by weekday
metrics_month_weekend_US <- google_mobility_US %>%
  group_by(sub_region_1, dt_year, dt_month_, dt_wkday_l) %>%
  summarise(avg_change_parks = mean(parks_percent_change_from_baseline, na.rm=T),
            avg_change_retail_rec = mean(retail_and_recreation_percent_change_from_baseline, na.rm=T),
            avg_change_workplaces = mean(workplaces_percent_change_from_baseline, na.rm=T)) %>%
  inner_join(state_lookup)
## `summarise()` has grouped output by 'sub_region_1', 'dt_year', 'dt_month_'. You can
## override using the `.groups` argument.
## Joining, by = "sub_region_1"

8.11 Visualize the data

# plot summary data across all states
allUS_retail = ggplot(metrics_month_weekend_US %>% filter(us_region %in% c("South", "West")), aes(sub_region_1, avg_change_retail_rec)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=90)) +
  facet_wrap(dt_year~.) +
  labs(x="Month", y="Retail (% Change)")

allUS_parks = ggplot(metrics_month_weekend_US %>% filter(us_region %in% c("South", "West")), aes(sub_region_1, avg_change_parks)) +
  geom_boxplot() +
  theme_minimal() +
  facet_wrap(dt_year~.) +
  theme(axis.text.x=element_text(angle=90)) +
  labs(x="Month", y="Parks (% Change)")

allUS_retail
allUS_parks
jitter_retail <- ggplot(metrics_month_weekend_US, aes(dt_month_, 
                                     avg_change_retail_rec, 
                                     group=us_region,
                                     color=us_region)) +
  #geom_point() +
  geom_jitter() +
  theme_minimal() +
  labs(x="Month", y="Retail & Recreation (% Change)")

smooth_retail = ggplot(metrics_month_weekend_US, aes(dt_month_, 
                                     avg_change_retail_rec, 
                                     group=us_region,
                                     color=us_region)) +
  #geom_point() +
  #geom_jitter() +
  geom_smooth() +
  theme_minimal() +
  labs(x="Month", y="Retail & Recreation (% Change)")

smooth_parks = ggplot(metrics_month_weekend_US, aes(dt_month_, 
                                     avg_change_parks, 
                                     group=us_region,
                                     color=us_region)) +
  #geom_point() +
  #geom_jitter() +
  geom_smooth() +
  theme_minimal() +
  labs(x="Month", y="Parks (% Change)")

jitter_retail
smooth_retail
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
smooth_parks
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Click here to learn more about geom_smooth

8.12 Combining multiple plots

library(cowplot)

# create a plot of weekly trends
early_covid_retail = ggplot(google_mobility_US_targeted %>%
         filter(dt_month %in% c(1,2,3,4,5)) %>%
         filter(dt_year == 2020), aes(dt_date, 
                                    retail_and_recreation_percent_change_from_baseline,
                                    color=sub_region_1)) +
  #geom_point() +
  geom_jitter() +
  theme_minimal() +
  facet_wrap(.~sub_region_1) +
  labs(x="Month", y="Retal & Rec (% Change)", title="2020")

early_covid_parks = ggplot(google_mobility_US_targeted %>%
         filter(dt_month %in% c(1,2,3,4,5)) %>%
         filter(dt_year == 2020), aes(dt_date, 
                                    parks_percent_change_from_baseline,
                                    color=sub_region_1)) +
  #geom_point() +
  geom_jitter() +
  theme_minimal() +
  facet_wrap(.~sub_region_1) +
  labs(x="Month", y="Parks (% Change)", title="2020")

nxtyear_covid_retail = ggplot(google_mobility_US_targeted %>%
         filter(dt_month %in% c(1,2,3,4,5)) %>%
         filter(dt_year == 2021), aes(dt_date, 
                                    retail_and_recreation_percent_change_from_baseline,
                                    color=sub_region_1)) +
  #geom_point() +
  geom_jitter() +
  theme_minimal() +
  facet_wrap(.~sub_region_1) +
  labs(x="Month", y="Retal & Rec (% Change)", title="2021")
cowplot::plot_grid(early_covid_retail, early_covid_parks, ncol=1)
## Warning: Removed 2050 rows containing missing values (geom_point).
## Warning: Removed 7320 rows containing missing values (geom_point).
cowplot::plot_grid(early_covid_retail, nxtyear_covid_retail, ncol=1)
## Warning: Removed 2050 rows containing missing values (geom_point).
## Warning: Removed 2915 rows containing missing values (geom_point).
cowplot::plot_grid(allUS_retail, allUS_parks, ncol=1)
cowplot::plot_grid(smooth_retail, smooth_parks, ncol=1)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

8.13 Filter dataset for next plots

# Variable names -----
#   retail_and_recreation_percent_change_from_baseline = col_double(),
#   grocery_and_pharmacy_percent_change_from_baseline = col_double(),
#   parks_percent_change_from_baseline = col_double(),
#   transit_stations_percent_change_from_baseline = col_double(),
#   workplaces_percent_change_from_baseline = col_double(),
#   residential_percent_change_from_baseline = col_double()


early_covid_period_data = google_mobility_US_w_lu %>%
         filter(dt_month %in% c(1,2,3,4,5,6)) %>%
         filter(dt_year == 2020) %>%
         filter(sub_region_1 %in% c("New York", "California", "Florida", "Wyoming", "Texas", "Arizona","Montana", "South Dakota", "Nevada"))

8.14 Styling your plots (theme_*)

ggplot(early_covid_period_data, aes(dt_date, 
                                    parks_percent_change_from_baseline,
                                    color=sub_region_1)) +
  #geom_point() +
  theme_void() +
  geom_jitter() +
  facet_wrap(.~sub_region_1) +
  theme(legend.position = "none")
## Warning: Removed 40493 rows containing missing values (geom_point).
ggplot(early_covid_period_data, aes(dt_date, 
                                    parks_percent_change_from_baseline,
                                    color=sub_region_1)) +
  geom_jitter() +
  facet_wrap(.~sub_region_1)
## Warning: Removed 40493 rows containing missing values (geom_point).
ggplot(early_covid_period_data, aes(dt_date, 
                                    parks_percent_change_from_baseline,
                                    color=sub_region_1)) +
  #geom_point() +
  theme_bw() +
  geom_jitter() +
  facet_wrap(.~sub_region_1)
## Warning: Removed 40493 rows containing missing values (geom_point).
ggplot(early_covid_period_data, aes(dt_date, 
                                    parks_percent_change_from_baseline,
                                    color=sub_region_1)) +
  #geom_point() +
  theme_dark() +
  geom_jitter() +
  facet_wrap(.~sub_region_1)
## Warning: Removed 40493 rows containing missing values (geom_point).

8.15 Production plots

ggplot(early_covid_period_data, aes(dt_date, 
                                    parks_percent_change_from_baseline,
                                    color=us_region)) +
  #geom_point() +
  theme_bw() +
  geom_jitter() +
  facet_wrap(.~sub_region_1) +
  labs(x= "Date", y="Retail & Recreation (% Change)")
## Warning: Removed 40493 rows containing missing values (geom_point).
ggplot(early_covid_period_data, aes(dt_date, 
                                    parks_percent_change_from_baseline,
                                    color=us_region)) +
  #geom_point() +
  geom_jitter() +
  facet_wrap(.~sub_region_1) +
  labs(x= "Date", y="Retail & Recreation (% Change)") +
  theme_bw(base_size = 16) +
  theme(axis.text.x = element_text(angle=90))
## Warning: Removed 40493 rows containing missing values (geom_point).
fig1 <- ggplot(early_covid_period_data, aes(dt_date, 
                                    retail_and_recreation_percent_change_from_baseline,
                                    color=us_region)) +
  #geom_point() +
  #geom_jitter() +
  geom_smooth() +
  facet_wrap(.~sub_region_1) +
  labs(x= "Date", y="Retail & Recreation (% Change)",
       #color="State",
       tag = "Figure 1",
       title = "Google Mobility: Change over Time",
       subtitle = "i.e., Retail & Recreation, early COVID-19 period (01/20 - 06/20)",
       caption = "Google Mobility: Retail & Recreation (% Change) over early COVID period (01/20 - 06/20)") +
  theme_bw(base_size = 12, base_family = "Arial")

fig2 <- ggplot(early_covid_period_data, aes(dt_date, 
                                    parks_percent_change_from_baseline,
                                    color=us_region)) +
  #geom_point() +
  #geom_jitter() +
  geom_smooth() +
  facet_wrap(.~sub_region_1) +
  labs(x= "Date", y="Retail & Recreation (% Change)",
       #color="State",
       tag = "Figure 2",
       title = "Google Mobility: Change over Time",
       subtitle = "i.e., Parks, early COVID-19 period (01/20 - 06/20)",
       caption = "Google Mobility: Parks (% Change) over early COVID period (01/20 - 06/20)") +
  theme_bw(base_size = 12, base_family = "Arial")
fig1
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 16583 rows containing non-finite values (stat_smooth).
fig2
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 40493 rows containing non-finite values (stat_smooth).

8.16 Evaluate regions of most change

largest_changes_parks_long <- google_mobility_US %>%
  group_by(sub_region_1, dt_year) %>%
  summarise(avg_change_parks = mean(parks_percent_change_from_baseline, na.rm=T)) %>%
  arrange(sub_region_1)
## `summarise()` has grouped output by 'sub_region_1'. You can override using the `.groups`
## argument.
largest_changes_parks_wide <- largest_changes_parks_long %>%
  #pivot_wider(id_cols=c())
  pivot_wider(names_from = dt_year, values_from = avg_change_parks) %>%
  mutate(change_2021_2020 = `2021` - `2020`)

top10_changes <- largest_changes_parks_wide %>%
  arrange(-change_2021_2020) %>%
  head(n=10)

bottom10_changes <- largest_changes_parks_wide %>%
  arrange(change_2021_2020) %>%
  head(n=10)

largest_changes_parks_long
## # A tibble: 156 × 3
## # Groups:   sub_region_1 [52]
##    sub_region_1 dt_year avg_change_parks
##    <chr>          <dbl>            <dbl>
##  1 Alabama         2020           17.9  
##  2 Alabama         2021            7.88 
##  3 Alabama         2022           18.0  
##  4 Alaska          2020           52.9  
##  5 Alaska          2021           59.8  
##  6 Alaska          2022           62.5  
##  7 Arizona         2020           -1.97 
##  8 Arizona         2021           -3.51 
##  9 Arizona         2022           -0.316
## 10 Arkansas        2020           48.2  
## # … with 146 more rows
largest_changes_parks_wide
## # A tibble: 52 × 5
## # Groups:   sub_region_1 [52]
##    sub_region_1         `2020` `2021`  `2022` change_2021_2020
##    <chr>                 <dbl>  <dbl>   <dbl>            <dbl>
##  1 Alabama               17.9    7.88  18.0             -10.0 
##  2 Alaska                52.9   59.8   62.5               6.90
##  3 Arizona               -1.97  -3.51  -0.316            -1.54
##  4 Arkansas              48.2   44.6   46.8              -3.54
##  5 California             6.17   7.84   8.79              1.67
##  6 Colorado              23.2   14.6   -1.69             -8.60
##  7 Connecticut           51.2   44.6   36.4              -6.62
##  8 Delaware              46.8   52.1   31.5               5.30
##  9 District of Columbia -34.8  -17.1  -14.9              17.7 
## 10 Florida              -14.7   -8.25   3.72              6.49
## # … with 42 more rows
top10_changes
## # A tibble: 10 × 5
## # Groups:   sub_region_1 [10]
##    sub_region_1         `2020` `2021` `2022` change_2021_2020
##    <chr>                 <dbl>  <dbl>  <dbl>            <dbl>
##  1 Hawaii                -46.5 -27.8  -16.8             18.6 
##  2 District of Columbia  -34.8 -17.1  -14.9             17.7 
##  3 Rhode Island           69.2  82.3   65.4             13.1 
##  4 South Carolina         23.2  34.7   46.3             11.4 
##  5 Maine                  59.0  70.0   48.3             10.9 
##  6 Vermont                52.8  63.0   60.9             10.2 
##  7 Alaska                 52.9  59.8   62.5              6.90
##  8 Florida               -14.7  -8.25   3.72             6.49
##  9 Tennessee              33.6  39.3   35.6              5.70
## 10 Delaware               46.8  52.1   31.5              5.30
bottom10_changes
## # A tibble: 10 × 5
## # Groups:   sub_region_1 [10]
##    sub_region_1  `2020` `2021` `2022` change_2021_2020
##    <chr>          <dbl>  <dbl>  <dbl>            <dbl>
##  1 Massachusetts  59.3   44.6   28.5            -14.7 
##  2 Utah           61.5   47.6   32.8            -14.0 
##  3 Nebraska       64.3   51.0   38.7            -13.3 
##  4 Kansas         67.3   54.5   44.9            -12.8 
##  5 Mississippi     4.92  -7.07  -6.41           -12.0 
##  6 Alabama        17.9    7.88  18.0            -10.0 
##  7 West Virginia   3.08  -6.73  -2.22            -9.81
##  8 Ohio           67.7   58.5   38.5             -9.21
##  9 Colorado       23.2   14.6   -1.69            -8.60
## 10 Iowa           63.1   55.1   40.9             -8.00