class: inverse, center, middle

# 36-315: Statistical Graphics and Visualization

## Lecture 2

Meghan Hall <br>
Department of Statistics & Data Science <br>
Carnegie Mellon University <br>
May 24, 2021

---
layout: true

<div class="my-footer"><span>cmu-36315.netlify.app</span></div>

---

# From last time

<br>

.large[Syllabus overview]

<br>

.medium[Email me with any questions]

<br>

.large[Why do we visualize data?]

<br>

.medium[Explore, diagnose, explain]

---

# Updates

<br>

.large[Alternate lab]

<br>

.medium[8:00pm-9:20pm EDT, Tuesday/Thursday]

<br>

.large[Office hours]

<br>

.medium[Listed on Canvas, start Friday!]

---

# Today

<br>

.large[The grammar of graphics]

<br>

.medium[How graphics are constructed in R]

<br>

.large[Tidyverse principles]

<br>

.medium[For any necessary data manipulation]

---

# ggplot

<br> <br> <br>

.large[What exactly is the **g**rammar of **g**raphics?]

<br> <br>

.medium["the whole system and structure of a language"]

---

# ggplot

<br>

.large[A graphic...]

<br> <br>

.medium[maps the **data** +]

<br> <br>

.medium[to the **aesthetic attributes** +]

<br> <br>

.medium[of **geometric points** +]

<br> <br>

--

.medium[*with possible **statistical transformations** +*]

<br> <br>

.medium[*different **coordinate systems** +*]

<br> <br>

.medium[*and **faceting** *]

---

# ggplot

<br> <br> <br>

.huge.center[Data **+** Mapping]

---

# How do we "map" data?

<br>

.medium[Encoding data into visual cues dictates how we signify *changes* and *comparisons*]

<br> <br>

.medium[A non-exhaustive list:]

* length
* color (or saturation)
* position
* size
* shape
* area
* volume
* angle

---
class: center

# .left[Today's data]

<br> <br>

.pull-left[
<img src="figs/Lec2/penguinshex.png" width="60%"/>
]

.pull-right[
<img src="figs/Lec2/penguins.png" width="100%"/>
]

<br> <br>

.right[*Artwork by @allison_horst*]

---

# How are we encoding this?

<img src="figs/Lec2/color-1.png" width="504" style="display: block; margin: auto;" />

---

# How are we encoding this?

<img src="figs/Lec2/length-1.png" width="504" style="display: block; margin: auto;" />

---

# How are we encoding this?

<img src="figs/Lec2/length-color-1.png" width="504" style="display: block; margin: auto;" />

--

.right.large[🤔]

---

# How are we encoding this?

<img src="figs/Lec2/size-1.png" width="504" style="display: block; margin: auto;" />

--

.right.large[🤔]

---

# How are we encoding this?

<img src="figs/Lec2/pie-1.png" width="504" style="display: block; margin: auto;" />

--

.right.large[🤔]

---

# ggplot

<br> <br> <br>

.huge.center[Data **+** Mapping]

--

1. **layer** <br> <br>
2. scale <br> <br>
3. coord <br> <br>
4. facet <br> <br>
5. theme

---

# Mapping components: layer

<br>

.large[**geom**]

<br>

.medium[Geometric elements (bars, lines, points, etc.)]

--

<br>

.large[**stat**]

<br>

.medium[Statistical transformations (summarizing data)]

---

# Mapping components: layer

```r
penguins %>% 
  ggplot(aes(x = species))
```

--

<img src="figs/Lec2/layer-1-1.png" width="504" style="display: block; margin: auto;" />

---

# Mapping components: layer

```r
penguins %>% 
  ggplot(aes(x = species)) +
  geom_bar()
```

--

<img src="figs/Lec2/layer-2-1.png" width="504" style="display: block; margin: auto;" />

---

# Mapping components: layer

```r
penguins %>% 
* group_by(species) %>% 
* summarize(mass_mean = mean(body_mass_g, na.rm = TRUE)) %>% 
  ggplot(aes(x = species, y = mass_mean)) +
  # what happens if you don't include the stat argument?
  geom_bar(stat = "identity")
```

--

<img src="figs/Lec2/layer-3-1.png" width="504" style="display: block; margin: auto;" />

---

# Mapping components: layer

```r
penguins %>% 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()
```

--

<img src="figs/Lec2/layer-4-1.png" width="504" style="display: block; margin: auto;" />

---

# Mapping components: layer

```r
penguins %>% 
  ggplot(aes(x = sex, y = body_mass_g)) +
  geom_boxplot()
```

--

<img src="figs/Lec2/layer-5-1.png" width="504" style="display: block; margin: auto;" />

---

# Mapping components: layer

```r
penguins %>% 
  ggplot(aes(x = body_mass_g)) +
  geom_histogram()
```

--

<img src="figs/Lec2/layer-6-1.png" width="504" style="display: block; margin: auto;" />

---

# ggplot

<br> <br> <br>

.huge.center[Data **+** Mapping]

1. layer <br> <br>
2. **scale** <br> <br>
3. coord <br> <br>
4. facet <br> <br>
5. theme

---

# Mapping components: scale

<br>

.large[data ➡️ aesthetics]

<br>

.medium[shape, color, size, etc.]

--

<br>

.large[how you interpret the plot]

<br>

.medium[scales and legends]

---

# Mapping components: scale

```r
penguins %>% 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()
```

<img src="figs/Lec2/scale-1-1.png" width="504" style="display: block; margin: auto;" />

---

# Mapping components: scale

```r
penguins %>% 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(color = "blue", size = 2)
```

--

<img src="figs/Lec2/scale-2-1.png" width="504" style="display: block; margin: auto;" />

---

# Mapping components: scale

```r
penguins %>% 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species), size = 2)
```

--

<img src="figs/Lec2/scale-3-1.png" width="504" style="display: block; margin: auto;" />

---

# Mapping components: scale

```r
penguins %>% 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species), size = 2) +
  scale_color_manual(values = c("#aa6600","#666666","#224477"))
```

--

<img src="figs/Lec2/scale-4-1.png" width="504" style="display: block; margin: auto;" />

---

# Mapping components: scale

```r
penguins %>% 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species), size = 2) +
  scale_color_manual(values = c("#aa6600","#666666","#224477")) +
  scale_x_continuous(name = "Bill Length (mm)",
                     breaks = seq(30, 60, by = 5),
                     limits = c(30, 60))
```

--

<img src="figs/Lec2/scale-5-1.png" width="504" style="display: block; margin: auto;" />

---

# ggplot

<br> <br> <br>

.huge.center[Data **+** Mapping]

1. layer <br> <br>
2. scale <br> <br>
3. **coord** <br> <br>
4. facet <br> <br>
5. theme

---

# Mapping components: coord

<br>

.large[x and y]

<br>

.medium[or latitude and longitude, or radius and angle]

--

<br>

.large[we'll discuss more about maps later!]

---

# Mapping components: coord

```r
penguins %>% 
  group_by(species) %>% 
  summarize(mass_mean = mean(body_mass_g, na.rm = TRUE)) %>% 
  ggplot(aes(x = species, y = mass_mean)) +
  geom_bar(stat = "identity")
```

<img src="figs/Lec2/coord-1-1.png" width="504" style="display: block; margin: auto;" />

---

# Mapping components: coord

```r
penguins %>% 
  group_by(species) %>% 
  summarize(mass_mean = mean(body_mass_g, na.rm = TRUE)) %>% 
  ggplot(aes(x = species, y = mass_mean)) +
  geom_bar(stat = "identity") +
* coord_flip()
```

--

<img src="figs/Lec2/coord-2-1.png" width="504" style="display: block; margin: auto;" />

---

# Mapping components: coord

```r
penguins %>% 
  count(sex) %>% 
  ggplot(aes(x = "", y = n, fill = sex)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
* coord_polar("y", start = 0) +
  theme_void()
```

--

<img src="figs/Lec2/coord-3-1.png" width="504" style="display: block; margin: auto;" />

--

.right.large[🤔]

---

# ggplot

<br> <br> <br>

.huge.center[Data **+** Mapping]

1. layer <br> <br>
2. scale <br> <br>
3. coord <br> <br>
4. **facet** <br> <br>
5. theme

---

# Mapping components: facet

<br>

.large[create small multiples]

<br>

.medium[useful for the *explore* part of data viz]

---

# Mapping components: facet

```r
penguins %>% 
  filter(!is.na(sex)) %>% 
  ggplot(aes(x = sex, y = body_mass_g)) +
  geom_point(position = "jitter")
```

<img src="figs/Lec2/facet-1-1.png" width="504" style="display: block; margin: auto;" />

---

# Mapping components: facet

```r
penguins %>% 
  filter(!is.na(sex)) %>% 
  ggplot(aes(x = sex, y = body_mass_g)) +
  geom_point(position = "jitter") +
* facet_wrap(~species)
```

--

<img src="figs/Lec2/facet-2-1.png" width="504" style="display: block; margin: auto;" />

---

# ggplot

<br> <br> <br>

.huge.center[Data **+** Mapping]

1. layer <br> <br>
2. scale <br> <br>
3. coord <br> <br>
4. facet <br> <br>
5. **theme**

---

# Mapping components: theme

<br>

.large[adjust individual pieces of the plot]

<br>

.medium[font size, gridlines, legend position, etc.]

--

<br>

.large[or go full out with a custom theme!]

---

# Mapping components: theme

```r
penguins %>% 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species), size = 2) +
  scale_color_manual(values = c("#aa6600","#666666","#224477"))
```

<img src="figs/Lec2/theme-1-1.png" width="504" style="display: block; margin: auto;" />

---

# Mapping components: theme

```r
penguins %>% 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species), size = 2) +
  scale_color_manual(values = c("#aa6600","#666666","#224477")) +
  theme(legend.position = "top",
        legend.title = element_blank())
```

--

<img src="figs/Lec2/theme-2-1.png" width="504" style="display: block; margin: auto;" />

---

# Mapping components: theme

```r
penguins %>% 
  filter(!is.na(sex)) %>% 
  ggplot(aes(x = sex, y = body_mass_g)) +
  geom_point(position = "jitter") +
  facet_wrap(~species)
```

<img src="figs/Lec2/theme-3-1.png" width="504" style="display: block; margin: auto;" />

---

# Mapping components: theme

```r
penguins %>% 
  filter(!is.na(sex)) %>% 
  ggplot(aes(x =
               sex, y = body_mass_g)) +
  geom_point(position = "jitter") +
  facet_wrap(~species) +
  theme(panel.grid.major.x = element_blank())
```

--

<img src="figs/Lec2/theme-4-1.png" width="504" style="display: block; margin: auto;" />

---

# ggplot in review

<br>

.large[A graphic...]

<br> <br>

.medium[maps the **data** +]

<br> <br>

.medium[to the **aesthetic attributes** +]

<br> <br>

.medium[of **geometric points** +]

<br> <br>

.medium[*with possible **statistical transformations** +*]

<br> <br>

.medium[*different **coordinate systems** +*]

<br> <br>

.medium[*and **faceting** *]

---

# What's the tidyverse?

.center[![tidyverse packages](figs/Lec2/tidyverse.png)]

---

# What's the tidyverse?

<br>

.large[An "opinionated" set of packages]

<br>

.medium[Similar philosophies, grammar, data structures]

<br>

.large[`dplyr`, `stringr`, `tidyr`, `readr`, `forcats`]

<br>

.medium[Useful for basic data manipulation]

<br>

.large[Best resource: [r4ds.had.co.nz/](https://r4ds.had.co.nz/)]

---

# Useful functions from `dplyr`

```r
penguins %>% 
  count(species)
```

```
## # A tibble: 3 x 2
##   species       n
##   <fct>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124
```

--

```r
penguins %>% 
  filter(species == "Gentoo") %>% 
  count(species)
```

```
## # A tibble: 1 x 2
##   species     n
##   <fct>   <int>
## 1 Gentoo    124
```

---

# Useful functions from `dplyr`

```r
penguins %>% 
  filter(species %in% c("Gentoo","Chinstrap")) %>% 
  count(species)
```

```r
penguins %>% 
  filter(species == "Gentoo" | species == "Chinstrap") %>% 
  count(species)
```

```r
penguins %>% 
  filter(species != "Adelie") %>% 
  count(species)
```

```
## # A tibble: 2 x 2
##   species       n
##   <fct>     <int>
## 1 Chinstrap    68
## 2 Gentoo      124
```

---

# Useful functions from `dplyr`

```r
penguins %>% 
  count(sex)
```

```
## # A tibble: 3 x 2
##   sex        n
##   <fct>  <int>
## 1 female   165
## 2 male     168
## 3 <NA>      11
```

```r
penguins %>% 
  filter(!is.na(sex)) %>% 
  count(sex)
```

```
## # A tibble: 2 x 2
##   sex        n
##   <fct>  <int>
## 1 female   165
## 2 male     168
```

---

# Useful functions from `dplyr`

```r
penguins %>% 
  group_by(species) %>% 
  summarize(mass_mean = mean(body_mass_g))
```

```
## # A tibble: 3 x 2
##   species   mass_mean
##   <fct>         <dbl>
## 1 Adelie          NA 
## 2 Chinstrap     3733.
## 3 Gentoo          NA 
```

--

```r
penguins %>% 
  filter(is.na(body_mass_g))
```

```
## # A tibble: 2 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…             NA            NA               NA          NA <NA> 
## 2 Gentoo  Biscoe             NA            NA               NA          NA <NA> 
## # … with 1 more variable: year <int>
```

---

# Useful functions from `dplyr`

```r
penguins %>% 
  group_by(species) %>% 
  summarize(mass_mean = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 3 x 2
##   species   mass_mean
##   <fct>         <dbl>
## 1 Adelie        3701.
## 2 Chinstrap     3733.
## 3 Gentoo        5076.
```

---

# Useful functions from `dplyr`

<br>

.medium[`select()` is like `filter()` but for variables instead of observations]

<br> <br>

.medium[`arrange()` sorts data]

<br> <br>

.medium[`mutate()` creates new variables (`ifelse` and `case_when` are often useful)]

<br> <br>

.medium[`rename()` does exactly what you think]

<br> <br>

.medium[`left_join()` (and other joins) combines data frames based on common keys]

---

# Useful functions from `stringr` and `tidyr`

<br>

.medium[`str_detect()` detects whether or not a pattern is present in a string]

<br> <br>

.medium[`str_replace()` replaces a pattern in a string with something else]

<br> <br>

--

<br>

.medium[`pivot_longer()` "lengthens" data, increasing the number of rows & decreasing the number of columns]

<br> <br>

.medium[`pivot_wider()` does the opposite]

---

# Upcoming

<br>

.large[Lab 1 on Tuesday May 25]

<br>

.medium[Assignments due 11:30am EDT Wednesday]

<br>

.large[Lecture 3 on Wednesday May 26]

<br>

.medium[Data types and bar charts]
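
---

# Sketch: the other `dplyr`, `stringr`, and `tidyr` functions

.medium[The "Useful functions" slides name `select()`, `arrange()`, `mutate()`, `rename()`, `left_join()`, `str_detect()`, `str_replace()`, `pivot_longer()`, and `pivot_wider()` without showing them in use. Below is a minimal sketch of one way they might fit together. The toy tibbles `birds` and `sizes`, their column names, and every value in them are made up for illustration and are not part of the penguins data.]

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical toy data (assumed for this sketch only)
birds <- tibble(
  id      = 1:4,
  species = c("Adelie", "Gentoo", "Adelie", "Chinstrap"),
  mass_g  = c(3700, 5100, 3500, 3800)
)
sizes <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  group   = c("small", "large", "small")
)

birds_clean <- birds %>% 
  select(id, species, mass_g) %>%            # keep these variables
  rename(body_mass_g = mass_g) %>%           # new_name = old_name
  mutate(heavy = ifelse(body_mass_g > 3750, "yes", "no")) %>% 
  arrange(desc(body_mass_g)) %>%             # sort, heaviest first
  left_join(sizes, by = "species")           # add `group` via the shared key

# One logical value per string: does it contain "ie"?
has_ie <- str_detect(birds$species, "ie")

# Replace the first match of the pattern in each string
relabeled <- str_replace(birds$species, "Chinstrap", "Chinstrap penguin")

# Longer: one row per id/measure pair. Wider: back to one row per id.
long <- birds %>% 
  mutate(mass_kg = mass_g / 1000) %>% 
  pivot_longer(c(mass_g, mass_kg), names_to = "measure", values_to = "value")

wide <- long %>% 
  pivot_wider(names_from = measure, values_from = value)
```

.medium[Printing `birds_clean`, `long`, and `wide` after each step is a quick way to see how the number of rows and columns changes.]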