Meghan Hall
Department of Statistics & Data Science
Carnegie Mellon University
June 4, 2021
Line graphs
Various techniques and considerations
Working with time
lubridate package
Homework
due on Tuesday
instructions!
Scatter plots
Considerations, overplotting, line of best fit
Relational data
Practicing joins with dplyr
scatter plots
relational data
dealing with overplotting
bubble chart
To study a relationship between two numeric variables
can also view by a group (categorical variable)
and sometimes with a third numeric variable
Line graphs (from Wednesday) are just a special kind of scatter plot
with a chronological variable (or proxy of one) on the x
and lines connecting the points to emphasize trends
friends_info
friends_emotions
friends
How does the relationship between viewers and IMDB rating look by:
season
predominant emotion of the episode
focus character
friends_info
season | episode | title | us_views_millions | imdb_rating |
---|---|---|---|---|
1 | 1 | The Pilot | 21.5 | 8.3 |
1 | 2 | The One with the Sonogram at the End | 20.2 | 8.1 |
1 | 3 | The One with the Thumb | 19.5 | 8.2 |
1 | 4 | The One with George Stephanopoulos | 19.7 | 8.1 |
1 | 5 | The One with the East German Laundry Detergent | 18.6 | 8.5 |
1 | 6 | The One with the Butt | 18.2 | 8.1 |
1 | 7 | The One with the Blackout | 23.5 | 9.0 |
1 | 8 | The One Where Nana Dies Twice | 21.1 | 8.1 |
1 | 9 | The One Where Underdog Gets Away | 23.1 | 8.2 |
1 | 10 | The One with the Monkey | 19.9 | 8.1 |
friends_info %>% ggplot(aes(x = us_views_millions, y = imdb_rating)) + geom_point()
friends_info %>% ggplot(aes(x = us_views_millions, y = imdb_rating)) + geom_point(alpha = 0.5, color = "red", size = 2)
friends_info %>% ggplot(aes(x = us_views_millions, y = imdb_rating)) + geom_jitter()
friends_info %>% ggplot(aes(x = us_views_millions, y = imdb_rating, color = season)) + geom_jitter()
friends_info %>% ggplot(aes(x = us_views_millions, y = imdb_rating, color = as.character(season))) + geom_jitter()
friends_info %>% ggplot(aes(x = us_views_millions, y = imdb_rating, color = season)) + geom_jitter(size = 2) + scale_colour_gradient(low = "#fafafa", high = "#191970", breaks = seq(1, 10, 1))
friends_info %>% ggplot(aes(x = us_views_millions, y = imdb_rating)) + geom_jitter(size = 2) + geom_smooth(method = "lm")
friends_info %>% ggplot(aes(x = us_views_millions, y = imdb_rating)) + geom_jitter(size = 2) + geom_smooth(method = "lm", se = FALSE)
friends_info %>% ggplot(aes(x = us_views_millions, y = imdb_rating)) + geom_jitter(size = 2) + geom_smooth(method = "lm", level = 0.99, color = "purple", fill = "#DCD0FF")
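The line drawn by geom_smooth(method = "lm") is an ordinary linear regression of y on x, and the shaded band is a confidence interval around the fitted line. A minimal sketch of the same fit done by hand is below. It assumes friends_info comes from the friends package, and the viewer values in new_x are hypothetical, picked only for illustration.

```r
# Sketch only: assumes the friends package (taken to be the source of friends_info) is installed.
library(friends)

# The model that geom_smooth(method = "lm") fits: rating as a linear function of viewers.
fit <- lm(imdb_rating ~ us_views_millions, data = friends_info)

# Intercept and slope of the line of best fit.
coef(fit)

# Hypothetical viewer counts (in millions) at which to evaluate the line.
new_x <- data.frame(us_views_millions = c(20, 25, 30))

# Fitted value plus lower and upper 99% confidence limits,
# the same quantity the shaded band shows when level = 0.99.
band <- predict(fit, newdata = new_x, interval = "confidence", level = 0.99)
band
```

Setting se = FALSE in geom_smooth() drops the band and keeps only the line described by coef(fit).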
How does the relationship between viewers and IMDB rating look by:
season
predominant emotion of the episode
focus character
friends_emotions
season | episode | scene | utterance | emotion |
---|---|---|---|---|
1 | 1 | 4 | 1 | Mad |
1 | 1 | 4 | 3 | Neutral |
1 | 1 | 4 | 4 | Joyful |
1 | 1 | 4 | 5 | Neutral |
1 | 1 | 4 | 6 | Neutral |
1 | 1 | 4 | 7 | Neutral |
1 | 1 | 4 | 8 | Scared |
1 | 1 | 4 | 10 | Joyful |
1 | 1 | 4 | 11 | Joyful |
1 | 1 | 4 | 12 | Sad |
scatter plots
relational data
dealing with overplotting
bubble chart
The collective term for multiple tables of (related) data
can easily be combined thanks to joins (from dplyr)
Mutating joins: adds new variables (columns) to a data frame based on matching observations in another
possible through keys: variables that uniquely identify observations
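A minimal sketch of a mutating join on a two-column key. The episodes and ratings tables are hypothetical toy data, made up for illustration and not taken from the Friends data set.

```r
library(dplyr)

# Hypothetical toy tables (values are made up for illustration).
episodes <- tibble(
  season  = c(1, 1, 2),
  episode = c(1, 2, 1),
  title   = c("A", "B", "C")
)
ratings <- tibble(
  season  = c(1, 2),
  episode = c(1, 1),
  rating  = c(8.3, 8.9)
)

# Check the key: (season, episode) should identify each row of ratings exactly once.
dup_keys <- ratings %>%
  count(season, episode) %>%
  filter(n > 1)
nrow(dup_keys)

# Mutating join: keeps every row of episodes and adds rating where the key matches.
# Rows of episodes with no match in ratings get NA in the new column.
joined <- episodes %>%
  left_join(ratings, by = c("season", "episode"))
joined
```

If the key check returns any rows, the join would duplicate rows of the left table, so it is worth running before joining.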
friends_joyful_sad <- friends_emotions %>%
  group_by(season, episode, emotion) %>%
  summarize(count = n()) %>%
  add_count(wt = count) %>%
  mutate(percent = count / n) %>%
  filter(emotion %in% c("Joyful", "Sad")) %>%
  select(-c(count, n)) %>%
  pivot_wider(names_from = emotion, values_from = percent) %>%
  mutate(Sad = replace_na(Sad, 0))
Result after summarize(count = n()):
season | episode | emotion | count |
---|---|---|---|
1 | 1 | Joyful | 15 |
1 | 1 | Mad | 17 |
1 | 1 | Neutral | 34 |
1 | 1 | Peaceful | 8 |
1 | 1 | Powerful | 4 |
1 | 1 | Sad | 14 |
1 | 1 | Scared | 9 |
1 | 2 | Joyful | 18 |
1 | 2 | Mad | 27 |
1 | 2 | Neutral | 42 |
Result after add_count(wt = count) and mutate(percent = count / n):
season | episode | emotion | count | n | percent |
---|---|---|---|---|---|
1 | 1 | Joyful | 15 | 101 | 0.1485149 |
1 | 1 | Mad | 17 | 101 | 0.1683168 |
1 | 1 | Neutral | 34 | 101 | 0.3366337 |
1 | 1 | Peaceful | 8 | 101 | 0.0792079 |
1 | 1 | Powerful | 4 | 101 | 0.0396040 |
1 | 1 | Sad | 14 | 101 | 0.1386139 |
1 | 1 | Scared | 9 | 101 | 0.0891089 |
1 | 2 | Joyful | 18 | 132 | 0.1363636 |
1 | 2 | Mad | 27 | 132 | 0.2045455 |
1 | 2 | Neutral | 42 | 132 | 0.3181818 |
Result after filter(emotion %in% c("Joyful","Sad")) and select(-c(count, n)):
season | episode | emotion | percent |
---|---|---|---|
1 | 1 | Joyful | 0.1485149 |
1 | 1 | Sad | 0.1386139 |
1 | 2 | Joyful | 0.1363636 |
1 | 2 | Sad | 0.0681818 |
1 | 3 | Joyful | 0.1338583 |
1 | 3 | Sad | 0.0787402 |
1 | 4 | Joyful | 0.2675159 |
1 | 4 | Sad | 0.0318471 |
1 | 5 | Joyful | 0.2179487 |
1 | 5 | Sad | 0.0705128 |
Result after pivot_wider(names_from = emotion, values_from = percent):
season | episode | Joyful | Sad |
---|---|---|---|
1 | 1 | 0.1485149 | 0.1386139 |
1 | 2 | 0.1363636 | 0.0681818 |
1 | 3 | 0.1338583 | 0.0787402 |
1 | 4 | 0.2675159 | 0.0318471 |
1 | 5 | 0.2179487 | 0.0705128 |
1 | 6 | 0.1666667 | NA |
1 | 7 | 0.3437500 | 0.0312500 |
1 | 8 | 0.1710526 | 0.0263158 |
1 | 9 | 0.1855670 | 0.0412371 |
1 | 10 | 0.2641509 | 0.0377358 |
Result after mutate(Sad = replace_na(Sad, 0)):
season | episode | Joyful | Sad |
---|---|---|---|
1 | 1 | 0.1485149 | 0.1386139 |
1 | 2 | 0.1363636 | 0.0681818 |
1 | 3 | 0.1338583 | 0.0787402 |
1 | 4 | 0.2675159 | 0.0318471 |
1 | 5 | 0.2179487 | 0.0705128 |
1 | 6 | 0.1666667 | 0.0000000 |
1 | 7 | 0.3437500 | 0.0312500 |
1 | 8 | 0.1710526 | 0.0263158 |
1 | 9 | 0.1855670 | 0.0412371 |
1 | 10 | 0.2641509 | 0.0377358 |
friends_info %>% left_join(friends_joyful_sad, by = c("episode","season"))
season | episode | title | us_views_millions | imdb_rating | Joyful | Sad |
---|---|---|---|---|---|---|
1 | 1 | The Pilot | 21.5 | 8.3 | 0.1485149 | 0.1386139 |
1 | 2 | The One with the Sonogram at the End | 20.2 | 8.1 | 0.1363636 | 0.0681818 |
1 | 3 | The One with the Thumb | 19.5 | 8.2 | 0.1338583 | 0.0787402 |
1 | 4 | The One with George Stephanopoulos | 19.7 | 8.1 | 0.2675159 | 0.0318471 |
1 | 5 | The One with the East German Laundry Detergent | 18.6 | 8.5 | 0.2179487 | 0.0705128 |
friends_info %>% left_join(friends_joyful_sad, by = c("episode","season")) %>% filter(season <= 4) %>% ggplot(aes(x = us_views_millions, y = imdb_rating, color = Joyful)) + geom_jitter()
friends_info %>% left_join(friends_joyful_sad, by = c("episode","season")) %>% filter(season <= 4) %>% ggplot(aes(x = us_views_millions, y = imdb_rating, color = Joyful)) + geom_jitter() + scale_colour_gradient(low = "#fafafa", high = "#191970", breaks = seq(0.1, 0.4, 0.1))
friends_info %>% left_join(friends_joyful_sad, by = c("episode","season")) %>% filter(season <= 4) %>% ggplot(aes(x = us_views_millions, y = imdb_rating, color = Sad)) + geom_jitter() + scale_colour_gradient(low = "#fafafa", high = "#191970", breaks = seq(0, 0.25, 0.05))
How does the relationship between viewers and IMDB rating look by:
season
predominant emotion of the episode
focus character
friends
text | speaker | season | episode | scene | utterance |
---|---|---|---|---|---|
There's nothing to tell! He's just some guy I work with! | Monica Geller | 1 | 1 | 1 | 1 |
C'mon, you're going out with the guy! There's gotta be something wrong with him! | Joey Tribbiani | 1 | 1 | 1 | 2 |
All right Joey, be nice. So does he have a hump? A hump and a hairpiece? | Chandler Bing | 1 | 1 | 1 | 3 |
Wait, does he eat chalk? | Phoebe Buffay | 1 | 1 | 1 | 4 |
(They all stare, bemused.) | Scene Directions | 1 | 1 | 1 | 5 |
Just, 'cause, I don't want her to go through what I went through with Carl- oh! | Phoebe Buffay | 1 | 1 | 1 | 6 |
friends_top_actor <- friends %>%
  group_by(season, episode, speaker) %>%
  summarize(count = n()) %>%
  add_count(wt = count) %>%
  mutate(percent = count / n) %>%
  filter(speaker %in% c("Chandler Bing", "Joey Tribbiani", "Monica Geller",
                        "Phoebe Buffay", "Rachel Green", "Ross Geller")) %>%
  filter(percent == max(percent)) %>%
  select(-c(count, n, percent))
Result after mutate(percent = count / n):
season | episode | speaker | count | n | percent |
---|---|---|---|---|---|
1 | 1 | #ALL# | 8 | 323 | 0.0247678 |
1 | 1 | Chandler Bing | 39 | 323 | 0.1207430 |
1 | 1 | Customer | 1 | 323 | 0.0030960 |
1 | 1 | Franny | 5 | 323 | 0.0154799 |
1 | 1 | Joey Tribbiani | 39 | 323 | 0.1207430 |
1 | 1 | Monica Geller | 73 | 323 | 0.2260062 |
1 | 1 | Paul the Wine Guy | 17 | 323 | 0.0526316 |
1 | 1 | Phoebe Buffay | 19 | 323 | 0.0588235 |
1 | 1 | Priest On Tv | 1 | 323 | 0.0030960 |
1 | 1 | Rachel Green | 48 | 323 | 0.1486068 |
Result after filter(percent == max(percent)), before the final select():
season | episode | speaker | count | n | percent |
---|---|---|---|---|---|
1 | 1 | Monica Geller | 73 | 323 | 0.2260062 |
1 | 2 | Ross Geller | 68 | 258 | 0.2635659 |
1 | 3 | Monica Geller | 52 | 285 | 0.1824561 |
1 | 4 | Monica Geller | 47 | 260 | 0.1807692 |
1 | 5 | Ross Geller | 40 | 267 | 0.1498127 |
1 | 6 | Chandler Bing | 58 | 238 | 0.2436975 |
1 | 7 | Ross Geller | 53 | 258 | 0.2054264 |
1 | 8 | Ross Geller | 61 | 258 | 0.2364341 |
1 | 9 | Monica Geller | 48 | 251 | 0.1912351 |
1 | 10 | Phoebe Buffay | 51 | 257 | 0.1984436 |
friends_info %>% left_join(friends_top_actor, by = c("episode","season"))
season | episode | title | us_views_millions | imdb_rating | speaker |
---|---|---|---|---|---|
1 | 1 | The Pilot | 21.5 | 8.3 | Monica Geller |
1 | 2 | The One with the Sonogram at the End | 20.2 | 8.1 | Ross Geller |
1 | 3 | The One with the Thumb | 19.5 | 8.2 | Monica Geller |
1 | 4 | The One with George Stephanopoulos | 19.7 | 8.1 | Monica Geller |
1 | 5 | The One with the East German Laundry Detergent | 18.6 | 8.5 | Ross Geller |
friends_info %>% left_join(friends_top_actor, by = c("episode","season")) %>% ggplot(aes(x = us_views_millions, y = imdb_rating, color = speaker)) + geom_jitter(size = 2)
scatter plots
relational data
dealing with overplotting
bubble chart
When the sample size is large, overplotting can disguise trends
Techniques:
use smaller dots and/or transparency
add color by group
add jittering
add a rug plot
txhousing %>% filter(month == 1 & listings < 20000) %>% ggplot(aes(x = median, y = listings)) + geom_point()
txhousing %>% filter(month == 1 & listings < 20000) %>% ggplot(aes(x = median, y = listings)) + geom_point() + geom_rug(color = "purple", alpha = 0.1, size = 2)
When the sample size is large, overplotting can disguise trends
Techniques:
use smaller dots and/or transparency
add color by group
add jittering
add a rug plot
add a marginal distribution
create a hexbin
txhousing %>% filter(month == 1 & listings < 20000) %>% ggplot(aes(x = median, y = listings)) + geom_hex(bins = 30)
storms %>% ggplot(aes(x = wind, y = pressure)) + geom_hex(bins = 30)
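The slides give no code for the marginal distribution technique. One possible sketch uses ggMarginal() from the ggExtra package. That package choice is an assumption, since the slides do not name one. The txhousing filter is the one used for the rug plot, with rows missing a median price also dropped.

```r
# Sketch only: ggExtra is an assumption; the slides do not say which package was used.
library(dplyr)
library(ggplot2)
library(ggExtra)

# Same scatter plot as the rug plot example, with some transparency.
p <- txhousing %>%
  filter(month == 1 & listings < 20000, !is.na(median)) %>%
  ggplot(aes(x = median, y = listings)) +
  geom_point(alpha = 0.3)

# Adds a histogram of each variable along the top and right edges of the plot.
p_marginal <- ggMarginal(p, type = "histogram")
p_marginal
```

Like a rug plot, the marginal histograms show where observations pile up along each axis even when the points themselves overlap.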
scatter plots
relational data
dealing with overplotting
bubble chart
Adding a third quantitative variable, mapped to point size
Not generally recommended!
Why?
- encoding the same type of variable (numeric) on two different scales: position and size
- hard to compare the strengths of different associations
- much easier to perceive differences when encoded by position rather than size
- hard to see small differences in size
- difficult to match scale of circle size to scale of difference
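The last point can be made concrete with a little arithmetic. In ggplot2, scale_size() scales the area of a point and scale_radius() scales its radius, and the two give very different impressions of the same data. The two values below are hypothetical, chosen only so that one is exactly twice the other.

```r
# Hypothetical values, chosen only to illustrate the arithmetic.
bill <- c(small = 30, large = 60)   # the large value is twice the small one

# If circle AREA is proportional to the value, radius grows with the square root.
radius_if_area_scaled <- sqrt(bill / pi)
radius_ratio <- unname(radius_if_area_scaled["large"] / radius_if_area_scaled["small"])

# If circle RADIUS is proportional to the value, area grows with the square.
area_if_radius_scaled <- pi * bill^2
area_ratio <- unname(area_if_radius_scaled["large"] / area_if_radius_scaled["small"])

radius_ratio  # sqrt(2), about 1.41: the circle is only about 41% wider
area_ratio    # 4: the circle covers four times the area
```

Either way, a value that is twice as large does not look twice as large, which is part of why position is the easier encoding to read.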
penguins %>% filter(species != "Gentoo") %>% ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + geom_point()
penguins %>% filter(species != "Gentoo") %>% ggplot(aes(x = flipper_length_mm, y = body_mass_g, size = bill_length_mm)) + geom_point()
penguins %>% filter(species != "Gentoo") %>% ggplot(aes(x = flipper_length_mm, y = body_mass_g, size = bill_length_mm)) + geom_point(alpha = 0.5, color = "red") + scale_size(range = c(0.1, 7), breaks = c(35, 40, 45, 50, 55)) + theme(legend.position = "top")
# geom_label_repel() comes from the ggrepel package
penguins %>%
  filter(species != "Gentoo") %>%
  # label one hand-picked point per species; all other rows get NA
  mutate(label = case_when(flipper_length_mm == 192 & body_mass_g == 2700 ~ "Chinstrap",
                           flipper_length_mm == 184 & body_mass_g == 4650 ~ "Adelie")) %>%
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, size = bill_length_mm, color = species)) +
  geom_point(alpha = 0.5) +
  scale_size(range = c(0.1, 7), breaks = c(35, 40, 45, 50, 55), name = "Bill Length (mm)") +
  geom_label_repel(aes(x = flipper_length_mm, y = body_mass_g, color = species, label = label),
                   inherit.aes = FALSE) +
  scale_color_discrete(guide = "none") +
  theme(legend.position = "top")
Graphic critique due before midterm
Details on syllabus
Homework 2 due Tuesday June 8
Lecture 7 on Monday June 7
Line graphs
Various techniques and considerations
Working with time
lubridate package