
36-315: Statistical Graphics and Visualization

Lecture 6

Meghan Hall
Department of Statistics & Data Science
Carnegie Mellon University
June 4, 2021

1

From last time


Line graphs
Various techniques and considerations


Working with time
lubridate package

2

Updates

Homework 2
due on Tuesday, June 8
instructions!

3

Today


Scatter plots
Considerations, overplotting, line of best fit


Relational data
Practicing joins with dplyr

4

Today's agenda


  1. scatter plots

  2. relational data

  3. dealing with overplotting

  4. bubble chart

5

The purpose of a scatter plot



To study a relationship between two numeric variables
can also view by a group (categorical variable)
and sometimes with a third numeric variable


Line graphs (from Wednesday) are just a special kind of scatter plot
with a chronological variable (or a proxy for one) on the x-axis
and lines connecting the points to emphasize trends
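
A minimal sketch of that idea: the same aesthetics drawn first as points only, then with a line joining the points. The `views` data frame is made-up illustration data, not from the lecture.

```r
library(ggplot2)

# Hypothetical toy data: one value per year (not from the lecture)
views <- data.frame(
  year    = 2001:2006,
  viewers = c(21.5, 20.2, 19.5, 22.1, 23.5, 24.0)
)

# Scatter plot: position encodes both variables
p_scatter <- ggplot(views, aes(x = year, y = viewers)) +
  geom_point()

# Line graph: the same points, plus a line joining them in x order
p_line <- p_scatter + geom_line()
```

Printing `p_scatter` and `p_line` side by side shows that only the connecting layer differs.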

7

Today's data

friends

friends_info
friends_emotions
friends

8

Questions to examine



How does the relationship between viewers and IMDB rating look by:

season

predominant emotion of the episode

focus character

9

Today's data


friends_info

season  episode  title                                           us_views_millions  imdb_rating
1       1        The Pilot                                       21.5               8.3
1       2        The One with the Sonogram at the End            20.2               8.1
1       3        The One with the Thumb                          19.5               8.2
1       4        The One with George Stephanopoulos              19.7               8.1
1       5        The One with the East German Laundry Detergent  18.6               8.5
1       6        The One with the Butt                           18.2               8.1
1       7        The One with the Blackout                       23.5               9.0
1       8        The One Where Nana Dies Twice                   21.1               8.1
1       9        The One Where Underdog Gets Away                23.1               8.2
1       10       The One with the Monkey                         19.9               8.1
10

Basic scatter plot

friends_info %>%
ggplot(aes(x = us_views_millions, y = imdb_rating)) +
geom_point()

11

Basic scatter plot

friends_info %>%
ggplot(aes(x = us_views_millions, y = imdb_rating)) +
geom_point(alpha = 0.5, color = "red", size = 2)

12

Basic scatter plot

friends_info %>%
ggplot(aes(x = us_views_millions, y = imdb_rating)) +
geom_jitter()

13

Basic scatter plot

friends_info %>%
ggplot(aes(x = us_views_millions, y = imdb_rating, color = season)) +
geom_jitter()

14

Basic scatter plot

friends_info %>%
  ggplot(aes(x = us_views_millions, y = imdb_rating,
             color = as.character(season))) +
  geom_jitter()
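
A numeric `season` gets a continuous color gradient; converting it to text gives discrete colors. One thing to watch, sketched here on made-up data: `as.character()` sorts as text, so season 10 lands before season 2 in the legend, while `factor()` on a numeric vector keeps numeric order. The `toy` data frame is hypothetical.

```r
library(ggplot2)

# Hypothetical toy data (not from the lecture)
toy <- data.frame(
  views  = c(21.5, 24.0, 27.3),
  rating = c(8.3, 8.5, 8.9),
  season = c(1, 2, 10)
)

# Character conversion sorts as text, so "10" comes before "2"
chr_order <- sort(unique(as.character(toy$season)))

# factor() on a numeric vector keeps numeric order in its levels
fct_order <- levels(factor(toy$season))

# Discrete colors with the legend in season order
p_discrete <- ggplot(toy, aes(x = views, y = rating, color = factor(season))) +
  geom_point()
```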

15

Basic scatter plot

friends_info %>%
  ggplot(aes(x = us_views_millions, y = imdb_rating, color = season)) +
  geom_jitter(size = 2) +
  scale_colour_gradient(low = "#fafafa", high = "#191970",
                        breaks = seq(1, 10, 1))

16

Scatter plot with best-fit line

friends_info %>%
ggplot(aes(x = us_views_millions, y = imdb_rating)) +
geom_jitter(size = 2) +
geom_smooth(method = "lm")

17

Scatter plot with best-fit line

friends_info %>%
ggplot(aes(x = us_views_millions, y = imdb_rating)) +
geom_jitter(size = 2) +
geom_smooth(method = "lm", se = FALSE)

18

Scatter plot with best-fit line

friends_info %>%
  ggplot(aes(x = us_views_millions, y = imdb_rating)) +
  geom_jitter(size = 2) +
  geom_smooth(method = "lm", level = 0.99,
              color = "purple", fill = "#DCD0FF")

19

Questions to examine



How does the relationship between viewers and IMDB rating look by:

season

predominant emotion of the episode

focus character

20

Today's data


friends_emotions

season episode scene utterance emotion
1 1 4 1 Mad
1 1 4 3 Neutral
1 1 4 4 Joyful
1 1 4 5 Neutral
1 1 4 6 Neutral
1 1 4 7 Neutral
1 1 4 8 Scared
1 1 4 10 Joyful
1 1 4 11 Joyful
1 1 4 12 Sad
21

Today's agenda


  1. scatter plots

  2. relational data

  3. dealing with overplotting

  4. bubble chart

22

Relational data



The collective term for multiple tables of (related) data
tables can easily be combined with joins (from dplyr)


Mutating joins: add new variables (columns) to a data frame based on matching observations in another
made possible by keys: variables that uniquely identify observations (here, season and episode together)
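
A minimal sketch of a key check followed by a mutating join. The `info` and `extra` tables and their values are made up for illustration; only the `(season, episode)` key mirrors today's data.

```r
library(dplyr)

# Hypothetical toy tables keyed by (season, episode); values are made up
info <- tibble(
  season  = c(1, 1, 2),
  episode = c(1, 2, 1),
  rating  = c(8.3, 8.1, 8.5)
)
extra <- tibble(
  season  = c(1, 2),
  episode = c(1, 1),
  joyful  = c(0.15, 0.20)
)

# Key check: any (season, episode) combination that appears more than once
dup_keys <- info %>%
  count(season, episode) %>%
  filter(n > 1)

# Mutating join: keep every row of info, add joyful where the key matches
joined <- info %>%
  left_join(extra, by = c("season", "episode"))
```

`dup_keys` having zero rows means the key is unique in `info`. Rows of `info` with no match in `extra` stay in `joined`, with `NA` in the new column.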

23

Today's data

diagram1

24

Today's data


friends_emotions

season episode scene utterance emotion
1 1 4 1 Mad
1 1 4 3 Neutral
1 1 4 4 Joyful
1 1 4 5 Neutral
1 1 4 6 Neutral
1 1 4 7 Neutral
1 1 4 8 Scared
1 1 4 10 Joyful
1 1 4 11 Joyful
1 1 4 12 Sad
25

Data manipulation

# The next slide shows the intermediate result after summarize()
friends_joyful_sad <- friends_emotions %>%
  group_by(season, episode, emotion) %>%
  summarize(count = n()) %>%
  add_count(wt = count) %>%
  mutate(percent = count / n) %>%
  filter(emotion %in% c("Joyful","Sad")) %>%
  select(-c(count, n)) %>%
  pivot_wider(names_from = emotion, values_from = percent) %>%
  mutate(Sad = replace_na(Sad, 0))
26

Data manipulation

season episode emotion count
1 1 Joyful 15
1 1 Mad 17
1 1 Neutral 34
1 1 Peaceful 8
1 1 Powerful 4
1 1 Sad 14
1 1 Scared 9
1 2 Joyful 18
1 2 Mad 27
1 2 Neutral 42
27

Data manipulation

# The next slide shows the intermediate result after mutate(percent = count / n)
friends_joyful_sad <- friends_emotions %>%
  group_by(season, episode, emotion) %>%
  summarize(count = n()) %>%
  add_count(wt = count) %>%
  mutate(percent = count / n) %>%
  filter(emotion %in% c("Joyful","Sad")) %>%
  select(-c(count, n)) %>%
  pivot_wider(names_from = emotion, values_from = percent) %>%
  mutate(Sad = replace_na(Sad, 0))
28

Data manipulation

season episode emotion count n percent
1 1 Joyful 15 101 0.1485149
1 1 Mad 17 101 0.1683168
1 1 Neutral 34 101 0.3366337
1 1 Peaceful 8 101 0.0792079
1 1 Powerful 4 101 0.0396040
1 1 Sad 14 101 0.1386139
1 1 Scared 9 101 0.0891089
1 2 Joyful 18 132 0.1363636
1 2 Mad 27 132 0.2045455
1 2 Neutral 42 132 0.3181818
29

Data manipulation

# The next slide shows the intermediate result after filter() and select()
friends_joyful_sad <- friends_emotions %>%
  group_by(season, episode, emotion) %>%
  summarize(count = n()) %>%
  add_count(wt = count) %>%
  mutate(percent = count / n) %>%
  filter(emotion %in% c("Joyful","Sad")) %>%
  select(-c(count, n)) %>%
  pivot_wider(names_from = emotion, values_from = percent) %>%
  mutate(Sad = replace_na(Sad, 0))
30

Data manipulation

season episode emotion percent
1 1 Joyful 0.1485149
1 1 Sad 0.1386139
1 2 Joyful 0.1363636
1 2 Sad 0.0681818
1 3 Joyful 0.1338583
1 3 Sad 0.0787402
1 4 Joyful 0.2675159
1 4 Sad 0.0318471
1 5 Joyful 0.2179487
1 5 Sad 0.0705128
31

Data manipulation

# The next slide shows the intermediate result after pivot_wider()
friends_joyful_sad <- friends_emotions %>%
  group_by(season, episode, emotion) %>%
  summarize(count = n()) %>%
  add_count(wt = count) %>%
  mutate(percent = count / n) %>%
  filter(emotion %in% c("Joyful","Sad")) %>%
  select(-c(count, n)) %>%
  pivot_wider(names_from = emotion, values_from = percent) %>%
  mutate(Sad = replace_na(Sad, 0))
32

Data manipulation

season episode Joyful Sad
1 1 0.1485149 0.1386139
1 2 0.1363636 0.0681818
1 3 0.1338583 0.0787402
1 4 0.2675159 0.0318471
1 5 0.2179487 0.0705128
1 6 0.1666667 NA
1 7 0.3437500 0.0312500
1 8 0.1710526 0.0263158
1 9 0.1855670 0.0412371
1 10 0.2641509 0.0377358
33

Data manipulation

# The next slide shows the final result, after replace_na()
friends_joyful_sad <- friends_emotions %>%
  group_by(season, episode, emotion) %>%
  summarize(count = n()) %>%
  add_count(wt = count) %>%
  mutate(percent = count / n) %>%
  filter(emotion %in% c("Joyful","Sad")) %>%
  select(-c(count, n)) %>%
  pivot_wider(names_from = emotion, values_from = percent) %>%
  mutate(Sad = replace_na(Sad, 0))
34

Data manipulation

season episode Joyful Sad
1 1 0.1485149 0.1386139
1 2 0.1363636 0.0681818
1 3 0.1338583 0.0787402
1 4 0.2675159 0.0318471
1 5 0.2179487 0.0705128
1 6 0.1666667 0.0000000
1 7 0.3437500 0.0312500
1 8 0.1710526 0.0263158
1 9 0.1855670 0.0412371
1 10 0.2641509 0.0377358
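
The steps above can be replayed on a tiny made-up table to see where the `NA` comes from and why it is replaced with 0. This sketch is a variant, not the slide code: it swaps `group_by()` + `summarize()` + `add_count()` for `count()` and a grouped `sum()`, which should give the same shares. `toy_emotions` is hypothetical data.

```r
library(dplyr)
library(tidyr)

# Hypothetical utterance-level data: episode 1 has no "Sad" utterances
toy_emotions <- tibble(
  season  = 1,
  episode = c(1, 1, 1, 1, 2, 2, 2, 2),
  emotion = c("Joyful", "Mad", "Neutral", "Neutral",
              "Joyful", "Joyful", "Sad", "Neutral")
)

toy_joyful_sad <- toy_emotions %>%
  count(season, episode, emotion, name = "count") %>%  # utterances per emotion
  group_by(season, episode) %>%
  mutate(percent = count / sum(count)) %>%             # share within the episode
  ungroup() %>%
  filter(emotion %in% c("Joyful", "Sad")) %>%
  select(-count) %>%
  pivot_wider(names_from = emotion, values_from = percent) %>%
  mutate(Sad = replace_na(Sad, 0))                     # no Sad rows means a 0 share
```

Episode 1 has no `Sad` row going into `pivot_wider()`, so its `Sad` cell starts out as `NA`; `replace_na()` turns that into a share of 0.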
35

Joining data

friends_info %>%
left_join(friends_joyful_sad, by = c("episode","season"))
36

Joining data

friends_info %>%
left_join(friends_joyful_sad, by = c("episode","season"))
season  episode  title                                           us_views_millions  imdb_rating  Joyful     Sad
1       1        The Pilot                                       21.5               8.3          0.1485149  0.1386139
1       2        The One with the Sonogram at the End            20.2               8.1          0.1363636  0.0681818
1       3        The One with the Thumb                          19.5               8.2          0.1338583  0.0787402
1       4        The One with George Stephanopoulos              19.7               8.1          0.2675159  0.0318471
1       5        The One with the East German Laundry Detergent  18.6               8.5          0.2179487  0.0705128
36

Basic scatter plot

friends_info %>%
left_join(friends_joyful_sad, by = c("episode","season")) %>%
filter(season <= 4) %>%
ggplot(aes(x = us_views_millions, y = imdb_rating, color = Joyful)) +
geom_jitter()

37

Basic scatter plot

friends_info %>%
  left_join(friends_joyful_sad, by = c("episode","season")) %>%
  filter(season <= 4) %>%
  ggplot(aes(x = us_views_millions, y = imdb_rating, color = Joyful)) +
  geom_jitter() +
  scale_colour_gradient(low = "#fafafa", high = "#191970",
                        breaks = seq(0.1, 0.4, 0.1))

38

Basic scatter plot

friends_info %>%
  left_join(friends_joyful_sad, by = c("episode","season")) %>%
  filter(season <= 4) %>%
  ggplot(aes(x = us_views_millions, y = imdb_rating, color = Sad)) +
  geom_jitter() +
  scale_colour_gradient(low = "#fafafa", high = "#191970",
                        breaks = seq(0, 0.25, 0.05))

39

Questions to examine



How does the relationship between viewers and IMDB rating look by:

season

predominant emotion of the episode

focus character

40

Today's data


friends

text speaker season episode scene utterance
There's nothing to tell! He's just some guy I work with! Monica Geller 1 1 1 1
C'mon, you're going out with the guy! There's gotta be something wrong with him! Joey Tribbiani 1 1 1 2
All right Joey, be nice. So does he have a hump? A hump and a hairpiece? Chandler Bing 1 1 1 3
Wait, does he eat chalk? Phoebe Buffay 1 1 1 4
(They all stare, bemused.) Scene Directions 1 1 1 5
Just, 'cause, I don't want her to go through what I went through with Carl- oh! Phoebe Buffay 1 1 1 6
41

Today's data

diagram2

42

Today's data

diagram3

43

Data manipulation

# The next slide shows the intermediate result after mutate(percent = count / n)
friends_top_actor <- friends %>%
  group_by(season, episode, speaker) %>%
  summarize(count = n()) %>%
  add_count(wt = count) %>%
  mutate(percent = count / n) %>%
  filter(speaker %in% c("Chandler Bing","Joey Tribbiani",
                        "Monica Geller","Phoebe Buffay",
                        "Rachel Green","Ross Geller")) %>%
  filter(percent == max(percent)) %>%
  select(-c(count, n, percent))
44

Data manipulation

season episode speaker count n percent
1 1 #ALL# 8 323 0.0247678
1 1 Chandler Bing 39 323 0.1207430
1 1 Customer 1 323 0.0030960
1 1 Franny 5 323 0.0154799
1 1 Joey Tribbiani 39 323 0.1207430
1 1 Monica Geller 73 323 0.2260062
1 1 Paul the Wine Guy 17 323 0.0526316
1 1 Phoebe Buffay 19 323 0.0588235
1 1 Priest On Tv 1 323 0.0030960
1 1 Rachel Green 48 323 0.1486068
45

Data manipulation

# The next slide shows the result after filter(percent == max(percent)), before select()
friends_top_actor <- friends %>%
  group_by(season, episode, speaker) %>%
  summarize(count = n()) %>%
  add_count(wt = count) %>%
  mutate(percent = count / n) %>%
  filter(speaker %in% c("Chandler Bing","Joey Tribbiani",
                        "Monica Geller","Phoebe Buffay",
                        "Rachel Green","Ross Geller")) %>%
  filter(percent == max(percent)) %>%
  select(-c(count, n, percent))
46

Data manipulation

season episode speaker count n percent
1 1 Monica Geller 73 323 0.2260062
1 2 Ross Geller 68 258 0.2635659
1 3 Monica Geller 52 285 0.1824561
1 4 Monica Geller 47 260 0.1807692
1 5 Ross Geller 40 267 0.1498127
1 6 Chandler Bing 58 238 0.2436975
1 7 Ross Geller 53 258 0.2054264
1 8 Ross Geller 61 258 0.2364341
1 9 Monica Geller 48 251 0.1912351
1 10 Phoebe Buffay 51 257 0.1984436
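
One caveat with `filter(percent == max(percent))`: if two characters tie for the most lines in an episode, both rows survive, and a later join would then duplicate that episode. A possible alternative, sketched on made-up counts, is `slice_max()` with `with_ties = FALSE`. The `toy_lines` table is hypothetical, and which tied row is kept depends on row order.

```r
library(dplyr)

# Hypothetical line counts with a tie for the top speaker in episode 2
toy_lines <- tibble(
  season  = 1,
  episode = c(1, 1, 2, 2),
  speaker = c("Monica Geller", "Ross Geller", "Joey Tribbiani", "Chandler Bing"),
  count   = c(73, 48, 40, 40)
)

# filter(count == max(count)) keeps both tied rows in episode 2
top_ties <- toy_lines %>%
  group_by(season, episode) %>%
  filter(count == max(count)) %>%
  ungroup()

# slice_max(with_ties = FALSE) keeps exactly one row per episode
top_one <- toy_lines %>%
  group_by(season, episode) %>%
  slice_max(count, n = 1, with_ties = FALSE) %>%
  ungroup()
```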
47

Joining data

friends_info %>%
left_join(friends_top_actor, by = c("episode","season"))
48

Joining data

friends_info %>%
left_join(friends_top_actor, by = c("episode","season"))
season  episode  title                                           us_views_millions  imdb_rating  speaker
1       1        The Pilot                                       21.5               8.3          Monica Geller
1       2        The One with the Sonogram at the End            20.2               8.1          Ross Geller
1       3        The One with the Thumb                          19.5               8.2          Monica Geller
1       4        The One with George Stephanopoulos              19.7               8.1          Monica Geller
1       5        The One with the East German Laundry Detergent  18.6               8.5          Ross Geller
48

Basic scatter plot

friends_info %>%
left_join(friends_top_actor, by = c("episode","season")) %>%
ggplot(aes(x = us_views_millions, y = imdb_rating, color = speaker)) +
geom_jitter(size = 2)

49

Today's agenda


  1. scatter plots

  2. relational data

  3. dealing with overplotting

  4. bubble chart

50

Overplotting

When the sample size is large, overplotting can disguise trends

Techniques:
use smaller dots and/or transparency

add color by group

add jittering

add a rug plot

51

Overplotting

txhousing %>%
filter(month == 1 & listings < 20000) %>%
ggplot(aes(x = median, y = listings)) +
geom_point()

52

Overplotting

txhousing %>%
filter(month == 1 & listings < 20000) %>%
ggplot(aes(x = median, y = listings)) +
geom_point() +
geom_rug(color = "purple", alpha = 0.1, size = 2)

53

Overplotting

When the sample size is large, overplotting can disguise trends

Techniques:
use smaller dots and/or transparency

add color by group

add jittering

add a rug plot

add a marginal distribution

54

Overplotting

55

Overplotting

When the sample size is large, overplotting can disguise trends

Techniques:
use smaller dots and/or transparency

add color by group

add jittering

add a rug plot

add a marginal distribution

create a hexbin
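
A quick sketch of the first technique (smaller dots plus transparency) on the same `txhousing` subset used in the surrounding slides. The `size` and `alpha` values are illustrative guesses, not settings from the lecture.

```r
library(dplyr)
library(ggplot2)

# txhousing ships with ggplot2
p_alpha <- txhousing %>%
  filter(month == 1 & listings < 20000) %>%
  ggplot(aes(x = median, y = listings)) +
  geom_point(size = 0.8, alpha = 0.3)   # small, semi-transparent points
```

With transparency, regions where many points overlap appear darker, so dense areas stand out from isolated points.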

56

Overplotting

txhousing %>%
filter(month == 1 & listings < 20000) %>%
ggplot(aes(x = median, y = listings)) +
geom_hex(bins = 30)

57

Overplotting

storms %>%
ggplot(aes(x = wind, y = pressure)) +
geom_hex(bins = 30)
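
`geom_hex()` needs the hexbin package to be installed. If it is not available, a rectangular version of the same idea can be sketched with `geom_bin_2d()`, which comes with ggplot2. This is an alternative, not what the slides use.

```r
library(dplyr)
library(ggplot2)

# storms ships with dplyr; rectangular bins instead of hexagons
p_bin <- storms %>%
  ggplot(aes(x = wind, y = pressure)) +
  geom_bin_2d(bins = 30)
```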

58

Today's agenda


  1. scatter plots

  2. relational data

  3. dealing with overplotting

  4. bubble chart

59

Bubble chart


Mapping a third quantitative variable to point size

Not generally recommended!

Why?

- encoding the same type of variable (numeric) on two different scales: position and size
- hard to compare the strengths of different associations
- much easier to perceive differences when encoded by position rather than size
- hard to see small differences in size
- difficult to match scale of circle size to scale of difference
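
A small arithmetic sketch of the last two points. In ggplot2, `scale_size()` scales circle area and `scale_radius()` scales radius. Under the simplifying assumption that area is exactly proportional to the value (not how a specific `range` setting behaves), the radius grows only with the square root of the value, so real differences look smaller than they are.

```r
# Values taken from the legend breaks used in the bubble chart slides
bill_length <- c(35, 40, 45, 50, 55)

# Assumption: circle area is proportional to the value
area_ratio   <- bill_length / min(bill_length)   # relative circle area
radius_ratio <- sqrt(area_ratio)                 # relative circle radius

size_table <- round(data.frame(bill_length, area_ratio, radius_ratio), 2)
```

Comparing the `area_ratio` and `radius_ratio` columns of `size_table` shows how much less the visible width of a circle changes than the value it encodes.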

60

Bubble chart

penguins %>%
filter(species != "Gentoo") %>%
ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()

61

Bubble chart

penguins %>%
  filter(species != "Gentoo") %>%
  ggplot(aes(x = flipper_length_mm, y = body_mass_g,
             size = bill_length_mm)) +
  geom_point()

62

Bubble chart

penguins %>%
  filter(species != "Gentoo") %>%
  ggplot(aes(x = flipper_length_mm, y = body_mass_g,
             size = bill_length_mm)) +
  geom_point(alpha = 0.5, color = "red") +
  scale_size(range = c(0.1, 7), breaks = c(35, 40, 45, 50, 55)) +
  theme(legend.position = "top")

63

Bubble chart

# geom_label_repel() comes from the ggrepel package
penguins %>%
  filter(species != "Gentoo") %>%
  mutate(label = case_when(flipper_length_mm == 192 &
                             body_mass_g == 2700 ~ "Chinstrap",
                           flipper_length_mm == 184 &
                             body_mass_g == 4650 ~ "Adelie")) %>%
  ggplot(aes(x = flipper_length_mm, y = body_mass_g,
             size = bill_length_mm, color = species)) +
  geom_point(alpha = 0.5) +
  scale_size(range = c(0.1, 7), breaks = c(35, 40, 45, 50, 55),
             name = "Bill Length (mm)") +
  geom_label_repel(aes(x = flipper_length_mm, y = body_mass_g,
                       color = species, label = label),
                   inherit.aes = FALSE) +
  scale_color_discrete(guide = "none") +
  theme(legend.position = "top")
64

Bubble chart

65

Upcoming


Graphic critique due before midterm
Details on syllabus


Homework 2 due Tuesday June 8


Lecture 7 on Monday June 7

66
