class: inverse, center, middle # 36-315: Statistical Graphics and Visualization ## Lecture 12 Meghan Hall <br> Department of Statistics & Data Science <br> Carnegie Mellon University <br> June 21, 2021 --- layout: true <div class="my-footer"><span>cmu-36315.netlify.app</span></div> --- # Midterm <br> .large[Solution has been posted] <br> .large[Grades soon] <br> .large[Will go over a few commonly-missed concepts during lab tomorrow] --- # Schedule going forward <br> .large[Lab 8 tomorrow] <br> .medium[HW 4 due too] <br> .large[Group project check-in] <br> .medium[due on **Friday**] <br> .large[Last HW due *next* Tuesday] --- # Group project details <br> .large[Assigned groups of 3] -- <br> .large[Data needs to be publicly available] <br> .medium[possible sources: [538](https://data.fivethirtyeight.com/), [Kaggle](https://www.kaggle.com/datasets), [TidyTuesday](https://github.com/rfordatascience/tidytuesday)] -- <br> .large[Report *and* presentation] <br> .medium[help available from instructor/TAs] --- # Group project report .large[Must be made with R Markdown] <br> .medium[submit .Rmd/.html] -- <br> .large[Include an overview of the data and ~2-3 questions addressed comprehensively] <br> .medium[need to include several different types of plots] <br> .medium[thorough summaries] <br> .medium[conclusions & further work, can address limitations] -- <br> .large[Plots must go together] <br> .medium[same fonts, colors, etc. (great time to use a theme!)] <br> .medium[some customization is necessary] --- # Questions to address .large[Once] <br> .medium[what are the details of your data?] <br> .medium[what observation level? missing data?] -- <br> .large[For each plot/question] <br> .medium[run-down of the data prep steps] <br> .medium[thorough summaries] <br> .medium[what are the limitations of this analysis? what other data would you incorporate if you could?] -- <br> .large[Everything must be in plain English] <br> .medium["Data was aggregated by region" instead of "I used `group_by(region)`"] --- # Group project presentation <br> .large[Must be made with RStudio/R Markdown] <br> .medium[using **xaringan** (Friday's lecture)] -- <br> .large[~5 minutes] <br> .medium[everyone needs to speak] <br> .medium[must include overview of data] <br> .medium[plots & conclusions] -- <br> .large[Presentations on July 1 & 2] <br> .medium[will find a convenient time for each group] <br> .medium[sign-ups next week] --- # Group project deadlines <br> .large[Check-in] <br> .medium[due **this** Friday on Canvas] <br> .medium[a few paragraphs: quick summary on your collaboration strategy, your thoughts on a data set selection, etc.] -- <br> .large[Presentations] <br> .medium[**next** Thursday and Friday] -- <br> .large[Final reports] <br> .medium[**next** Friday, 11:30am] --- # Sample collaboration strategy .large[Decide on a data set together] <br> .large[Everyone is responsible for creating/addressing their own question] <br> .medium[with a few plots & plenty of supporting statements] <br> .large[Two people handle the creation of the presentation] <br> .large[While the other handles the report & streamlines graphs] --- # Today <br> .large[Trying out `tidymodels`] <br> .medium[creating, evaluating, visualizing models] <br> .large[Experimenting with user-defined functions] --- # Disclaimer <br> .large[This isn't a class on modeling] <br> .medium[designing good models: not our focus] <br> .medium[today's models are very basic!] <br> .medium[won't be asked to create models on HW/labs] -- <br> .large[This *is* a class on visualization] <br> .medium[and it's handy to visualize model outputs/comparison] <br> .medium[`tidymodels` plays well with `tidyverse` and `ggplot2`] --- # Today's data .center[`nhl2021_players`] .center[`nhl2021_events`] .center[![logo](figs/Lec12/nhl_logo.png)] --- # Questions to answer .large[Can we predict a player's position (forward or defenseman) from their time on ice?] <br> .medium[what if we add in their rate of scoring points and their rate of blocking shots?] <br> .medium[**logistic regression**] <br> -- .large[Is there a difference in the rate of scoring points by position?] <br> .medium[**hypothesis testing**] <br> -- .large[Can we predict a player's rate of scoring points by:] <br> .medium[their position, their time on ice, and their rate of blocking shots?] <br> .medium[**linear regression**] --- # Today's data .center[`nhl2021_players`] <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> Player </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> Season </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> Position </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> GP </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> TOI </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A.J. Greer </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> L </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 8.55 </td> </tr> <tr> <td style="text-align:left;"> Aaron Ekblad </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 878.18 </td> </tr> <tr> <td style="text-align:left;"> Aaron Ness </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 12.83 </td> </tr> <tr> <td style="text-align:left;"> Adam Boqvist </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 594.65 </td> </tr> <tr> <td style="text-align:left;"> Adam Brooks </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> C </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 117.45 </td> </tr> </tbody> </table> --- # Today's data .center[`nhl2021_events`] <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> game_id </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> game_date </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> event_index </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> event_type </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> event_player_1 </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> event_player_2 </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> event_player_3 </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2020020001 </td> <td style="text-align:left;"> 2021-01-13 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> BLOCK </td> <td style="text-align:left;"> SEAN.COUTURIER </td> <td style="text-align:left;"> BRIAN.DUMOULIN </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:right;"> 2020020001 </td> <td style="text-align:left;"> 2021-01-13 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:left;"> BLOCK </td> <td style="text-align:left;"> MARK.JANKOWSKI </td> <td style="text-align:left;"> ROBERT.HAGG </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:right;"> 2020020001 </td> <td style="text-align:left;"> 2021-01-13 </td> <td style="text-align:right;"> 27 </td> <td style="text-align:left;"> BLOCK </td> <td style="text-align:left;"> BRANDON.TANEV </td> <td style="text-align:left;"> NOLAN.PATRICK </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:right;"> 2020020001 </td> <td style="text-align:left;"> 2021-01-13 </td> <td style="text-align:right;"> 52 </td> <td style="text-align:left;"> BLOCK </td> <td style="text-align:left;"> BRANDON.TANEV </td> <td style="text-align:left;"> OSKAR.LINDBLOM </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:right;"> 2020020001 </td> <td style="text-align:left;"> 2021-01-13 </td> <td style="text-align:right;"> 58 </td> <td style="text-align:left;"> GOAL </td> <td style="text-align:left;"> MARK.JANKOWSKI </td> <td style="text-align:left;"> BRANDON.TANEV </td> <td style="text-align:left;"> JARED.MCCANN </td> </tr> </tbody> </table> --- # Data prep steps <br> .large[Where do we want to be at the end?] <br> .large[One row per player, with the variables we need:] <br> .medium[**position**: from `nhl2021_players`] <br> .medium[**time on ice**: from `nhl2021_players`] <br> .medium[**scoring rate**: from `nhl2021_events`] <br> .medium[**blocked shots**: from `nhl2021_events`] --- .h1[# Data prep steps] <br> .large[ 1. Summarize the shots blocked from `nhl2021_events` 2. Summarize the points from `nhl2021_events` 3. Combine 1 and 2 4. Join `nhl2021_events` into `nhl2021_players` ] --- # Step 1 summarize the shots blocked from `nhl2021_events` <br> can use the `name` argument within `count` to rename the default `n` ```r blocks <- nhl2021_events %>% filter(event_type == "BLOCK") %>% count(event_player_2, name = "blocks") %>% rename(player = event_player_2) ``` -- <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> player </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> blocks </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> AARON.EKBLAD </td> <td style="text-align:right;"> 27 </td> </tr> <tr> <td style="text-align:left;"> ADAM.BOQVIST </td> <td style="text-align:right;"> 29 </td> </tr> <tr> <td style="text-align:left;"> ADAM.BROOKS </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> ADAM.ERNE </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:left;"> ADAM.FOX </td> <td style="text-align:right;"> 102 </td> </tr> </tbody> </table> --- # Step 2 summarize the points from `nhl2021_events` <br> all `event_player`s on a goal get a point, so pivoting is necessary ```r points <- nhl2021_events %>% select(event_type, event_player_1:event_player_3) %>% filter(event_type == "GOAL") %>% pivot_longer(event_player_1:event_player_3, names_to = NULL, values_to = "player", values_drop_na = TRUE) %>% count(player, name = "points") ``` -- <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> player </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> points </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> AARON.EKBLAD </td> <td style="text-align:right;"> 22 </td> </tr> <tr> <td style="text-align:left;"> ADAM.BOQVIST </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:left;"> ADAM.BROOKS </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> ADAM.ERNE </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:left;"> ADAM.FOX </td> <td style="text-align:right;"> 47 </td> </tr> </tbody> </table> --- # Step 3 when we want to combine **all** combinations from two data frames, use `full_join` <br> `mutate_if` is a shortcut to apply the same function to multiple variables ```r events <- full_join(blocks, points, by = "player") %>% mutate_if(is.numeric, replace_na, replace = 0) ``` -- <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> player </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> blocks </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> points </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> AARON.EKBLAD </td> <td style="text-align:right;"> 27 </td> <td style="text-align:right;"> 22 </td> </tr> <tr> <td style="text-align:left;"> ADAM.BOQVIST </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:left;"> ADAM.BROOKS </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> ADAM.ERNE </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:left;"> ADAM.FOX </td> <td style="text-align:right;"> 102 </td> <td style="text-align:right;"> 47 </td> </tr> </tbody> </table> --- # Step 4 need to reformat the `Player` variable to match ```r final_data <- nhl2021_players %>% mutate(Player = str_to_upper(Player), Player = str_replace(Player, " ", ".")) ``` -- <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> Player </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> Season </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> Position </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> GP </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> TOI </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A.J..GREER </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> L </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 8.55 </td> </tr> <tr> <td style="text-align:left;"> AARON.EKBLAD </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 878.18 </td> </tr> <tr> <td style="text-align:left;"> AARON.NESS </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 12.83 </td> </tr> <tr> <td style="text-align:left;"> ADAM.BOQVIST </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 594.65 </td> </tr> <tr> <td style="text-align:left;"> ADAM.BROOKS </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> C </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 117.45 </td> </tr> </tbody> </table> --- # Step 4 can now join in our events data <br> `mutate_at` is another efficient option ```r final_data <- nhl2021_players %>% mutate(Player = str_to_upper(Player), Player = str_replace(Player, " ", ".")) %>% left_join(events, by = c("Player" = "player")) %>% mutate_at(c("points", "blocks"), replace_na, replace = 0) ``` -- <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> Player </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> Season </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> Position </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> GP </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> TOI </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> blocks </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> points </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A.J..GREER </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> L </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 8.55 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> AARON.EKBLAD </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 878.18 </td> <td style="text-align:right;"> 27 </td> <td style="text-align:right;"> 22 </td> </tr> <tr> <td style="text-align:left;"> AARON.NESS </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 12.83 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> ADAM.BOQVIST </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 594.65 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:left;"> ADAM.BROOKS </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> C </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 117.45 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 5 </td> </tr> </tbody> </table> --- # Step 4 filter to only those who played >= 1/3 of a season <br> create some rate variables for better comparison ```r final_data <- nhl2021_players %>% mutate(Player = str_to_upper(Player), Player = str_replace(Player, " ", ".")) %>% left_join(events, by = c("Player" = "player")) %>% mutate_at(c("points", "blocks"), replace_na, replace = 0) %>% filter(GP >= 18) %>% mutate(points_60 = points * 60 / TOI, blocks_60 = blocks * 60 / TOI, TOI_game = TOI / GP) ``` -- <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> Player </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> Season </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> Position </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> GP </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> TOI </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> blocks </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> points </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> points_60 </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> blocks_60 </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> TOI_game </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> AARON.EKBLAD </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 878.18 </td> <td style="text-align:right;"> 27 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 1.503109 </td> <td style="text-align:right;"> 1.844724 </td> <td style="text-align:right;"> 25.09086 </td> </tr> <tr> <td style="text-align:left;"> ADAM.BOQVIST </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 594.65 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 1.614395 </td> <td style="text-align:right;"> 2.926091 </td> <td style="text-align:right;"> 16.99000 </td> </tr> <tr> <td style="text-align:left;"> ADAM.ERNE </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> L </td> <td style="text-align:right;"> 45 </td> <td style="text-align:right;"> 626.45 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 1.915556 </td> <td style="text-align:right;"> 1.532445 </td> <td style="text-align:right;"> 13.92111 </td> </tr> <tr> <td style="text-align:left;"> ADAM.FOX </td> <td style="text-align:left;"> 20-21 </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 55 </td> <td style="text-align:right;"> 1358.85 </td> <td style="text-align:right;"> 102 </td> <td style="text-align:right;"> 47 </td> <td style="text-align:right;"> 2.075284 </td> <td style="text-align:right;"> 4.503808 </td> <td style="text-align:right;"> 24.70636 </td> </tr> </tbody> </table> --- # Step 4 change our outcome variable to a factor (easier for logistic regression) <br> eliminate unnecessary variables ```r final_data <- nhl2021_players %>% mutate(Player = str_to_upper(Player), Player = str_replace(Player, " ", ".")) %>% left_join(events, by = c("Player" = "player")) %>% mutate_at(c("points", "blocks"), replace_na, replace = 0) %>% filter(GP >= 18) %>% mutate(points_60 = points * 60 / TOI, blocks_60 = blocks * 60 / TOI, TOI_game = TOI / GP) %>% mutate(defense = as.factor(ifelse(Position == "D", "D", "F"))) %>% select(-c(Position, Season, points, blocks, TOI, GP)) ``` -- <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> Player </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> points_60 </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> blocks_60 </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> TOI_game </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> defense </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> AARON.EKBLAD </td> <td style="text-align:right;"> 1.503109 </td> <td style="text-align:right;"> 1.844724 </td> <td style="text-align:right;"> 25.09086 </td> <td style="text-align:left;"> D </td> </tr> <tr> <td style="text-align:left;"> ADAM.BOQVIST </td> <td style="text-align:right;"> 1.614395 </td> <td style="text-align:right;"> 2.926091 </td> <td style="text-align:right;"> 16.99000 </td> <td style="text-align:left;"> D </td> </tr> </tbody> </table> --- # Exploring our data ```r final_data %>% ggplot(aes(x = TOI_game, fill = defense)) + * geom_density(alpha = 0.8, color = NA) + scale_fill_brewer(type = "qual", name = NULL) + * guides(fill = guide_legend(override.aes = list(alpha = 1))) + scale_x_continuous(expand = expansion(mult = c(0.01, 0.01))) + scale_y_continuous(expand = expansion(mult = c(0, 0))) + labs(x = "Time on ice per game (min)") + cmu_theme() + theme(legend.position = c(0.9, 0.9)) ``` --- # Exploring our data <img src="figs/Lec12/explore-1-1.png" width="504" style="display: block; margin: auto;" /> --- # User-defined functions <br> .large[What if we wanted to replicate that code for our other variables of interest?] <br> .medium[could just copy our code, or:] -- <br> .large[User-defined functions:] <br> .medium[useful any time you're tempted to replicate code] <br> .medium[avoids mistakes from copying and pasting] <br> .medium[easy to apply your changes everywhere] --- # User-defined functions ```r *density_fn <- function(x_var) { final_data %>% ggplot(aes(x = x_var, fill = defense)) + geom_density(alpha = 0.8, color = NA) + scale_fill_brewer(type = "qual", name = NULL) + guides(fill = guide_legend(override.aes = list(alpha = 1))) + scale_x_continuous(expand = expansion(mult = c(0.01, 0.01))) + scale_y_continuous(expand = expansion(mult = c(0, 0.03))) + cmu_theme() + theme(legend.position = c(0.9, 0.9)) } ``` --- # User-defined functions ```r density_fn(final_data$TOI_game) + labs(x = "Time on ice per game (min)") ``` <img src="figs/Lec12/explore-3-1.png" width="504" style="display: block; margin: auto;" /> --- # User-defined functions ```r density_fn(final_data$blocks_60) + labs(x = "Shots blocked per 60 minutes") ``` <img src="figs/Lec12/explore-4-1.png" width="504" style="display: block; margin: auto;" /> --- # User-defined functions ```r density_fn(final_data$points_60) + labs(x = "Points per 60 min") ``` <img src="figs/Lec12/explore-5-1.png" width="504" style="display: block; margin: auto;" /> --- # Exploring our data <img src="figs/Lec12/explore-6-1.png" width="504" style="display: block; margin: auto;" /> --- # Exploring our data ```r final_data %>% ggplot(aes(x = TOI_game, y = blocks_60, color = defense)) + geom_point(alpha = 0.7, size = 2.5) + scale_color_brewer(type = "qual", name = NULL) + labs(x = "Time on ice per game (min)", y = "Shots blocked per 60 min") + guides(color = guide_legend(override.aes = * list(alpha = 1, size = 5))) + cmu_theme() + theme(legend.position = c(0.9, 0.8)) ``` --- # Exploring our data <img src="figs/Lec12/explore-7-1.png" width="504" style="display: block; margin: auto;" /> --- # First steps for modeling split the data into **training** data and **testing** data ```r # this ensures that your samples are reproducible set.seed(1234) split_data <- initial_split(data = final_data, # sets the proportion for training prop = 0.7, strata = defense) ``` -- why is `strata = defense` necessary? --- .h1[# First steps for modeling] .tiny[ ```r final_data %>% mutate(defense = recode(defense, "D" = "defensemen", "F" = "forwards")) %>% ggplot(aes(x = defense, fill = defense)) + geom_bar() + scale_fill_brewer(type = "qual", name = NULL) + scale_y_continuous(expand = expansion(mult = c(0, 0.05))) + cmu_theme() + theme(legend.position = "none", axis.title.x = element_blank()) ``` <img src="figs/Lec12/prep-2-1.png" width="504" style="display: block; margin: auto;" /> ] --- # First steps for modeling split the data into **training** data and **testing** data ```r training_data <- training(split_data) testing_data <- testing(split_data) ``` --- # Questions to answer .large[Can we predict a player's position (forward or defenseman) from their time on ice?] <br> .medium[what if we add in their rate of scoring points and their rate of blocking shots?] <br> .medium[**logistic regression**] --- # The modeling workflow *for informational purposes only!* ```r TOI_only <- recipe(defense ~ TOI_game, data = training_data) %>% prep() ``` what can you do in a recipe? - create dummy variables - scale/normalize/transform variables - remove missing data - automatically remove variables that are highly correlated --- # The modeling workflow *for informational purposes only!* ```r TOI_only <- recipe(defense ~ TOI_game, data = training_data) %>% prep() TOI_only ``` ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 1 ## ## Training data contained 462 data points and no missing data. ``` --- # The modeling workflow *for informational purposes only!* ```r TOI_only <- recipe(defense ~ TOI_game, data = training_data) %>% prep() summary(TOI_only) ``` ``` ## # A tibble: 2 x 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 TOI_game numeric predictor original ## 2 defense nominal outcome original ``` --- # The modeling workflow *for informational purposes only!* ```r logistic <- logistic_reg(mode = "classification") %>% set_engine("glm") workflow_TOI <- workflow() %>% add_model(logistic) %>% add_recipe(TOI_only) TOI_fit <- workflow_TOI %>% fit(data = training_data) ``` --- # The modeling workflow *for informational purposes only!* ```r TOI_fit %>% pull_workflow_fit() %>% tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 7.26 0.704 10.3 6.62e-25 ## 2 TOI_game -0.388 0.0400 -9.70 3.01e-22 ``` --- # The modeling workflow *for informational purposes only!* normally you'd use cross-validation with training data (split into assessment & analysis) to compare models and *simulate* performance with new data, then save the test set for evaluating *actual* performance with new data here we're going straight for the test set ```r TOI_predictions_class <- predict(TOI_fit, testing_data, type = "class") %>% bind_cols(testing_data %>% select(Player, defense, TOI_game)) TOI_predictions <- predict(TOI_fit, testing_data, type = "prob") %>% bind_cols(TOI_predictions_class) %>% mutate(correct = .pred_class == defense, prediction = ifelse(defense == "D", .pred_D, .pred_F)) ``` --- # The results what do our results look like? good to have both `.pred_class` and `.pred_D` <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .pred_D </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .pred_F </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .pred_class </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> Player </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> defense </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> TOI_game </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> correct </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> prediction </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.9224885 </td> <td style="text-align:right;"> 0.0775115 </td> <td style="text-align:left;"> D </td> <td style="text-align:left;"> AARON.EKBLAD </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 25.09086 </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:right;"> 0.9224885 </td> </tr> <tr> <td style="text-align:right;"> 0.9111255 </td> <td style="text-align:right;"> 0.0888745 </td> <td style="text-align:left;"> D </td> <td style="text-align:left;"> ADAM.FOX </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 24.70636 </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:right;"> 0.9111255 </td> </tr> <tr> <td style="text-align:right;"> 0.7124500 </td> <td style="text-align:right;"> 0.2875500 </td> <td style="text-align:left;"> D </td> <td style="text-align:left;"> ADAM.PELECH </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 21.04643 </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:right;"> 0.7124500 </td> </tr> <tr> <td style="text-align:right;"> 0.3342578 </td> <td style="text-align:right;"> 0.6657422 </td> <td style="text-align:left;"> F </td> <td style="text-align:left;"> ADRIAN.KEMPE </td> <td style="text-align:left;"> F </td> <td style="text-align:right;"> 16.93250 </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:right;"> 0.6657422 </td> </tr> <tr> <td style="text-align:right;"> 0.0771476 </td> <td style="text-align:right;"> 0.9228524 </td> <td style="text-align:left;"> F </td> <td style="text-align:left;"> ALEX.BARABANOV </td> <td style="text-align:left;"> F </td> <td style="text-align:right;"> 12.31227 </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:right;"> 0.9228524 </td> </tr> </tbody> </table> --- .left.h1[# The results] .pull-left.tiny[ ```r TOI_predictions %>% ggplot(aes(x = defense, y = TOI_game, color = .pred_class, shape = correct)) + geom_jitter(size = 4) + scale_shape_manual(values = c("triangle", "circle"), name = NULL) + scale_color_brewer(type = "qual", name = "Predicted position") + guides(color = guide_legend( override.aes = list(size = 5))) + labs(y = "Time on ice per game (min)", x = "Actual position") + cmu_theme() + theme(legend.position = "top", axis.title = element_text( size = 14), legend.text = element_text( size = 13)) ``` ] -- .pull-right[ <img src="figs/Lec12/results-2-1.png" width="504" style="display: block; margin: auto;" /> ] --- # The modeling workflow *for informational purposes only!* for our `TOI_plus` model ```r TOI_plus <- recipe(defense ~ TOI_game + points_60 + blocks_60, data = training_data) %>% prep() workflow_TOI_plus <- workflow() %>% add_model(logistic) %>% add_recipe(TOI_plus) TOI_plus_fit <- workflow_TOI_plus %>% fit(data = training_data) ``` --- # The modeling workflow *for informational purposes only!* ```r TOI_plus_predictions_class <- predict(TOI_plus_fit, testing_data, type = "class") %>% bind_cols(testing_data %>% select(Player, defense, TOI_game, points_60, blocks_60)) TOI_plus_predictions <- predict(TOI_plus_fit, testing_data, type = "prob") %>% bind_cols(TOI_plus_predictions_class) %>% mutate(correct = .pred_class == defense, prediction = ifelse(defense == "D", .pred_D, .pred_F)) ``` --- # The results what do our results look like? good to have both `.pred_class` and `.pred_D` <table class="table" style="font-size: 13px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .pred_D </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .pred_F </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .pred_class </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> Player </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> defense </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> TOI_game </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> points_60 </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> blocks_60 </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> correct </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> prediction </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.9962503 </td> <td style="text-align:right;"> 0.0037497 </td> <td style="text-align:left;"> D </td> <td style="text-align:left;"> AARON.EKBLAD </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 25.09086 </td> <td style="text-align:right;"> 1.503109 </td> <td style="text-align:right;"> 1.844724 </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:right;"> 0.9962503 </td> </tr> <tr> <td style="text-align:right;"> 0.9891216 </td> <td style="text-align:right;"> 0.0108784 </td> <td style="text-align:left;"> D </td> <td style="text-align:left;"> ADAM.FOX </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 24.70636 </td> <td style="text-align:right;"> 2.075284 </td> <td style="text-align:right;"> 4.503808 </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:right;"> 0.9891216 </td> </tr> <tr> <td style="text-align:right;"> 0.9990881 </td> <td style="text-align:right;"> 0.0009119 </td> <td style="text-align:left;"> D </td> <td style="text-align:left;"> ADAM.PELECH </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 21.04643 </td> <td style="text-align:right;"> 0.712710 </td> <td style="text-align:right;"> 3.716273 </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:right;"> 0.9990881 </td> </tr> <tr> <td style="text-align:right;"> 0.0222108 </td> <td style="text-align:right;"> 0.9777892 </td> <td style="text-align:left;"> F </td> <td style="text-align:left;"> ADRIAN.KEMPE </td> <td style="text-align:left;"> F </td> <td style="text-align:right;"> 16.93250 </td> <td style="text-align:right;"> 1.835017 </td> <td style="text-align:right;"> 1.328806 </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:right;"> 0.9777892 </td> </tr> <tr> <td style="text-align:right;"> 0.0016860 </td> <td style="text-align:right;"> 0.9983140 </td> <td style="text-align:left;"> F </td> <td style="text-align:left;"> ALEX.BARABANOV </td> <td style="text-align:left;"> F </td> <td style="text-align:right;"> 12.31227 </td> <td style="text-align:right;"> 1.772068 </td> <td style="text-align:right;"> 3.101119 </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:right;"> 0.9983140 </td> </tr> </tbody> </table> --- .left.h1[# The results] .pull-left.tiny[ ```r TOI_plus_predictions %>% ggplot(aes(x = defense, y = TOI_game, color = .pred_class, shape = correct)) + geom_jitter(size = 4) + scale_shape_manual(values = c("triangle", "circle"), name = NULL) + scale_color_brewer(type = "qual", name = "Predicted position") + guides(color = guide_legend( override.aes = list(size = 5))) + labs(y = "Time on ice per game (min)", x = "Actual position") + cmu_theme() + theme(legend.position = "top", axis.title = element_text( size = 14), legend.text = element_text( size = 13)) ``` ] -- .pull-right[ <img src="figs/Lec12/results-4-1.png" width="504" style="display: block; margin: auto;" /> ] --- .h1[# Metrics to evaluate and compare] .large[ 1. confusion matrix 2. ROC curve 3. density curves 4. AUC/log loss/accuracy ] --- # Confusion matrix small heat map with predicted value against actual ```r confusion_matrix <- function(data_set) { data_set %>% * conf_mat(defense, .pred_class) %>% pluck(1) %>% as_tibble() %>% * ggplot(aes(x = Prediction, y = Truth, alpha = n)) + geom_tile() + geom_text(aes(label = n), colour = "white", alpha = 1, size = 8) + cmu_theme() + theme(panel.grid.major = element_blank(), legend.position = "none") + labs(y = "Actual Position", x = "Predicted Position", title = "Confusion Matrix") } ``` --- # Confusion matrix ```r confusion_matrix(TOI_predictions) + labs(subtitle = "TOI Only Model") ``` <img src="figs/Lec12/evaluate-2-1.png" width="504" style="display: block; margin: auto;" /> --- # Confusion matrix ```r confusion_matrix(TOI_plus_predictions) + labs(subtitle = "TOI-Plus Model") ``` <img src="figs/Lec12/evaluate-3-1.png" width="504" style="display: block; margin: auto;" /> --- # ROC curve plots false positive against true positive shows how good the model is at classifying aim is to maximize the area under it ```r ROC_curve <- function(dataset) { dataset %>% * roc_curve(defense, .pred_D) %>% * ggplot(aes(x = 1 - specificity, y = sensitivity)) + geom_path() + geom_abline(linetype = 3) + coord_equal() + labs(y = "True Positive Rate (Sensitivity)", x = "False Positive Rate", title = "ROC Curve") + cmu_theme() } ``` --- # ROC curve ```r ROC_curve(TOI_predictions) + labs(subtitle = "TOI Only Model") ``` <img src="figs/Lec12/evaluate-5-1.png" width="504" style="display: block; margin: auto;" /> --- # ROC curve ```r ROC_curve(TOI_plus_predictions) + labs(subtitle = "TOI-Plus Model") ``` <img src="figs/Lec12/evaluate-6-1.png" width="504" style="display: block; margin: auto;" /> --- # Density curves show densities for predicted values, colored by actual ones ```r density_pred_fn <- function(predictions) { predictions %>% ggplot(aes(x = .pred_D, fill = defense)) + geom_density(alpha = 0.5, color = NA) + scale_fill_brewer(type = "qual", name = "Actual\nPosition") + labs(x = "Predicted Defense") + cmu_theme() + theme(legend.position = c(0.8, 0.8)) } ``` --- # Density curves ```r density_pred_fn(TOI_predictions) + labs(title = "TOI Only Model") ``` <img src="figs/Lec12/evaluate-8-1.png" width="504" style="display: block; margin: auto;" /> --- # Density curves ```r density_pred_fn(TOI_plus_predictions) + labs(title = "TOI-Plus Model") ``` <img src="figs/Lec12/evaluate-9-1.png" width="504" style="display: block; margin: auto;" /> --- # Individual metrics accuracy: how many predictions were right? ```r TOI_predictions %>% accuracy(defense, .pred_class) ``` <table class="table" style="font-size: 14px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .metric </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .estimator </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .estimate </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> accuracy </td> <td style="text-align:left;"> binary </td> <td style="text-align:right;"> 0.765 </td> </tr> </tbody> </table> -- log loss: how good were the predictions themselves? ```r TOI_predictions %>% mn_log_loss(defense, .pred_D) ``` <table class="table" style="font-size: 14px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .metric </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .estimator </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .estimate </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> mn_log_loss </td> <td style="text-align:left;"> binary </td> <td style="text-align:right;"> 0.4735617 </td> </tr> </tbody> </table> --- # Individual metrics AUC: what was the area under that ROC curve? ```r TOI_predictions %>% roc_auc(defense, .pred_D) ``` <table class="table" style="font-size: 14px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .metric </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .estimator </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .estimate </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> roc_auc </td> <td style="text-align:left;"> binary </td> <td style="text-align:right;"> 0.8175134 </td> </tr> </tbody> </table> --- # Individual metrics ```r metrics <- metric_set(accuracy, mn_log_loss, roc_auc) metrics_all <- bind_rows( metrics(TOI_predictions, truth = defense, estimate = .pred_class, .pred_D) %>% mutate(model = "TOI only"), metrics(TOI_plus_predictions, truth = defense, estimate = .pred_class, .pred_D) %>% mutate(model = "TOI plus") ) ``` --- # Comparing metrics <table class="table" style="font-size: 14px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .metric </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .estimator </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> .estimate </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> model </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> accuracy </td> <td style="text-align:left;"> binary </td> <td style="text-align:right;"> 0.7650000 </td> <td style="text-align:left;"> TOI only </td> </tr> <tr> <td style="text-align:left;"> mn_log_loss </td> <td style="text-align:left;"> binary </td> <td style="text-align:right;"> 0.4735617 </td> <td style="text-align:left;"> TOI only </td> </tr> <tr> <td style="text-align:left;"> roc_auc </td> <td style="text-align:left;"> binary </td> <td style="text-align:right;"> 0.8175134 </td> <td style="text-align:left;"> TOI only </td> </tr> <tr> <td style="text-align:left;"> accuracy </td> <td style="text-align:left;"> binary </td> <td style="text-align:right;"> 0.9600000 </td> <td style="text-align:left;"> TOI plus </td> </tr> <tr> <td style="text-align:left;"> mn_log_loss </td> <td style="text-align:left;"> binary </td> <td style="text-align:right;"> 0.1084734 </td> <td style="text-align:left;"> TOI plus </td> </tr> <tr> <td style="text-align:left;"> roc_auc </td> <td style="text-align:left;"> binary </td> <td style="text-align:right;"> 0.9911988 </td> <td style="text-align:left;"> TOI plus </td> </tr> </tbody> </table> --- # Comparing metrics <img src="figs/Lec12/evaluate-16-1.png" width="504" style="display: block; margin: auto;" /> --- .h1[# Comparing metrics] .tiny[ ```r metrics_all %>% * mutate(label = case_when(.metric == "accuracy" ~ "Accuracy\n(higher is better)", * .metric == "roc_auc" ~ "Area Under Curve\n(higher is better)", * .metric == "mn_log_loss" ~ "Log Loss\n(lower is better)")) %>% ggplot(aes(fill = model, y = .estimate, x = model)) + geom_bar(stat = "identity") + scale_fill_brewer(type = "qual", palette = 2) + scale_y_continuous(expand = expansion(mult = c(0, .1))) + cmu_theme() + labs(y = "Value", x = NULL, fill = NULL, title = "Comparing Our Models") + facet_wrap(~label) + * geom_text(aes(label = round(.estimate, 3)), vjust = -0.5, * size = 5, position = position_dodge(width= 0.9)) + theme(legend.position = "bottom", axis.text.x = element_blank(), legend.margin = margin(t = -3, b = -3), strip.background = element_rect(fill = "light grey"), strip.text = element_text(color = "black", size = 12)) ``` ] --- # Questions to answer .large[Can we predict a player's position (forward or defenseman) from their time on ice?] <br> .medium[what if we add in their rate of scoring points and their rate of blocking shots?] <br> .medium[**logistic regression**] <br> .large[Is there a difference in the rate of scoring points by position?] <br> .medium[**hypothesis testing**] <br> --- # Hypothesis testing ```r null_distribution <- final_data %>% specify(response = points_60, explanatory = defense) %>% hypothesize(null = "independence") %>% generate(reps = 500, type = "permute") %>% calculate("diff in means", order = c("D", "F")) ``` --- # Hypothesis testing ```r null_distribution %>% visualize() + cmu_theme() ``` <img src="figs/Lec12/hypothesis-2-1.png" width="504" style="display: block; margin: auto;" /> --- # Hypothesis testing ```r estimate <- final_data %>% specify(response = points_60, explanatory = defense) %>% calculate("diff in means", order = c("D", "F")) ``` <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> stat </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -0.8836989 </td> </tr> </tbody> </table> --- # Hypothesis testing ```r null_distribution %>% visualize() + shade_p_value(obs_stat = estimate, direction = "two_sided") + cmu_theme() ``` <img src="figs/Lec12/hypothesis-5-1.png" width="504" style="display: block; margin: auto;" /> --- # Questions to answer .large[Can we predict a player's position (forward or defenseman) from their time on ice?] <br> .medium[what if we add in their rate of scoring points and their rate of blocking shots?] <br> .medium[**logistic regression**] <br> .large[Is there a difference in the rate of scoring points by position?] <br> .medium[**hypothesis testing**] <br> .large[Can we predict a player's rate of scoring points by:] <br> .medium[their position, their time on ice, and their rate of blocking shots?] <br> .medium[**linear regression**] --- # Linear regression ```r linear <- linear_reg(mode = "regression") %>% set_engine("lm") linear_recipe <- recipe(points_60 ~ TOI_game + blocks_60 + defense, data = training_data) %>% step_dummy(all_nominal_predictors()) %>% prep() ``` <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> variable </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> type </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> role </th> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> source </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> TOI_game </td> <td style="text-align:left;"> numeric </td> <td style="text-align:left;"> predictor </td> <td style="text-align:left;"> original </td> </tr> <tr> <td style="text-align:left;"> blocks_60 </td> <td style="text-align:left;"> numeric </td> <td style="text-align:left;"> predictor </td> <td style="text-align:left;"> original </td> </tr> <tr> <td style="text-align:left;"> points_60 </td> <td style="text-align:left;"> numeric </td> <td style="text-align:left;"> outcome </td> <td style="text-align:left;"> original </td> </tr> <tr> <td style="text-align:left;"> defense_F </td> <td style="text-align:left;"> numeric </td> <td style="text-align:left;"> predictor </td> <td style="text-align:left;"> derived </td> </tr> </tbody> </table> --- # Linear regression ```r workflow_linear <- workflow() %>% add_model(linear) %>% add_recipe(linear_recipe) workflow_linear ``` ``` ## ══ Workflow ════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: linear_reg() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────── ## 1 Recipe Step ## ## ● step_dummy() ## ## ── Model ─────────────────────────────────────────────────────────────────────── ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` --- # Linear regression ```r linear_fit <- workflow_linear %>% fit(data = training_data) ``` <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> term </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> estimate </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> std.error </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> statistic </th> <th style="text-align:right;font-weight: bold;color: white !important;background-color: #bb0000 !important;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -0.9262828 </td> <td style="text-align:right;"> 0.1849154 </td> <td style="text-align:right;"> -5.009224 </td> <td style="text-align:right;"> 8e-07 </td> </tr> <tr> <td style="text-align:left;"> TOI_game </td> <td style="text-align:right;"> 0.1324807 </td> <td style="text-align:right;"> 0.0077190 </td> <td style="text-align:right;"> 17.163004 </td> <td style="text-align:right;"> 0e+00 </td> </tr> <tr> <td style="text-align:left;"> blocks_60 </td> <td style="text-align:right;"> -0.1665361 </td> <td style="text-align:right;"> 0.0224687 </td> <td style="text-align:right;"> -7.411898 </td> <td style="text-align:right;"> 0e+00 </td> </tr> <tr> <td style="text-align:left;"> defenseF </td> <td style="text-align:right;"> 1.0478876 </td> <td style="text-align:right;"> 0.0777103 </td> <td style="text-align:right;"> 13.484536 </td> <td style="text-align:right;"> 0e+00 </td> </tr> </tbody> </table> --- # Linear regression ```r linear_predictions <- predict(linear_fit, new_data = testing_data) %>% bind_cols(testing_data) %>% mutate(residual = points_60 - .pred) rsq(linear_predictions, truth = points_60, estimate = .pred) ``` ``` ## # A tibble: 1 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rsq standard 0.584 ``` --- .h1[# Linear regression] .tiny[ <img src="figs/Lec12/regression-5-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Linear regression ```r linear_predictions %>% ggplot(aes(x = .pred, y = points_60, color = defense)) + geom_point(size = 3) + scale_color_brewer(type = "qual", name = NULL) + geom_abline(intercept = 0, slope = 1, color = "black") + guides(color = guide_legend(override.aes = list(size = 5))) + cmu_theme() + * annotate("text", x = 0.5, y = 2.5, label = "R^2 == 0.584", * parse = TRUE) + labs(title = "Linear Regression Results", x = "Predicted Points Rate", y = "Actual Points Rate") + theme(legend.position = c(0.3, 0.8)) ``` --- .h1[# Linear regression] .tiny[ <img src="figs/Lec12/regression-6-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Linear regression ```r linear_predictions %>% ggplot(aes(x = .pred, y = residual, color = defense)) + geom_point(size = 3) + scale_color_brewer(type = "qual", name = NULL) + geom_hline(yintercept = 0, color = "black") + guides(color = guide_legend(override.aes = list(size = 5))) + coord_equal() + cmu_theme() + labs(title = "Linear Regression Results", x = "Predicted Points Rate", y = "Residual") + theme(legend.position = c(0.1, 0.1)) ``` --- # Upcoming <br> .large[Homework 4 due Tuesday June 22] <br> .large[Lab 8 on Tuesday June 22] <br> .large[Lecture 13 on Wednesday June 23] <br> .medium[creating tables and analyzing text data]