class: center, middle, inverse, title-slide

# CONJ620: CM 1.2
## Introduction to the Tidyverse
### Alison Hill

---

## I'm assuming...

You have working versions of:

- R and
- RStudio

installed on your computer.

---

## Today's in-class lab

- Our notepad: http://bit.ly/conj620-cm012

--

- [Graphics and statistics for cardiology: comparing categorical and continuous variables](http://heart.bmj.com/content/early/2016/01/27/heartjnl-2015-308104.full)
    - [Full text](http://faculty.washington.edu/kenrice/heartgraphs/effectivegraphs.pdf)
    - doi:10.1136/heartjnl-2015-308104
    - Site: http://faculty.washington.edu/kenrice/heartgraphs/

--

- The data is from [NHANES](https://www.cdc.gov/nchs/nhanes/index.htm): the CDC's National Health and Nutrition Examination Survey

---

## Set-up

I provide this code in your [lab worksheet](../labs/cm012-worksheet.html):

```r
library(tidyverse)
heart <- read_csv("http://faculty.washington.edu/kenrice/heartgraphs/nhanesmedium.csv", na = ".")
```

Some notes:

- I'm going to ask that you trust me with this worksheet! You haven't learned about this kind of document yet, called an R Markdown (`.Rmd`) file. Please just go with it!
I promise we'll actually explain it in a later lab 😇
- Don't forget the notepad: http://bit.ly/conj620-cm012

---

## Data Dictionary

From the [data dictionary](http://faculty.washington.edu/kenrice/heartgraphs/):

* `BPXSAR`: systolic blood pressure (mmHg)
* `BPXDAR`: diastolic blood pressure (mmHg)
* `BPXDI1`, `BPXDI2`: two diastolic blood pressure readings
* `race_ethc`: race/ethnicity, coded as:
    - Hispanic,
    - White non-Hispanic,
    - Black non-Hispanic and
    - Other
* `gender`: sex, coded as Male/Female
* `DR1TFOLA`: folate intake (μg/day)
* `RIAGENDR`: sex, coded as 1/2
* `BMXBMI`: body mass index (kg/m²)
* `RIDAGEYR`: age (years)

---

## Chapter 1: Data Wrangling

.pull-left[
- print a tibble
    - `heart`
- install a package
    - `install.packages("dplyr")`
    - do 1x per machine
- load an installed package
    - `library(dplyr)`
    - do 1x per work session
- assign a variable a name (`<-`)
]

--

.pull-right[
- `dplyr::filter`
- `dplyr::arrange`
- `dplyr::mutate`
]

---
class: middle, center, inverse

![](images/rladylego-pipe.jpg)

## Plus: `%>%`

*image courtesy [@LegoRLady](https://twitter.com/LEGO_RLady/status/986661916855754752)*

---
class: center, middle, inverse

# `%>%`

## The pipe

*"dataframe first, dataframe once"*

--

```r
library(dplyr)
```

--

RStudio Keyboard Shortcuts:

OSX: `CMD` + `SHIFT` + `M`

Else: `CTRL` + `SHIFT` + `M`

---
class: middle

*Nesting* a dataframe inside a function is hard to read.

```r
slice(heart, 1)
```

```
# A tibble: 1 x 10
  BPXSAR BPXDAR BPXDI1 BPXDI2 race_ethc    gender DR1TFOLA RIAGENDR BMXBMI
   <dbl>  <dbl>  <dbl>  <dbl> <chr>        <chr>     <dbl>    <dbl>  <dbl>
1   129.   50.7    48.    48. White non-H… Male       334.       1.   19.7
# ... with 1 more variable: RIDAGEYR <dbl>
```

--

Here, the "sentence" starts with a <font color="#ED1941">verb</font>.

--

<hr>

*Piping* a dataframe into a function lets you read left to right (L to R).

```r
heart %>% slice(1)
```

```
# A tibble: 1 x 10
  BPXSAR BPXDAR BPXDI1 BPXDI2 race_ethc    gender DR1TFOLA RIAGENDR BMXBMI
   <dbl>  <dbl>  <dbl>  <dbl> <chr>        <chr>     <dbl>    <dbl>  <dbl>
1   129.
50.7 48. 48. White non-H… Male 334. 1. 19.7
# ... with 1 more variable: RIDAGEYR <dbl>
```

--

Now, the "sentence" starts with a <font color="#ED1941">noun</font>.

---
class: middle

Sequences of functions make you read *inside out*

```r
slice(filter(heart, gender == "Male"), 1)
```

```
# A tibble: 1 x 10
  BPXSAR BPXDAR BPXDI1 BPXDI2 race_ethc    gender DR1TFOLA RIAGENDR BMXBMI
   <dbl>  <dbl>  <dbl>  <dbl> <chr>        <chr>     <dbl>    <dbl>  <dbl>
1   129.   50.7    48.    48. White non-H… Male       334.       1.   19.7
# ... with 1 more variable: RIDAGEYR <dbl>
```

--

<hr>

Chaining functions together lets you read *L to R*

```r
heart %>%
  filter(gender == "Male") %>%
  slice(1)
```

```
# A tibble: 1 x 10
  BPXSAR BPXDAR BPXDI1 BPXDI2 race_ethc    gender DR1TFOLA RIAGENDR BMXBMI
   <dbl>  <dbl>  <dbl>  <dbl> <chr>        <chr>     <dbl>    <dbl>  <dbl>
1   129.   50.7    48.    48. White non-H… Male       334.       1.   19.7
# ... with 1 more variable: RIDAGEYR <dbl>
```

---
class: inverse, middle, center

<img src="https://www.rstudio.com/wp-content/uploads/2014/04/magrittr.png" width="50%" style="display: block; margin: auto;" />

## "dataframe first, dataframe once"

---
class: middle

```r
heart %>%
  filter(gender == "Male") %>%
  slice(1)
```

```
# A tibble: 1 x 10
  BPXSAR BPXDAR BPXDI1 BPXDI2 race_ethc    gender DR1TFOLA RIAGENDR BMXBMI
   <dbl>  <dbl>  <dbl>  <dbl> <chr>        <chr>     <dbl>    <dbl>  <dbl>
1   129.   50.7    48.    48. White non-H… Male       334.       1.   19.7
# ... 
with 1 more variable: RIDAGEYR <dbl>
```

--

<hr>

This does the same thing:

```r
heart %>%
  filter(.data = ., gender == "Male") %>%
  slice(.data = ., 1)
```

--

<hr>

So does this:

```r
heart %>%
  filter(., gender == "Male") %>%
  slice(., 1)
```

---
class: middle, center, inverse

![](https://media.giphy.com/media/PD0V8W0JWcYPm/giphy.gif)

`attach()`

`heart$gender`

or other variants

---
class: middle, center, inverse

# ⌛️

## Let's review some helpful functions for `filter`

---
class: inverse, bottom, center
background-image: url("images/peapod.png")
background-size: 25%

## Base R + Tidyverse

---
class: middle, center, inverse

#💡

## First:

## Logical Operators

---

```r
?base::Logic
```

<table>
 <thead>
  <tr> <th style="text-align:left;"> Operator </th> <th style="text-align:left;"> Description </th> <th style="text-align:left;"> Usage </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> &amp; </td> <td style="text-align:left;"> and </td> <td style="text-align:left;"> x &amp; y </td> </tr>
  <tr> <td style="text-align:left;"> | </td> <td style="text-align:left;"> or </td> <td style="text-align:left;"> x | y </td> </tr>
  <tr> <td style="text-align:left;"> xor </td> <td style="text-align:left;"> exactly one of x or y </td> <td style="text-align:left;"> xor(x, y) </td> </tr>
  <tr> <td style="text-align:left;"> ! </td> <td style="text-align:left;"> not </td> <td style="text-align:left;"> !x </td> </tr>
</tbody>
</table>

---

Logical or (`|`) is inclusive, so `x | y` really means:

* x or
* y or
* both x & y

Exclusive or (`xor`) rules out the "both" case, so `xor(x, y)` really means:

* x or
* y...
* but not both x & y

```r
x <- c(0, 1, 0, 1)
y <- c(0, 0, 1, 1)
boolean_or <- x | y
exclusive_or <- xor(x, y)
cbind(x, y, boolean_or, exclusive_or)
```

```
     x y boolean_or exclusive_or
[1,] 0 0          0            0
[2,] 1 0          1            1
[3,] 0 1          1            1
[4,] 1 1          1            0
```

---
class: middle, center, inverse

#💡

## Second:

## Comparisons

---

```r
?Comparison
```

<table>
 <thead>
  <tr> <th style="text-align:left;"> Operator </th> <th style="text-align:left;"> Description </th> <th style="text-align:left;"> Usage </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> &lt; </td> <td style="text-align:left;"> less than </td> <td style="text-align:left;"> x &lt; y </td> </tr>
  <tr> <td style="text-align:left;"> &lt;= </td> <td style="text-align:left;"> less than or equal to </td> <td style="text-align:left;"> x &lt;= y </td> </tr>
  <tr> <td style="text-align:left;"> &gt; </td> <td style="text-align:left;"> greater than </td> <td style="text-align:left;"> x &gt; y </td> </tr>
  <tr> <td style="text-align:left;"> &gt;= </td> <td style="text-align:left;"> greater than or equal to </td> <td style="text-align:left;"> x &gt;= y </td> </tr>
  <tr> <td style="text-align:left;"> == </td> <td style="text-align:left;"> exactly equal to </td> <td style="text-align:left;"> x == y </td> </tr>
  <tr> <td style="text-align:left;"> != </td> <td style="text-align:left;"> not equal to </td> <td style="text-align:left;"> x != y </td> </tr>
  <tr> <td style="text-align:left;"> %in% </td> <td style="text-align:left;"> group membership* </td> <td style="text-align:left;"> x %in% y </td> </tr>
  <tr> <td style="text-align:left;"> is.na </td> <td style="text-align:left;"> is missing </td> <td style="text-align:left;"> is.na(x) </td> </tr>
  <tr> <td style="text-align:left;"> !is.na </td> <td style="text-align:left;"> is not missing </td> <td style="text-align:left;"> !is.na(x) </td> </tr>
</tbody>
</table>

*(shortcut to using `|` repeatedly with `==`)

---
class: middle, center, inverse

# ⌛️

## Let's review `mutate`

---

# 3 ways to `mutate`

1. 
<font color="#ED1941">Create a new variable with a specific value</font>
1. Create a new variable based on other variables
1. Change an existing variable

--

```r
heart_bp <- heart %>%
  select(BPXDI1, BPXDI2)
heart_bp %>%
  mutate(year = 2015)
```

```
# A tibble: 200 x 3
   BPXDI1 BPXDI2  year
    <dbl>  <dbl> <dbl>
 1    48.    48. 2015.
 2    76.    78. 2015.
 3    76.    76. 2015.
 4    64.    56. 2015.
 5    54.    56. 2015.
 6    80.    78. 2015.
 7    52.    NA  2015.
 8    NA     80. 2015.
 9    76.    NA  2015.
10    90.    80. 2015.
# ... with 190 more rows
```

---

# 3 ways to `mutate`

1. Create a new variable with a specific value
1. <font color="#ED1941">Create a new variable based on other variables</font>
1. Change an existing variable

--

```r
heart_bp %>%
  mutate(bp_ratio = BPXDI1 / BPXDI2)
```

```
# A tibble: 200 x 3
   BPXDI1 BPXDI2 bp_ratio
    <dbl>  <dbl>    <dbl>
 1    48.    48.    1.00
 2    76.    78.    0.974
 3    76.    76.    1.00
 4    64.    56.    1.14
 5    54.    56.    0.964
 6    80.    78.    1.03
 7    52.    NA    NA
 8    NA     80.   NA
 9    76.    NA    NA
10    90.    80.    1.12
# ... with 190 more rows
```

---

# 3 ways to `mutate`

1. Create a new variable with a specific value
1. Create a new variable based on other variables
1. <font color="#ED1941">Change an existing variable</font>

--

```r
# bp_ratio must exist before it can be changed, so create it first
heart_bp %>%
  mutate(bp_ratio = BPXDI1 / BPXDI2) %>%
  mutate(bp_ratio = bp_ratio * 100)
```

```
# A tibble: 200 x 3
   BPXDI1 BPXDI2 bp_ratio
    <dbl>  <dbl>    <dbl>
 1    48.    48.    100.
 2    76.    78.     97.4
 3    76.    76.    100.
 4    64.    56.    114.
 5    54.    56.     96.4
 6    80.    78.    103.
 7    52.    NA      NA
 8    NA     80.     NA
 9    76.    NA      NA
10    90.    80.    112.
# ... 
with 190 more rows
```

---
class: middle, inverse

## To make your `mutate` "stick"

Assign the result to a name with `<-`; otherwise it is only printed.

```r
heart_bp <- heart_bp %>%
  mutate(bp_ratio = BPXDI1 / BPXDI2)
```

---
class: middle, center, inverse

# ⌛️

## Let's review some helpful functions for `mutate`

---
class: inverse, bottom, center
background-image: url("images/peapod.png")
background-size: 25%

## Remember:

## Base R + Tidyverse

---
class: middle, center, inverse

#💡

## First:

## Arithmetic

*especially useful for* `mutate`

See: http://r4ds.had.co.nz/transform.html#mutate-funs

---

```r
?Arithmetic
```

<table>
 <thead>
  <tr> <th style="text-align:left;"> Operator </th> <th style="text-align:left;"> Description </th> <th style="text-align:left;"> Usage </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> + </td> <td style="text-align:left;"> addition </td> <td style="text-align:left;"> x + y </td> </tr>
  <tr> <td style="text-align:left;"> - </td> <td style="text-align:left;"> subtraction </td> <td style="text-align:left;"> x - y </td> </tr>
  <tr> <td style="text-align:left;"> * </td> <td style="text-align:left;"> multiplication </td> <td style="text-align:left;"> x * y </td> </tr>
  <tr> <td style="text-align:left;"> / </td> <td style="text-align:left;"> division </td> <td style="text-align:left;"> x / y </td> </tr>
  <tr> <td style="text-align:left;"> ^ </td> <td style="text-align:left;"> raised to the power of </td> <td style="text-align:left;"> x ^ y </td> </tr>
  <tr> <td style="text-align:left;"> abs </td> <td style="text-align:left;"> absolute value </td> <td style="text-align:left;"> abs(x) </td> </tr>
  <tr> <td style="text-align:left;"> %/% </td> <td style="text-align:left;"> integer division </td> <td style="text-align:left;"> x %/% y </td> </tr>
  <tr> <td style="text-align:left;"> %% </td> <td style="text-align:left;"> remainder after division </td> <td style="text-align:left;"> x %% y </td> </tr>
</tbody>
</table>

```r
5 %/% 2 # 2 goes into 5 two times with...
```

```
[1] 2
```

```r
5 %% 2 # 1 left over
```

```
[1] 1
```

---

## Chapter 2: Data Visualization

all `ggplot2`

- `aes(x = , y = )` (aesthetics)
- `aes(x = , y = , color = )` (add color)
- `aes(x = , y = , size = )` (add size)
- `+ facet_wrap(~ )` (faceting)

---

## Old School <sup>1</sup>

- Sketch the graphics below on paper, where the `x`-axis is variable `age_yrs` and the `y`-axis is variable `systolic_bp`

```
# A tibble: 4 x 4
  age_yrs systolic_bp bmi_z gender
    <dbl>       <dbl> <dbl> <chr>
1      8.         80.    1. male
2      9.         90.    2. male
3     10.        100.    3. female
4     11.        110.    4. female
```

<!-- Copy to chalkboard/whiteboard -->

1. A scatterplot
1. A scatterplot where the `color` of the points corresponds to `gender`
1. A scatterplot where the `size` of the points corresponds to `bmi_z`
1. A scatterplot faceted by `gender`

.footnote[
[1] Borrowed, with much appreciation, from [Chester Ismay](https://ismayc.github.io/talks/ness-infer/slide_deck.html)
]

---

## 1. A scatterplot

```r
library(ggplot2)
# nhanes: the four-row tibble printed on the "Old School" slide
ggplot(nhanes, aes(age_yrs, systolic_bp)) +
  geom_point()
```

--

<img src="cm012_files/figure-html/unnamed-chunk-29-1.png" width="65%" style="display: block; margin: auto;" />

---

## 2. `color` points by `gender`

```r
library(ggplot2)
ggplot(nhanes, aes(age_yrs, systolic_bp, color = gender)) +
  geom_point()
```

--

<img src="cm012_files/figure-html/unnamed-chunk-31-1.png" width="65%" style="display: block; margin: auto;" />

---

## 3. `size` points by `bmi_z`

```r
library(ggplot2)
ggplot(nhanes, aes(age_yrs, systolic_bp, size = bmi_z)) +
  geom_point()
```

--

<img src="cm012_files/figure-html/unnamed-chunk-33-1.png" width="65%" style="display: block; margin: auto;" />

---

## 4. 
`facet_wrap` by `gender`

```r
library(ggplot2)
ggplot(nhanes, aes(age_yrs, systolic_bp)) +
  geom_point() +
  facet_wrap(~gender)
```

--

<img src="cm012_files/figure-html/unnamed-chunk-35-1.png" width="65%" style="display: block; margin: auto;" />

---
class: middle, inverse, center

## Work your way through the lab worksheet now

View it [here](../labs/cm012-worksheet.html)

Download the file from Sakai to work on it locally

---
class: middle, inverse

# 🔨

## More Resources

- [RStudio `ggplot2` Cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)
- [RStudio `dplyr` Cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)
- [RStudio Base R Cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/base-r.pdf)
- [Alison's OHSU Data Jamboree using `ggplot2`](https://alison.rbind.io/talk/code-your-graph/)
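
---

## Recap: `filter` + `mutate` in one pipe

A minimal sketch that ties the `filter` and `mutate` helpers from these slides together. It assumes `dplyr` is installed, and it rebuilds the four-row tibble from the "Old School" slide by hand under the name `nhanes` (the name the plotting code uses). The column `bp_per_year` and the object names `with_in`, `with_or`, and `nhanes_female` are made up for illustration and are not part of the lab.

```r
library(dplyr)

# The four rows printed on the "Old School" slide, typed in by hand
nhanes <- tibble::tibble(
  age_yrs     = c(8, 9, 10, 11),
  systolic_bp = c(80, 90, 100, 110),
  bmi_z       = c(1, 2, 3, 4),
  gender      = c("male", "male", "female", "female")
)

# %in% is a shortcut for chaining several == tests with |
with_in <- nhanes %>% filter(age_yrs %in% c(8, 11))
with_or <- nhanes %>% filter(age_yrs == 8 | age_yrs == 11)

# xor() keeps rows where exactly one of the two conditions is TRUE
nhanes %>% filter(xor(gender == "male", systolic_bp >= 90))

# Dataframe first, dataframe once: filter, then mutate,
# then assign with <- so the new column "sticks"
nhanes_female <- nhanes %>%
  filter(gender == "female", !is.na(bmi_z)) %>%
  mutate(bp_per_year = systolic_bp / age_yrs)
nhanes_female
```

Try printing `with_in` and `with_or` side by side: both filters should keep the same rows, which is the sense in which `%in%` is a shortcut for repeated `==` with `|`.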