html
file is due on Sakai by beginning of class Tuesday July 31th (2:30pm).We’ll work with data from this 538 article. In the article, the authors describe collecting data on key socioeconomic factors for each state, including indicators for:
In this lab, we’ll use a subset of these variables to predict hate crimes in the US. There are two possible outcome variables here: (1) pre-election data from the FBI, and (2) post-election data from the Southern Poverty Law Center. We’ll focus on the pre-election data in this lab.
This data is included in the fivethirtyeight
package in the hate_crimes
data frame, which we’ll refer to as the “Hate crimes” dataset. You can use ?hate_crimes
to read more about it and the variables.
You’ll need to load these packages to do this lab:
library(fivethirtyeight) # new to you!
library(moderndive)
library(skimr)
library(tidyverse)
library(GGally) # new to you!
We’ll use hate_crimes
to demonstrate multiple regression with:
avg_hatecrimes_per_100k_fbi
)share_pop_hs
)gini_index
)share_vote_trump
). At a later stage, we’ll convert this variable to a factor.Recall that a new exploratory data analysis involves three things:
General functions we use below- add narrative to interpret each!:
dplyr::glimpse()
skimr::skim()
ggplot2::ggplot()
geom_histogram()
or geom_density()
for numeric continuous variablesgeom_bar()
or geom_col()
for categorical variablesAt this stage, you may also find your want to use filter
, mutate
, arrange
, select
, or count
. Let your questions lead you! Feel free to add onto the EDA that follows.
glimpse(hate_crimes)
Observations: 51
Variables: 12
$ state <chr> "Alabama", "Alaska", "Arizona", "A...
$ median_house_inc <int> 42278, 67629, 49254, 44922, 60487,...
$ share_unemp_seas <dbl> 0.060, 0.064, 0.063, 0.052, 0.059,...
$ share_pop_metro <dbl> 0.64, 0.63, 0.90, 0.69, 0.97, 0.80...
$ share_pop_hs <dbl> 0.821, 0.914, 0.842, 0.824, 0.806,...
$ share_non_citizen <dbl> 0.02, 0.04, 0.10, 0.04, 0.13, 0.06...
$ share_white_poverty <dbl> 0.12, 0.06, 0.09, 0.12, 0.09, 0.07...
$ gini_index <dbl> 0.472, 0.422, 0.455, 0.458, 0.471,...
$ share_non_white <dbl> 0.35, 0.42, 0.49, 0.26, 0.61, 0.31...
$ share_vote_trump <dbl> 0.63, 0.53, 0.50, 0.60, 0.33, 0.44...
$ hate_crimes_per_100k_splc <dbl> 0.12583893, 0.14374012, 0.22531995...
$ avg_hatecrimes_per_100k_fbi <dbl> 1.8064105, 1.6567001, 3.4139280, 0...
hate_crimes %>%
count(state, sort = TRUE)
# A tibble: 51 x 2
state n
<chr> <int>
1 Alabama 1
2 Alaska 1
3 Arizona 1
4 Arkansas 1
5 California 1
6 Colorado 1
7 Connecticut 1
8 Delaware 1
9 District of Columbia 1
10 Florida 1
# ... with 41 more rows
Let’s select just the variables we need first.
hate_demo <- hate_crimes %>%
select(state, avg_hatecrimes_per_100k_fbi, share_pop_hs, gini_index,
share_vote_trump)
Following the narrative in ModernDive, write a few sentences describing the output here.
skim(hate_demo)
Skim summary statistics
n obs: 51
n variables: 5
Variable type: character
variable missing complete n min max empty n_unique
state 0 51 51 4 20 0 51
Variable type: numeric
variable missing complete n mean sd p0 p25 p50
avg_hatecrimes_per_100k_fbi 1 50 51 2.37 1.71 0.27 1.29 1.99
gini_index 0 51 51 0.45 0.021 0.42 0.44 0.45
share_pop_hs 0 51 51 0.87 0.034 0.8 0.84 0.87
share_vote_trump 0 51 51 0.49 0.12 0.04 0.41 0.49
p75 p100 hist
3.18 10.95 ▇▇▅▁▁▁▁▁
0.47 0.53 ▅▅▇▇▁▁▁▁
0.9 0.92 ▂▅▅▃▃▅▇▆
0.57 0.7 ▁▁▁▃▇▇▆▃
First let’s look at the outcome variable:
# Density of hate crimes (DV):
ggplot(hate_demo, aes(x = avg_hatecrimes_per_100k_fbi)) +
geom_density() +
labs(x = "", title = "Hate Crimes")
Next we’ll look at our three explanatory variables as continuous:
# Histogram of share_pop_hs (IV):
ggplot(hate_demo, aes(x = share_pop_hs)) +
geom_density() +
labs(x = "", title = "HS")
# Histogram of gini (IV):
ggplot(hate_demo, aes(x = gini_index)) +
geom_density() +
labs(x = "", title = "Gini")
# Histogram of trump (IV):
ggplot(hate_demo, aes(x = share_vote_trump)) +
geom_density() +
labs(x = "", title = "Trump")
Let’s make share_vote_trump
a categorical variable:
hate_demo <- hate_demo %>%
mutate(
cat_trump = case_when(
share_vote_trump < .5 ~ "less than half",
TRUE ~ "more than half"
)) %>%
mutate(cat_trump = as.factor(cat_trump)) %>%
select(-share_vote_trump)
Following the narrative in ModernDive, write a few sentences describing the output here.
Part I of this EDA was univariate in nature in that we only considered one variable at a time. The goal of modeling, however, is to explore relationships between variables. Specifically, we care about bivariate relationships between pairs of variables. But with 1 outcome and 3 explanatory variables, that means we have \(3 \times 2 = 6\) correlations to compute.
For simple regression, we calculated correlation coefficients between the outcome and explanatory variables. For multiple regression, your EDA should involve multiple correlation coefficients. We’ll use the cor()
function to do this. You’ll want to first select
only numeric variables first.
Use this code as an example:
data %>%
select(-my_char_var, -my_factor_var) %>%
cor()
To produce this output:
avg_hatecrimes_per_100k_fbi share_pop_hs
avg_hatecrimes_per_100k_fbi 1 NA
share_pop_hs NA 1.0000000
gini_index NA -0.5920518
gini_index
avg_hatecrimes_per_100k_fbi NA
share_pop_hs -0.5920518
gini_index 1.0000000
Lots of NA
correlations though! Try this code instead:
data %>%
select(-my_char_var, -my_factor_var) %>%
cor(., use = "pairwise.complete.obs")
You could do the same thing in the corrr
package, using the correlate
function:
library(corrr)
data %>%
select(-my_char_var, -my_factor_var) %>%
correlate()
We also want to create scatterplots to see the association between each pair of variables in the model (both between the explanatory and the outcome, but also between all explanatory variables with each other). Let’s start with the gini_index
:
We could actually do all of this with one function! We’ll use GGally::ggpairs()
to create a pairwise comparison of multivariate data. This includes what is known as a “Generalized Pairs Plot,” which is an improved version of a scatterplot matrix. This function provides two different comparisons of each pair of columns, and displays either the density (continuous numeric) or count (factors) of the respective variable along the diagonal. You can read more about the function and package here.
There are three pieces to the output: lower
, upper
, and diag
. Read more about the sections of the matrix here.
Here is how you can use the function:
data %>%
select(-my_char_var) %>%
ggpairs()
And here is some demo output of how to use it using a dataset called tips
:
data(tips, package = "reshape")
tips %>%
select(total_bill, time, tip) %>%
ggpairs()
And it builds from ggplot2
, so you can add aesthetic mappings for color, etc. Hurray!
tips %>%
ggpairs(., aes(color = sex),
columns = c("total_bill", "time", "tip"))
Calculate all the correlation coefficients. Make a
ggpairs
plot between all explanatory and outcome variables. Following this narrative in ModernDive, write a few sentences describing the output here.
Do the following:
+
(gini_index
and share_pop_hs
)gini_index
and cat_trump
)gini_index
and cat_trump
)*
(gini_index
and share_pop_hs
)\[b_{x_1} = r_{x_1~y} ~\frac{s_y}{s_{x_1}}\]
avg_hatecrimes_per_100k_fbi
goes on the z-axis (vertical axis)share_vote_hs
is on of the floor axes.gini_index
is on the other floor axis.library(plotly)
dim_scatter <- plot_ly(hate_demo,
x = ~share_pop_hs,
y = ~gini_index,
z = ~avg_hatecrimes_per_100k_fbi) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'HS'),
yaxis = list(title = 'Gini'),
zaxis = list(title = 'Hate Crimes')))
dim_scatter