The goal of this lab is to carefully, thoroughly, and thoughtfully compare two group means and understand the results. You are also asked to communicate clearly about the steps in your analysis process with others by sharing your R code, output, and narrative. As such, your code cannot "stand alone": it is meant to complement, enhance, and support your narrative. As with our previous integrative lab, this lab will be due in two stages:
1. Your html file is due on Sakai by Thursday, August 30th at 2:30pm.
2. Using the key, your self-assessment should include even more narrative; where you made mistakes, you must discuss and analyze where you went wrong, and correct them without copying/pasting directly from the key (this typically means that you need to include more narrative than we provide in the key). A good self-assessment will include:
B.5.5 Traditional methods
Please install a new version of infer first:
remotes::install_github("tidymodels/infer", ref = "develop")
library(tidyverse)
library(infer)
library(skimr)
library(broom)
We are going to work with a dataset called fishermen_mercury.csv, which consists of factors related to mercury levels among fishermen and a control group of non-fishermen. The data are published within a paper available online; here is the citation for the peer-reviewed publication:
N.B. Al-Majed and M.R. Preston (2000). "Factors Influencing the Total Mercury and Methyl Mercury in the Hair of Fishermen in Kuwait," Environmental Pollution, Vol. 109, pp. 239-250.
mercury <- read_csv(
  here::here("data", "fishermen_mercury.csv"),
  col_types = cols(fisherman = col_factor(levels = NULL))
)
Variables in this dataset include:

- fisherman
- age
- restime
- height
- weight
- fishmlwk
- methyl_mercury
- total_mercury

Do fishermen have different levels of total mercury than non-fishermen?
Explore the fisherman and total_mercury variables. Recall that an exploratory data analysis involves three things: looking at the raw values, computing summary statistics, and creating visualizations.
General functions that we have used at this stage:

- dplyr::glimpse()
- skimr::skim() (try it after a dplyr::group_by() first)
- ggplot2::ggplot()
  - geom_histogram() or geom_density() or geom_boxplot() for numeric continuous variables
  - facet_wrap()
  - geom_bar() or geom_col() for categorical variables

You may also find you want to use filter, mutate, arrange, select, or count. Let your questions lead you!
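As a starting point, here is a minimal EDA sketch using the functions above (the column names follow the dataset as read in earlier):

glimpse(mercury) # raw values and structure

mercury %>%
  group_by(fisherman) %>%
  skim(total_mercury) # summary statistics by group

ggplot(mercury, aes(x = fisherman, y = total_mercury)) +
  geom_boxplot() # distribution of total mercury by group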
Remember that in order to use the short-cut (formula-based, theoretical) approach, we need to check that some conditions are met; the infer package does not automatically check conditions for the theoretical methods to work. In order for the results of the \(t\)-test to be valid, three conditions must be met:

1. The observations are independent, both within and between the two samples.
2. The response variable is nearly normally distributed in each group, or both sample sizes are large enough.
3. The two groups have roughly equal variances (this is what lets us pool the standard deviations).
Write a few sentences to address each of these conditions.
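A hedged starting point for checking these conditions with the EDA tools above (adapt as your questions demand):

mercury %>%
  count(fisherman) # group sizes, relevant to the normality/sample-size condition

ggplot(mercury, aes(x = total_mercury)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ fisherman) # distribution shape within each group

mercury %>%
  group_by(fisherman) %>%
  summarize(sd_tm = sd(total_mercury)) # group SDs, relevant to the equal-variance condition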
The following are excerpts from ModernDive:
"The most common normalization is known as the \(z\)-score. The formula for a \(z\)-score is \[Z = \frac{x - \mu}{\sigma},\] where \(x\) represent the value of a variable, \(\mu\) represents the mean of the variable, and \(\sigma\) represents the standard deviation of the variable. Thus, if your variable has 10 elements, each one has a corresponding \(z\)-score that gives how many standard deviations away that value is from its mean. \(z\)-scores are normally distributed with mean 0 and standard deviation 1. They have the common, bell-shaped pattern seen below."
"Recall, that we hardly ever know the mean and standard deviation of the population of interest. This is almost always the case when considering the means of two independent groups. To help account for us not knowing the population parameter values, we can use the sample statistics instead, but this comes with a bit of a price in terms of complexity."
"Another form of normalization occurs when we need to use the sample standard deviations as estimates for the unknown population standard deviations. This normalization is often called the \(t\)-score."
For the two independent samples case like what we have for comparing fishermen to non-fishermen, the formula is \[{\displaystyle T={\frac {({\bar {x}}_{1}-{\bar {x}}_{2})- (\mu_1-\mu_2)}{s_{p}\times {\sqrt {{\frac {1}{n_{1}}}+{\frac {1}{n_{2}}}}}}}}\]
where:

- \(\bar{x}_1\) and \(\bar{x}_2\) are the two sample means,
- \(\mu_1 - \mu_2\) is the difference in population means under the null hypothesis,
- \(s_p\) is the pooled standard deviation (defined below), and
- \(n_1\) and \(n_2\) are the two sample sizes.
From the Minitab blog:
"A t-value of 0 indicates that the sample results exactly equal the null hypothesis. As the difference between the sample data and the null hypothesis increases, the absolute value of the t-value increases."
Note that the quantity \((\mu_1 - \mu_2)\) is typically 0 under the null hypothesis, so the actual computational formula for the \(t\)-statistic is:
\[{\displaystyle T={\frac {{\bar {x}}_{1}-{\bar {x}}_{2}}{s_{p}\times {\sqrt {{\frac {1}{n_{1}}}+{\frac {1}{n_{2}}}}}}}}\]
where...
\[{\displaystyle s_{p}= {\sqrt {\frac {\left(n_{1}-1\right)s_{1}^{2}+\left(n_{2}-1\right)s_{2}^{2}}{n_{1}+n_{2}-2}}}}\]
where \(s_1\) and \(s_2\) are the sample standard deviations of the two groups, and \(n_1\) and \(n_2\) are the sample sizes.
The \(s_p\) here is called the "pooled standard deviation." Now that we have two samples (in our case, fishermen and non-fishermen), we have two standard deviations. Recall that in order to calculate the standard error of one mean we divided the sample standard deviation by the square root of \(N\). The pooled standard error is the same idea, but after combining the standard deviations from two samples. The calculation is important, as it assumes that there is not a difference in the standard deviations between the two groups, and hence they can be pooled. Remember that \(s^2\) is the squared standard deviation, which we know is the variance (i.e., var(x1) = (sd(x1))^2 and sd(x1) = sqrt(var(x1))).
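As a quick sanity check of these identities and of the pooled formula, here is a sketch with two hypothetical vectors x1 and x2 (not the mercury data):

x1 <- rnorm(30, mean = 5, sd = 2) # hypothetical group 1
x2 <- rnorm(25, mean = 4, sd = 2) # hypothetical group 2

all.equal(var(x1), sd(x1)^2) # TRUE: the variance is the squared standard deviation

n1 <- length(x1)
n2 <- length(x2)
pooled_sd <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))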
Given the above formulas, answer the following questions using R code plus narrative:
What is the pooled standard error for the total_mercury variable? (Try to make all quantities here defined variables using dplyr::group_by() %>% summarize() %>% filter() %>% pull().)
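One possible scaffold for this, assuming the fisherman factor has levels "0" and "1" (the object names are just suggestions):

group_stats <- mercury %>%
  group_by(fisherman) %>%
  summarize(n = n(), s = sd(total_mercury))

n1 <- group_stats %>% filter(fisherman == "1") %>% pull(n)
s1 <- group_stats %>% filter(fisherman == "1") %>% pull(s)
n2 <- group_stats %>% filter(fisherman == "0") %>% pull(n)
s2 <- group_stats %>% filter(fisherman == "0") %>% pull(s)

s_p <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)) # pooled SD
se_pooled <- s_p * sqrt(1 / n1 + 1 / n2) # pooled standard error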
What are the minimum \(t\)-statistic values we need to conclude that there is a significant difference at \(\alpha = .05\) (2-tailed)? (hint: degrees of freedom for a two-sample t-test are \(N\) - 2)
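If you get stuck, qt() returns theoretical \(t\)-quantiles; a sketch, reusing n1 and n2 from above:

alpha <- 0.05
upper_tcrit <- qt(1 - alpha / 2, df = n1 + n2 - 2) # upper critical value
lower_tcrit <- -upper_tcrit # the t-distribution is symmetric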
Use the following code to plot these critical t-values.
upper_tcrit <- # fill in here

mercury %>%
  specify(total_mercury ~ fisherman) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "t", order = c(1, 0)) %>%
  visualize(method = "theoretical",
            obs_stat = upper_tcrit,
            direction = "both") # gives us shading
\[{\displaystyle T \times {s_{p}\times {\sqrt {{\frac {1}{n_{1}}}+{\frac {1}{n_{2}}}}}}}={{\bar {x}}_{1}-{\bar {x}}_{2}}\]
With \(n_1\) fisherman and \(n_2\) non-fishermen, given the variability in total mercury present in this sample, we will reject the null hypothesis that there is no difference in total mercury levels between the two groups if we obtain a \(t\)-statistic greater than X (absolute value, \(\alpha\) = .05, 2-tailed). This is equivalent to saying we will reject the null hypothesis if we obtain a mean difference greater than X (absolute value, \(\alpha\) = .05, 2-tailed).
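Rearranged as above, the critical \(t\)-value converts into a minimum mean difference; a sketch reusing the earlier objects:

min_mean_diff <- upper_tcrit * se_pooled # smallest significant |mean difference|
min_mean_diff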
Answer the following questions using R code and narrative:
What mean difference (raw) did we observe? Is it greater than the minimum mean difference we need to beat? (Hint: try infer::specify then calculate with stat = "diff in means".)
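A sketch of that hinted pipeline (order = c(1, 0) matches the usage elsewhere in this lab):

obs_diff <- mercury %>%
  specify(total_mercury ~ fisherman) %>%
  calculate(stat = "diff in means", order = c(1, 0))
obs_diff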
Calculate the \(t\)-statistic "by hand" using R. Is it greater than the minimum \(t\)-statistic we need to beat? (here is the formula again):
\[{\displaystyle T={\frac {{\bar {x}}_{1}-{\bar {x}}_{2}}{s_{p}\times {\sqrt {{\frac {1}{n_{1}}}+{\frac {1}{n_{2}}}}}}}}\]
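One way to compute this, reusing objects from the sketches above (obs_diff is an infer output, so pull out its stat column first):

t_by_hand <- pull(obs_diff, stat) / se_pooled
t_by_hand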
Now run the same test using infer; do the results match your calculations?

mercury %>%
  t_test(total_mercury ~ fisherman, order = c(1, 0), var.equal = TRUE)
Interpret this output. What is the statistic? What is the meaning/interpretation of the confidence interval here? (hint: Does it include zero? What does it mean if it does or does not? What value is in the center of the interval?)
Use the code below to compare the normal distribution to the \(t\)-distribution, which has only one parameter: degrees of freedom (df). You can use the dt and dnorm functions with the stat_function layer in ggplot to plot the densities for each distribution. Change the df argument here a few times to view its effect: what do you see? (You don't need to print any of these plots to your final file; I want you to reflect on how the two distributions differ.)
df <- 10
ggplot(data.frame(x = c(-4, 4)), aes(x)) +
stat_function(fun = dt, args = list(df = df)) + # t dist
stat_function(fun = dnorm, lty = 3, color = "red") # normal dist in red
Use infer to make a plot of the null \(t\)-distribution with a red line for your obs_stat, shading in both directions. Try adding geom_vline() to this object so you can add vertical lines where your two "critical t-values" are.

But how can we calculate the p-value from this? Well, the easy way is with your infer output :) But! Remember that your observed \(t\)-statistic is a quantile value for a statistic that we assume follows a \(t\)-distribution. How do we calculate the probability of getting a \(t\)-statistic as or more extreme than the one we got? We use pt()!
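For example, a sketch (with t_by_hand from above standing in for your observed \(t\)-statistic):

t_obs <- t_by_hand # or the statistic from your t_test() output
2 * pt(abs(t_obs), df = n1 + n2 - 2, lower.tail = FALSE) # two-sided p-value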
Use pt() to calculate the 2-sided p-value in R. Does it match the output from infer?

Unfortunately, we currently can't get the confidence interval for theoretical methods using infer. But we can use formulas (yay):
\[ ({\bar {x}}_{1}-{\bar {x}}_{2}) \pm (t_{\star} \times SE_{{\bar {x}}_{1}-{\bar {x}}_{2}})\] where...
\[SE_{{\bar {x}}_{1}-{\bar {x}}_{2}} = s_{p} \times \sqrt {{\frac {1}{n_{1}}}+{\frac {1}{n_{2}}}}\] You should already have variables for the mean difference and the \(t\)-statistic critical value(s) (here, \(t_{\star}\)) from the top formula. In the bottom formula, that standard error should look familiar: it is the standard error of the difference between means. This was the denominator in your calculated \(t\)-statistic.
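Putting the pieces together, a sketch of the interval using the earlier objects:

mean_diff <- pull(obs_diff, stat) # observed difference in sample means
mean_diff + c(-1, 1) * upper_tcrit * se_pooled # lower and upper CI bounds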
Calculate this confidence interval by hand using R. Does it match the output from infer? Does it contain zero? Will this interval ever not contain the observed difference in sample means?

Using ModernDive 10.7 as an example, as well as our previous resampling lab, conduct a permutation test to evaluate whether there is a mean difference in total mercury between fishermen and non-fishermen. What do you conclude, and how do these results compare to those based on the classical hypothesis test? What, if any, assumptions of each method are you comfortable or uncomfortable with?
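A sketch of the permutation test with infer (the number of reps is arbitrary; get_p_value() may be spelled get_pvalue() in older versions of the package):

set.seed(123) # for reproducibility

null_dist <- mercury %>%
  specify(total_mercury ~ fisherman) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c(1, 0))

null_dist %>%
  visualize(obs_stat = obs_diff, direction = "both") # permutation null distribution

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "both") # permutation p-value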
Use a simple linear regression to analyze total_mercury ~ fisherman. Look carefully at the output, and compare it to the output of the classical t-test. What do you notice is the same?
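A sketch, using broom (loaded above) to tidy the output; the \(t\)-statistic and p-value for the fisherman coefficient should line up with the pooled two-sample \(t\)-test (the sign may flip depending on the level ordering):

lm(total_mercury ~ fisherman, data = mercury) %>%
  tidy()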