Overview

A complete knitted html file is due on Sakai by beginning of class Tuesday July 17th (2:30pm). In class on that Tuesday, you’ll pair up with another student and review each others’ reports. Your final lab is due on Sakai by beginning of class Thursday July 19th (2:30pm).

The goal is to carefully, thoroughly, and thoughtfully conduct an exploratory data analysis (EDA). You are also asked to communicate clearly about the steps in your EDA process with others, by sharing your R code, output, and narrative. As such, your code cannot “stand alone”- it is meant to complement / enhance / support your narrative.

Tools

You will use R Markdown to construct your EDA. You’ll submit your work as an html file knit from your .Rmd file (please leave the default code chunk options for eval = TRUE and echo = TRUE). In that document, you’ll use dplyr, tidyr, and ggplot2 to do description and visualization. You may also wish to use the janitor package to make tabyls, and some of the accompanying adorn functions.

Your lab should serve as your own personal cheatsheet in the future for things to do with a new dataset. Give yourself the cheatsheet you deserve! Remember:

  • rmarkdown should be your EDA documentation tool
  • your own words with markdown formatting are your ONLY narrative tool
  • dplyr should be your data manipulation tool
  • tidyr should be your data reshaping tool
  • janitor::tabyl should be your data table-making tool
  • you may wish to combine with knitr::kable() for formatting tables
  • ggplot2 should be your data visualization tool

For all things, graphical and tabular, if you’re dissatisfied with a result, discuss the problem, what you’ve tried and move on (remember my 30-minute rule). You’ll need this loaded at the top:

library(tidyverse) # all the good stuff
library(readxl) # for reading in xlsx files
library(here) # for here!
library(janitor) # for clean_names & tabyl
library(knitr) # for kable

The tidyverse meta-package includes dplyr, ggplot2, and tidyr.

Data

You will start with the reprohealth dataset, which is distributed as an R package from my personal GitHub.

Install it, and remember to use this code only in your R console, not in a script or .Rmd file:

install.packages("remotes") # install the remotes package
library(remotes) # load remotes package so you can install from github
install_github("apreshill/reprohealth") # install the package

Then at the top of your lab, copy and paste this code:

library(reprohealth) # load the package
wb_reprohealth # call the data
wb_stats <- wb_reprohealth %>% # save the data to your local environment
  mutate(year = parse_number(year)) # fix year

You can name wb_stats anything else you want, this is just an example.

Explore the wb_reprohealth data

Yes, this was Lab 04. You should already have the code and narrative to answer these questions. Use this opportunity to clean that up. This section should be clear, concise, and well-organized, since you already have the code and words in another .Rmd document that you can just copy/paste into this lab (right?). Can you make it better? Do a self-assessment here: what could be clearer about your code or your process that you can improve upon now?

  • How many variables/columns?
  • How many rows/observations?
  • Which variables are numbers?
  • Which are categorical variables (numeric or character variables with variables that have a fixed and known set of possible values; aka factor variables)?
  • Complete this sentence: “There is one row per…”

Explore a new dataset

Go to the gapminder site and find a new indicator that interests you.

  • Download the data file to a folder in your .Rproj directory, and open the .xlsx file to see where the data is (i.e., which sheet).

  • Read the data file into your EDA .Rmd document using the readxl package (remember, it must be installed first). Use the here package, with the here function, to build up the relative file path (example code-through). Here is an example code chunk:

internet <- read_xlsx(here::here("data", "Internet user per 100.xlsx"),
                      sheet = 1) %>% 
  clean_names() # highly recommended

internet
# A tibble: 275 x 23
   internet_users_… x1990 x1991 x1992 x1993    x1994    x1995    x1996
   <chr>            <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl>
 1 Abkhazia            NA    NA    NA    NA NA       NA       NA      
 2 Afghanistan          0    NA    NA    NA NA       NA       NA      
 3 Akrotiri and Dh…    NA    NA    NA    NA NA       NA       NA      
 4 Albania              0    NA    NA    NA NA        0.0112   3.22e-2
 5 Algeria              0    NA    NA    NA  3.61e-4  0.00177  1.74e-3
 6 American Samoa       0    NA    NA    NA NA       NA       NA      
 7 Andorra              0    NA    NA    NA NA       NA        1.53e+0
 8 Angola               0    NA    NA    NA NA       NA        7.76e-4
 9 Anguilla            NA    NA    NA    NA NA       NA       NA      
10 Antigua and Bar…     0    NA    NA    NA NA        2.20     2.86e+0
# ... with 265 more rows, and 15 more variables: x1997 <dbl>, x1998 <dbl>,
#   x1999 <dbl>, x2000 <dbl>, x2001 <dbl>, x2002 <dbl>, x2003 <dbl>,
#   x2004 <dbl>, x2005 <dbl>, x2006 <dbl>, x2007 <dbl>, x2008 <dbl>,
#   x2009 <dbl>, x2010 <dbl>, x2011 <dbl>
  • Answer the same 5 questions as you did above for the reprohealth data.
  • Gather the data (most data from gapminder has columns in years).
internet_tidy <- internet %>%
  rename(country = 1) %>% 
  gather(year, users_per_100, -country) %>% 
  mutate(year = readr::parse_number(year)) 
# another example dataset
age_marriage <- read_xlsx(here::here("data", "indicator age of marriage.xlsx"),
                      sheet = 1) 

age_marriage
# A tibble: 185 x 117
   ..1   `1616.0` `1666.0` `1685.0` `1710.0` `1716.0` `1735.0` `1760.0`
   <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 Afgh…       NA       NA       NA       NA       NA       NA       NA
 2 Alba…       NA       NA       NA       NA       NA       NA       NA
 3 Alge…       NA       NA       NA       NA       NA       NA       NA
 4 Ango…       NA       NA       NA       NA       NA       NA       NA
 5 Arge…       NA       NA       NA       NA       NA       NA       NA
 6 Arme…       NA       NA       NA       NA       NA       NA       NA
 7 Aust…       NA       NA       NA       NA       NA       NA       NA
 8 Aust…       NA       NA       NA       NA       NA       NA       NA
 9 Azer…       NA       NA       NA       NA       NA       NA       NA
10 Baha…       NA       NA       NA       NA       NA       NA       NA
# ... with 175 more rows, and 109 more variables: `1766.0` <dbl>,
#   `1775.0` <dbl>, `1780.0` <dbl>, `1785.0` <dbl>, `1791.0` <dbl>,
#   `1800.0` <dbl>, `1810.0` <dbl>, `1815.0` <dbl>, `1825.0` <dbl>,
#   `1835.0` <dbl>, `1840.0` <dbl>, `1845.0` <dbl>, `1855.0` <dbl>,
#   `1860.0` <dbl>, `1865.0` <dbl>, `1866.0` <dbl>, `1870.0` <dbl>,
#   `1875.0` <dbl>, `1879.0` <dbl>, `1880.0` <dbl>, `1885.0` <dbl>,
#   `1887.0` <dbl>, `1890.0` <dbl>, `1895.0` <dbl>, `1897.0` <dbl>,
#   `1900.0` <dbl>, `1901.0` <dbl>, `1903.0` <dbl>, `1905.0` <dbl>,
#   `1906.0` <dbl>, `1907.0` <dbl>, `1910.0` <dbl>, `1911.0` <dbl>,
#   `1915.0` <dbl>, `1920.0` <dbl>, `1921.0` <dbl>, `1925.0` <dbl>,
#   `1928.0` <dbl>, `1930.0` <dbl>, `1931.0` <dbl>, `1935.0` <dbl>,
#   `1937.0` <dbl>, `1939.0` <dbl>, `1940.0` <dbl>, `1941.0` <dbl>,
#   `1942.0` <dbl>, `1943.0` <dbl>, `1944.0` <dbl>, `1945.0` <dbl>,
#   `1946.0` <dbl>, `1947.0` <dbl>, `1948.0` <dbl>, `1949.0` <dbl>,
#   `1950.0` <dbl>, `1951.0` <dbl>, `1952.0` <dbl>, `1953.0` <dbl>,
#   `1954.0` <dbl>, `1955.0` <dbl>, `1956.0` <dbl>, `1957.0` <dbl>,
#   `1958.0` <dbl>, `1959.0` <dbl>, `1960.0` <dbl>, `1961.0` <dbl>,
#   `1962.0` <dbl>, `1963.0` <dbl>, `1964.0` <dbl>, `1965.0` <dbl>,
#   `1966.0` <dbl>, `1967.0` <dbl>, `1968.0` <dbl>, `1969.0` <dbl>,
#   `1970.0` <dbl>, `1971.0` <dbl>, `1972.0` <dbl>, `1973.0` <dbl>,
#   `1974.0` <dbl>, `1975.0` <dbl>, `1976.0` <dbl>, `1977.0` <dbl>,
#   `1978.0` <dbl>, `1979.0` <dbl>, `1980.0` <dbl>, `1981.0` <dbl>,
#   `1982.0` <dbl>, `1983.0` <dbl>, `1984.0` <dbl>, `1985.0` <dbl>,
#   `1986.0` <dbl>, `1987.0` <dbl>, `1988.0` <dbl>, `1989.0` <dbl>,
#   `1990.0` <dbl>, `1991.0` <dbl>, `1992.0` <dbl>, `1993.0` <dbl>,
#   `1994.0` <dbl>, `1995.0` <dbl>, `1996.0` <dbl>, …
marriage_tidy <- age_marriage %>% 
  rename(country = 1) %>% 
  gather(year, age_first_marriage, -country) %>% 
  mutate(year = readr::parse_number(year))
marriage_tidy
# A tibble: 21,460 x 3
   country      year age_first_marriage
   <chr>       <dbl>              <dbl>
 1 Afghanistan  1616                 NA
 2 Albania      1616                 NA
 3 Algeria      1616                 NA
 4 Angola       1616                 NA
 5 Argentina    1616                 NA
 6 Armenia      1616                 NA
 7 Australia    1616                 NA
 8 Austria      1616                 NA
 9 Azerbaijan   1616                 NA
10 Bahamas      1616                 NA
# ... with 21,450 more rows

Pathological in the same way as the previous…

# yet another example dataset
alcohol <- read_xlsx(here::here("data", "indicator alcohol consumption  20100830.xlsx"),
                      sheet = 1) 

alcohol
# A tibble: 189 x 25
   ..1   `1985.0` `1986.0` `1987.0` `1988.0` `1989.0` `1990.0` `1991.0`
   <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 Afgh…       NA       NA       NA       NA       NA       NA       NA
 2 Alba…       NA       NA       NA       NA       NA       NA       NA
 3 Alge…       NA       NA       NA       NA       NA       NA       NA
 4 Ando…       NA       NA       NA       NA       NA       NA       NA
 5 Ango…       NA       NA       NA       NA       NA       NA       NA
 6 Anti…       NA       NA       NA       NA       NA       NA       NA
 7 Arge…       NA       NA       NA       NA       NA       NA       NA
 8 Arme…       NA       NA       NA       NA       NA       NA       NA
 9 Aust…       NA       NA       NA       NA       NA       NA       NA
10 Aust…       NA       NA       NA       NA       NA       NA       NA
# ... with 179 more rows, and 17 more variables: `1992.0` <dbl>,
#   `1993.0` <dbl>, `1994.0` <dbl>, `1995.0` <dbl>, `1996.0` <dbl>,
#   `1997.0` <lgl>, `1998.0` <dbl>, `1999.0` <lgl>, `2000.0` <dbl>,
#   `2001.0` <dbl>, `2002.0` <dbl>, `2003.0` <dbl>, `2004.0` <dbl>,
#   `2005.0` <dbl>, `2006.0` <dbl>, `2007.0` <dbl>, `2008.0` <dbl>
alc_tidy <- alcohol %>% 
  rename(country = 1) %>% 
  gather(year, alc_per_adult, -country) %>% 
  mutate(year = readr::parse_number(year))
alc_tidy
# A tibble: 4,536 x 3
   country              year alc_per_adult
   <chr>               <dbl>         <dbl>
 1 Afghanistan          1985            NA
 2 Albania              1985            NA
 3 Algeria              1985            NA
 4 Andorra              1985            NA
 5 Angola               1985            NA
 6 Antigua and Barbuda  1985            NA
 7 Argentina            1985            NA
 8 Armenia              1985            NA
 9 Australia            1985            NA
10 Austria              1985            NA
# ... with 4,526 more rows
  • Join this new data using dplyr with the reprohealth data you have been working with, and make some observations about the process and result. You may need:
# let's do this
int_repro <- wb_stats %>% 
  left_join(internet_tidy)
agem_repro <- wb_stats %>% 
  left_join(marriage_tidy)
alc_repro <- wb_stats %>% 
  left_join(alc_tidy)

Explore questions

In the R4DS Exploratory Data Analysis chapter, the authors say:

“Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation…EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions.”

Your mission in this last section is to pick at least three of the tasks below and attack each with a table and figure. In your narrative, reword each “task” into a logical research question(s). Make observations about what your tables/figures show and about the process. If you want to do something comparable but different, i.e. swap one quantitative variable for another, be our guest! If you are feeling inspired and curious, then we’re doing this right. Go for it.

Task menu

  • Get the maximum and minimum of children per woman (tot_fertility) for all continents.

  • Look at the spread of children per woman (tot_fertility) across countries within the continents.

  • Compute a trimmed mean of maternal mortality (mat_mortality) for different years. Or a weighted mean, weighting by population. Just try something other than the plain vanilla mean.

  • How does maternal mortality (mat_mortality) vary across different continents?

  • Report the absolute and/or relative abundance of countries with low maternal mortality (mat_mortality) over time by continent: Compute some measure of worldwide maternal mortality - you decide - a mean or median or some other quantile or perhaps your current age. Then determine how many countries on each continent have a maternal mortality rate less than this benchmark, for each year.

  • Find countries with interesting stories. Open-ended and, therefore, hard. Promising but unsuccessful attempts are encouraged. This will generate interesting questions to follow up on in class.

  • Make up your own! Between the coverage in class, readings, and DataCamp plus the list above, you get the idea.

Companion graphs

For each table, make sure to include a relevant figure. One tip for starting is to draw out on paper what you want your x- and y-axis to be first and what your geom is; that is, start by drawing the plot you want ggplot to give you.

Your figure does not have to depict every last number from the data aggregation result. Use your judgement. It just needs to complement the table, add context, and allow for some sanity checking both ways.

Notice which figures are easy/hard to make, which data formats make better inputs for plotting functions vs. for human-friendly tables.

Report your process

You’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc. Give credit to your sources, whether it’s a blog post, a fellow student, an online tutorial, etc.

Grading

This lab is worth 5 points total, scored as follows:

2 points for your initial EDA:

  • 2 (Strong attempt): EDA reflects strong independent problem solving, with clearly thought out attempts to approach questions and problems, and a diligent and honest effort to answer questions and find the solutions.

  • 1 (Adequate attempt): EDA reflects some attempt to approach the posed tasks, but approach appears to be superficial and lacks depth of analysis. No obvious mistakes. Pleasant to read. No head scratchers. Solid and complete.

  • 0 (No attempt or incomplete): Didn’t tackle all sections. Or didn’t make companion graphs. Or didn’t interpret anything but left it all to the “reader”. Or more than one technical problem that is relatively easy to fix.

3 points for the quality of the final EDA:

  • 3 (Exceptional): EDA is thorough, concise, and clearly demonstrates ability to competently and thoughtfully work with data as well as how to report on that process as a complement to code. Impeccable organization and presentation in the report.

  • 2 (Adequate): Hits all the elements in all sections. No obvious mistakes. Pleasant to read. No head scratchers. Solid and complete.

  • 1 (Inadequate): EDA attempts to address question with substantial inaccuracies in analysis and/or interpretation. Didn’t tackle all sections. Or didn’t make companion graphs. Or didn’t interpret anything but left it all to the “reader”. Or more than one technical problem that is relatively easy to fix.

  • 0 (Insufficient): Nothing to grade, assignment was late.

Creative Commons License