Overview

A complete knitted html file is due on Sakai by beginning of class Tuesday July 10th (2:30pm).

The goal is to explore a new-to-you dataset. In particular, to begin to establish a workflow for data frames or “tibbles”. You will use dplyr and ggplot2 to do some description and visualization.

Your homework should serve as your own personal cheatsheet in the future for things to do with a new dataset. Give yourself the cheatsheet you deserve! You’ll submit your work as an html file knit from your .Rmd file. Remember:

  • dplyr should be your data manipulation tool
  • ggplot2 should be your visualization tool

Relax about…

  • Tidying/reshaping is NOT your assignment. Many of your tables will be awkwardly shaped in the report. That’s OK.
  • Table beauty is not a big deal. Simply printing to “screen” is fine.
  • For all things, graphical and tabular, if you’re dissatisfied with a result, discuss the problem, what you’ve tried and move on (remember my 30-minute rule).

Explore the data

You will explore the reprohealth dataset, which is distributed as an R package from my personal GitHub.

Install it, and remember to use this code only in your R console, not in a script or .Rmd file:

install.packages("remotes") # install the remotes package
library(remotes) # load remotes package so you can install from github
install_github("apreshill/reprohealth") # install the package

Then at the top of your lab, copy and paste this code:

library(tidyverse) # you'll need this too
library(reprohealth) # load the package
wb_stats <- wb_reprohealth # save the data to your local environment

Using R Markdown, do the following in R code chunks:

  • Print the first 10 rows of the wb_stats data (hint: just name it!).
  • Do a dplyr::glimpse of the wb_stats data.
  • Plot life expectancy versus GDP per capita.

In the text portion, answer the following questions based on the output above:

  • How many variables/columns?
  • How many rows/observations?
  • Which variables are numbers?
  • Which are categorical variables (numeric or character variables with variables that have a fixed and known set of possible values; aka factors: variables)?
  • What do you see in the scatterplot?

Explore individual variables

Pick at least one categorical variable and at least one quantitative variable to explore individually (i.e., one-at-a-time).

  • What are possible values (or range, whichever is appropriate) of each variable?
  • What values are typical? What’s the spread? What’s the distribution? Etc., tailored to the variable at hand.

Feel free to use summary stats, tables, figures. We’re NOT expecting high production value (yet).

Explore various plot types

See the ggplot2 tutorial for ideas. Also you might check out:

Make a few plots, probably of the same variable you chose to characterize numerically. Try to explore more than one plot type. Just as an example of what we mean:

  • A scatterplot of two quantitative variables.
  • A plot of one quantitative variable. Maybe a histogram or densityplot or frequency polygon.
  • A plot of one quantitative variable and one categorical. Maybe boxplots for several continents or countries.

You don’t have to use all the data in every plot! It’s fine to filter down to one country or small handful of countries.

Super-extra-bonus-points for playing with knitr chunk options to improve the display of your plots in the knitted html file, try this blog post for tips.

Report your process

You’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc. Give credit to your sources, whether it’s a blog post, a fellow student, an online tutorial, etc.

Grading

Our general grading guidelines

Check minus: There are some mistakes or omissions, such as the number of rows or variables in the data frame. Or maybe it is only in the output but not in the text. There are no plots or there’s just one type of plot (and its probably a scatterplot). It’s hard to figure out where to look for what in this doc.

Check: Hits all the elements. No obvious mistakes. Pleasant to read. No heroic detective work required. Solid.

Check plus: Some “above and beyond”, creativity, etc. You learned something new from reviewing their work and you’re eager to incorporate it into your work now. The ggplot2 figures are quite diverse. The knitted html is very organized and well formatted.

This lab was adapted from Jenny Bryan’s STAT545 class.

Creative Commons License