Day 4

Exploratory data analysis (EDA)

  • Real-world data is often messy
  • Ex: you come up with a thesis topic and write down all the theory and statistical analysis you want to perform
    • You find a relevant dataset, but should you trust it?
  • With large datasets, it is unrealistic to go through and check every value by hand
  • Plotting a few key relationships can help us determine how usable our data is
  • ggplot is your best friend when it comes to EDA!

ggplot2: a grammar for graphics

  • created by Hadley Wickham
  • deliberately written to follow a consistent grammar of graphics
  • allows you to create a graph in small, readable code chunks
  • install the ggplot2 package on your computer (if needed)
> library(ggplot2)
> packageVersion("ggplot2")
[1] '3.2.1'

ggplot2 grammar

  • If you are really interested in the theory of Data Visualization, see Ch. 2
    • Basic idea is that there is a hierarchy of how we notice graphical features
  • ggplot attempts to follow that hierarchy
ggplot(data) +                          # data
  <geometry_funs>(aes(<variables>)) +   # aesthetic variable mapping
  <label_funs> +                        # add context
  <facet_funs> +                        # add facets (optional)
  <coordinate_funs> +                   # play with coords (optional)
  <scale_funs> +                        # play with scales (optional)
  <theme_funs>                          # play with axes, colors, etc (optional)
  • can layer geometry functions
  • can plot stats from raw data
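
As a rough illustration of the template above, the sketch below fills in each slot using a small made-up data frame (toy, with hypothetical columns hours, score and group; none of it comes from the course data). It layers two geometries, and the second one (geom_smooth) plots a fitted line computed from the raw data.

```r
library(ggplot2)

# Hypothetical toy data, used only to illustrate the template
toy <- data.frame(
  hours = c(2, 4, 6, 8, 10, 12),
  score = c(55, 60, 68, 71, 80, 86),
  group = c("a", "b", "a", "b", "a", "b")
)

p <- ggplot(toy) +                                   # data
  geom_point(aes(x = hours, y = score)) +            # geometry with aesthetic mapping
  geom_smooth(aes(x = hours, y = score),             # second layer: a statistic
              method = "lm", formula = y ~ x,        #   (least-squares line) computed
              se = FALSE) +                          #   from the raw data
  labs(title = "Score by study hours",
       x = "Hours", y = "Score") +                   # context
  facet_wrap(~group) +                               # facets (optional)
  theme(text = element_text(size = 14))              # text size (optional)

print(p)
```

Each line after the first can be removed or swapped without touching the others, which is the main practical benefit of building the plot in layers.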

Survey data

  • the dataset Survey provides student data from a survey:
> survey <- read.csv("https://raw.githubusercontent.com/mgelman/data/master/Survey.csv")
  • Look first at GPA (Q15) and science or non-science major (Q16)

Data

  • The first layer sets the graphical environment
> ggplot(data=survey) 

Aesthetics

  • Aesthetics describe the mapping of variables to your graph
    • for numeric variables, the default coordinate system is Cartesian
  • aes() mappings can also be given in the ggplot() call, where they apply to every layer
> ggplot(data=survey, aes(x=Question.15)) 

Geometry

  • The geometry determines what form the plot has
  • What type of plot would be useful here?
    • Here a histogram makes sense:
> ggplot(data=survey, aes(x=Question.15)) + geom_histogram()
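
By default geom_histogram() uses 30 bins and prints a message suggesting that you pick a binwidth. The sketch below uses made-up GPA values (gpa_toy is hypothetical, and the value 35 stands in for a typo such as a missing decimal point) to show an explicit binwidth next to a quick numeric check of the same problem the plot reveals.

```r
library(ggplot2)

# Hypothetical GPA values; 35 mimics a data-entry typo for 3.5
gpa_toy <- data.frame(gpa = c(2.1, 2.8, 3.0, 3.2, 3.3, 3.5, 3.6, 3.7, 3.9, 35))

# An explicit binwidth replaces the default of 30 bins
p_hist <- ggplot(gpa_toy, aes(x = gpa)) + geom_histogram(binwidth = 0.25)
print(p_hist)

# Numeric check: the maximum is far outside the 0-4 GPA scale
gpa_range <- range(gpa_toy$gpa)
print(gpa_range)
```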

EDA for data clean up

  • The histogram shows implausible GPA values, most likely typos in the data.
  • Subset the data so only cases with a reasonable GPA are included.
> survey <- survey[survey$Question.15 <=4.0 & survey$Question.15 >0,]
> ggplot(data=survey, aes(x=Question.15)) + geom_histogram()
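
One caveat with logical subsetting: if the GPA column contains missing values, the comparison returns NA for those cases, and indexing with NA produces a row made up entirely of NAs. A minimal sketch on a made-up data frame (gpa_toy and its columns are hypothetical) shows the effect and one possible way around it, wrapping the condition in which():

```r
# Hypothetical data with one missing GPA, one typo (35) and one zero
gpa_toy <- data.frame(id = 1:5, gpa = c(3.2, NA, 35, 0, 3.9))

keep <- gpa_toy$gpa <= 4.0 & gpa_toy$gpa > 0   # TRUE, NA, FALSE, FALSE, TRUE

naive <- gpa_toy[keep, ]          # the NA in keep yields an all-NA row
clean <- gpa_toy[which(keep), ]   # which() keeps only the TRUE positions

print(naive)
print(clean)
```

Whether missing GPAs should be dropped at this stage is a judgment call; the point is only that the two forms behave differently.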

Faceting

  • Can add another variable, science or non-science major (Q16), with a facet:
> ggplot(data=survey, aes(x=Question.15)) + geom_histogram() + facet_wrap(~Question.16)

Another option

  • Could also use side-by-side boxplots:
> ggplot(data=survey, aes(y=Question.15, x=Question.16)) + geom_boxplot() 

  • This is where ggplot shines. Very easy to change the plot

ggplot and NA’s

  • Need to omit NA rows from data if you want them omitted from your graphics
> survey <- survey[!is.na(survey$Question.16),]
> ggplot(data=survey, aes(y=Question.15, x=Question.16)) + geom_boxplot() 
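
Before dropping rows it can help to count how many are affected. The sketch below uses a made-up data frame (toy, with hypothetical columns gpa and major) to count missing values and then filter on one column or on several at once; complete.cases() is a base R alternative to chaining is.na() checks.

```r
# Hypothetical data with missing values in both columns
toy <- data.frame(
  gpa   = c(3.1, 3.5, 2.9, 3.8, NA),
  major = c("sci", NA, "non-sci", "sci", "sci")
)

# Count categories, including NA
print(table(toy$major, useNA = "ifany"))

# Drop rows with a missing major only
toy_major <- toy[!is.na(toy$major), ]

# Drop rows with a missing value in either plotted column
toy_both <- toy[complete.cases(toy[, c("gpa", "major")]), ]

print(nrow(toy_major))
print(nrow(toy_both))
```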

Adjusting coordinates

  • We can flip the x/y coordinates (in ggplot2 3.2.1, geom_boxplot expects x to be categorical and y numeric)
> ggplot(data=survey, aes(y=Question.15, x=Question.16)) + geom_boxplot() + coord_flip()

Context! Add labels

  • Options include ggtitle() and labs(); theme() can change the text size too
> ggplot(data=survey, aes(y=Question.15, x=Question.16)) + geom_boxplot() + coord_flip() + 
+   labs(title="GPA by major", x="Major Area", y="GPA") + 
+   theme(text=element_text(size=18))
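
For comparison, here is a sketch of the same labelling done with ggtitle(), xlab() and ylab() instead of labs(), on a made-up data frame (toy and its values are hypothetical). Note that labels follow the aesthetic rather than the screen position, so the x label ends up on the vertical axis once coord_flip() is applied.

```r
library(ggplot2)

# Hypothetical data standing in for GPA by major
toy <- data.frame(
  major = rep(c("science", "non-science"), each = 4),
  gpa   = c(3.1, 3.4, 3.6, 3.9, 2.8, 3.2, 3.5, 3.7)
)

p_lab <- ggplot(toy, aes(x = major, y = gpa)) +
  geom_boxplot() +
  coord_flip() +
  ggtitle("GPA by major") +   # same effect as labs(title = ...)
  xlab("Major Area") +        # labels the x aesthetic (major), drawn
  ylab("GPA") +               #   on the vertical axis after the flip
  theme(text = element_text(size = 18))

print(p_lab)
```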

How to proceed

  • Get to know basic command structure (ggplot + geom)
    • Use cheat sheet to see aes options for each geom
  • Then add context layers: labels, font sizes, etc
  • ?theme: non-data features (fonts, legends, axes)
  • ?scale_: data features (how x, y, fill, shape are scaled)
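
A small sketch of the theme/scale split, using a made-up data frame (toy and its columns friends, gpa and major are hypothetical): the scale_ functions change how data values are displayed, while theme() changes only non-data appearance.

```r
library(ggplot2)

# Hypothetical data
toy <- data.frame(
  friends = c(100, 250, 400, 600, 800, 950),
  gpa     = c(3.8, 3.5, 3.1, 3.3, 2.9, 3.0),
  major   = c("sci", "non-sci", "sci", "non-sci", "sci", "non-sci")
)

p_base <- ggplot(toy, aes(x = friends, y = gpa, color = major)) +
  geom_point(size = 3)

p_styled <- p_base +
  scale_x_continuous(name = "Facebook friends",
                     limits = c(0, 1000)) +      # data feature: x axis range and title
  scale_color_discrete(name = "major") +         # data feature: legend title for color
  theme(legend.position = "bottom",              # non-data: where the legend sits
        axis.text = element_text(size = 12))     # non-data: tick label size

print(p_styled)
```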

Scatterplot

  • The previous examples looked at the distribution of one variable (GPA)
  • Now, let’s look at the relationship between two variables
    • number of Facebook friends (Q12) and GPA (Q15)
> ggplot(survey, aes(x=Question.12,y=Question.15)) + geom_point() 

Scatterplot

Change symbol size and shape

> ggplot(survey, aes(x=Question.12,y=Question.15)) + geom_point(size=3,shape=2) 

Scatterplot

Change symbol shape by major (Q16): requires aes argument!

> ggplot(survey, aes(x=Question.12,y=Question.15)) + 
+   geom_point(aes(shape=Question.16), size=3) 
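
The difference between setting and mapping can be sketched on a made-up data frame (toy and its columns are hypothetical). An argument outside aes() is a fixed value applied to every point; an argument inside aes() is a variable whose values choose the symbol.

```r
library(ggplot2)

# Hypothetical data
toy <- data.frame(
  friends = c(100, 250, 400, 600, 800, 950),
  gpa     = c(3.8, 3.5, 3.1, 3.3, 2.9, 3.0),
  major   = c("sci", "non-sci", "sci", "non-sci", "sci", "non-sci")
)

# Setting: every point gets shape 2 (open triangle), no legend
p_set <- ggplot(toy, aes(x = friends, y = gpa)) +
  geom_point(shape = 2, size = 3)

# Mapping: shape depends on major, and a legend is added
p_map <- ggplot(toy, aes(x = friends, y = gpa)) +
  geom_point(aes(shape = major), size = 3)

print(p_set)
print(p_map)
```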

Scatterplot

Change symbol color by major (Q16): requires aes argument!

  • color is mapped to the discrete variable “major”, which is why scale_color_discrete is used
> ggplot(survey, aes(x=Question.12,y=Question.15)) + 
+   geom_point(aes(color=Question.16), size=3) + scale_color_discrete(name="major")

Scatterplot

Change symbol size by number of tv hours/week (Q14): requires aes argument!

> ggplot(survey, aes(x=Question.12,y=Question.15)) + 
+   geom_point(aes(color=Question.16, size=Question.14)) + scale_color_discrete(name="major")

Scatterplot

Size is mapped to the continuous variable tv, which is why scale_size_continuous is used

> ggplot(survey, aes(x=Question.12,y=Question.15)) + 
+   geom_point(aes(color=Question.16, size=Question.14)) + scale_color_discrete(name="major") + 
+   scale_size_continuous(name="tv hours/week")
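
Besides the legend title, scale_size_continuous() also accepts range (smallest and largest symbol size) and breaks (which values appear in the legend). A minimal sketch on made-up data (toy and its columns are hypothetical; the range and breaks values are arbitrary choices for illustration):

```r
library(ggplot2)

# Hypothetical data
toy <- data.frame(
  friends = c(100, 250, 400, 600, 800, 950),
  gpa     = c(3.8, 3.5, 3.1, 3.3, 2.9, 3.0),
  tv      = c(0, 2, 5, 8, 12, 20)
)

p_size <- ggplot(toy, aes(x = friends, y = gpa)) +
  geom_point(aes(size = tv)) +
  scale_size_continuous(name = "tv hours/week",
                        range = c(1, 8),          # symbol sizes for min and max tv
                        breaks = c(0, 10, 20))    # values shown in the legend

print(p_size)
```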

Bar graphs

How does political comfort level (Q9) vary by religious group (Q8)?

> survey$Question.8 <- as.factor(survey$Question.8)  # read.csv gives character columns in R >= 4.0
> levels(survey$Question.8) <- c("not religious","religious, active", "religious, not active")
> ggplot(survey, aes(x=Question.8, fill=Question.9)) + geom_bar(position="fill")
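
With position="fill", each bar is rescaled to height 1, so the segments show proportions within each group rather than counts. The sketch below checks this against a table of row proportions, using a made-up data frame (toy, with hypothetical columns religion and comfort):

```r
library(ggplot2)

# Hypothetical responses
toy <- data.frame(
  religion = c("not religious", "not religious", "not religious", "not religious",
               "religious, active", "religious, active",
               "religious, not active", "religious, not active"),
  comfort  = c("high", "high", "high", "low", "high", "low", "low", "low")
)

# Proportions within each religion group (bars all have height 1)
p_fill <- ggplot(toy, aes(x = religion, fill = comfort)) +
  geom_bar(position = "fill")

# Default stacking shows raw counts instead
p_stack <- ggplot(toy, aes(x = religion, fill = comfort)) +
  geom_bar()

# The same proportions as a table: each row sums to 1
props <- prop.table(table(toy$religion, toy$comfort), margin = 1)
print(round(props, 2))

print(p_fill)
print(p_stack)
```

The filled version makes groups of different sizes comparable, at the cost of hiding how many responses each bar represents.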