Data wrangling: multiple data tables

Day 9

Merging data frames

Many organizations store various types of data in different sources
- different “spreadsheets” or databases
- think about all the info CMC has about you!
  - registrar (grades, courses); Dean of Student (personal info); Admissions (HS, application info)
We need a common key that can ID the same individual across multiple data tables/bases.
- name, ID number are common
- anything that uniquely IDs your units of analysis
Today: joining/merging tables using a common key

Different types of joins

Mutating joins - combine vars from both tables
- inner_join(x,y, by=c("x_key"="y_key")): returns all rows in x that have a match in y
- left_join(x,y, by=c("x_key"="y_key")): returns all rows in x, will fill NA in y columns with no match
- right_join(x,y, by=c("x_key"="y_key")): returns all rows in y, will fill NA in x columns with no match
- full_join(x,y, by=c("x_key"="y_key")): returns all rows from both x and y, will fill NA in columns with no match
Filtering joins - keep variables from left-hand table
- semi_join(x,y, by=c("x_key"="y_key")): returns all rows from x where there are matching values in y, keeping just columns from x
- anti_join(x,y, by=c("x_key"="y_key")): returns all rows from x where there are not matching values in y, keeping just columns from x

Example: class grades

I want to know how people are performing in this class relative to their average GPA
Load dataframe with class grades

> class_grades <- data.frame(student_ID=1:5,
+                            grade=c("A","A+","B-","B+","A-"),
+                            college=c("PIT","CMC","CMC","CMC","SRC"))
> class_grades %>% as.tbl
# A tibble: 5 x 3
  student_ID grade college
       <int> <fct> <fct>  
1          1 A     PIT    
2          2 A+    CMC    
3          3 B-    CMC    
4          4 B+    CMC    
5          5 A-    SRC

Example: class grades

I can get access to CMC students’ GPAs but not the other colleges

> GPA <- data.frame(CMC_ID=c(2:4,10:11),gpa=c(2.7,4.0,3.1,3.6,3.7))
> GPA %>% as.tbl
# A tibble: 5 x 2
  CMC_ID   gpa
   <int> <dbl>
1      2   2.7
2      3   4  
3      4   3.1
4     10   3.6
5     11   3.7

Example: class grades

How should I join the two tables?
- inner_join seems like a good solution
What keys to use?
- The keys have different variable names but that’s OK

> inner_join(class_grades,GPA,by = c("student_ID"="CMC_ID")) %>% as.tbl
# A tibble: 3 x 4
  student_ID grade college   gpa
       <int> <fct> <fct>   <dbl>
1          2 A+    CMC       2.7
2          3 B-    CMC       4  
3          4 B+    CMC       3.1

What are the downsides to this inner join?
- I don’t see my PIT and SRC students anymore
What if I want to compare grades within the class as well as GPAs?

Example: class grades

Let’s try a left_join to keep all students

> left_join(class_grades,GPA,by = c("student_ID"="CMC_ID")) %>% as.tbl
# A tibble: 5 x 4
  student_ID grade college   gpa
       <int> <fct> <fct>   <dbl>
1          1 A     PIT      NA  
2          2 A+    CMC       2.7
3          3 B-    CMC       4  
4          4 B+    CMC       3.1
5          5 A-    SRC      NA

Left join will never remove information from the left dataframe
- Perhaps the fact that there isn’t a match is important
Because I can’t access PIT and SRC data, GPAs will be NA for those students

Example: class grades

What will a right_join give us?

> right_join(class_grades,GPA,by = c("student_ID"="CMC_ID")) %>% as.tbl
# A tibble: 5 x 4
  student_ID grade college   gpa
       <int> <fct> <fct>   <dbl>
1          2 A+    CMC       2.7
2          3 B-    CMC       4  
3          4 B+    CMC       3.1
4         10 <NA>  <NA>      3.6
5         11 <NA>  <NA>      3.7

All the students that I have GPA information for, even if they are not in my class

Example: class grades

What does full_join do?

> full_join(class_grades,GPA,by = c("student_ID"="CMC_ID")) %>% as.tbl
# A tibble: 7 x 4
  student_ID grade college   gpa
       <int> <fct> <fct>   <dbl>
1          1 A     PIT      NA  
2          2 A+    CMC       2.7
3          3 B-    CMC       4  
4          4 B+    CMC       3.1
5          5 A-    SRC      NA  
6         10 <NA>  <NA>      3.6
7         11 <NA>  <NA>      3.7

All possible information

Example: class grades

What if I just want to know which rows have a match?

> semi_join(class_grades,GPA,by = c("student_ID"="CMC_ID")) %>% as.tbl
# A tibble: 3 x 3
  student_ID grade college
       <int> <fct> <fct>  
1          2 A+    CMC    
2          3 B-    CMC    
3          4 B+    CMC

Only keeps variables from the left table (no GPA)
Only returns rows which have a match
- Similar to inner_join but will never duplicate rows of left table

Example: class grades

What if I just want to know which rows don’t match

> anti_join(class_grades,GPA,by = c("student_ID"="CMC_ID")) %>% as.tbl
# A tibble: 2 x 3
  student_ID grade college
       <int> <fct> <fct>  
1          1 A     PIT    
2          5 A-    SRC

Tells me which students don’t have a GPA match (BC they are not CMC students)
Sometimes identifying which records don’t match is the goal itself
- For example, may tell you which years, states, students are missing so you can request that data