Day 9

Merging data frames

  • Many organizations store various types of data in different sources
    • different “spreadsheets” or databases
    • think about all the info CMC has about you!
      • registrar (grades, courses); Dean of Student (personal info); Admissions (HS, application info)
  • We need a common key that can ID the same individual across multiple data tables/bases.
    • name, ID number are common
    • anything that uniquely IDs your units of analysis
  • Today: joining/merging tables using a common key

Different types of joins

  • Mutating joins - combine vars from both tables
    • inner_join(x,y, by=c("x_key"="y_key")): returns all rows in x that have a match in y
    • left_join(x,y, by=c("x_key"="y_key")): returns all rows in x, will fill NA in y columns with no match
    • right_join(x,y, by=c("x_key"="y_key")): returns all rows in y, will fill NA in x columns with no match
    • full_join(x,y, by=c("x_key"="y_key")): returns all rows from both x and y, will fill NA in columns with no match
  • Filtering joins - keep variables from left-hand table
    • semi_join(x,y, by=c("x_key"="y_key")): returns all rows from x where there are matching values in y, keeping just columns from x
    • anti_join(x,y, by=c("x_key"="y_key")): returns all rows from x where there are not matching values in y, keeping just columns from x

Example: class grades

  • I want to know how people are performing in this class relative to their average GPA
  • Load dataframe with class grades
> class_grades <- data.frame(student_ID=1:5,
+                            grade=c("A","A+","B-","B+","A-"),
+                            college=c("PIT","CMC","CMC","CMC","SRC"))
> class_grades %>% as.tbl
# A tibble: 5 x 3
  student_ID grade college
       <int> <fct> <fct>  
1          1 A     PIT    
2          2 A+    CMC    
3          3 B-    CMC    
4          4 B+    CMC    
5          5 A-    SRC    

Example: class grades

  • I can get access to CMC students’ GPAs but not the other colleges
> GPA <- data.frame(CMC_ID=c(2:4,10:11),gpa=c(2.7,4.0,3.1,3.6,3.7))
> GPA %>% as.tbl
# A tibble: 5 x 2
  CMC_ID   gpa
   <int> <dbl>
1      2   2.7
2      3   4  
3      4   3.1
4     10   3.6
5     11   3.7

Example: class grades

  • How should I join the two tables?
    • inner_join seems like a good solution
  • What keys to use?
    • The keys have different variable names but that’s OK
> inner_join(class_grades,GPA,by = c("student_ID"="CMC_ID")) %>% as.tbl
# A tibble: 3 x 4
  student_ID grade college   gpa
       <int> <fct> <fct>   <dbl>
1          2 A+    CMC       2.7
2          3 B-    CMC       4  
3          4 B+    CMC       3.1
  • What are the downsides to this inner join?
    • I don’t see my PIT and SRC students anymore
  • What if I want to compare grades within the class as well as GPAs?

Example: class grades

  • Let’s try a left_join to keep all students
> left_join(class_grades,GPA,by = c("student_ID"="CMC_ID")) %>% as.tbl
# A tibble: 5 x 4
  student_ID grade college   gpa
       <int> <fct> <fct>   <dbl>
1          1 A     PIT      NA  
2          2 A+    CMC       2.7
3          3 B-    CMC       4  
4          4 B+    CMC       3.1
5          5 A-    SRC      NA  
  • Left join will never remove information from the left dataframe
    • Perhaps the fact that there isn’t a match is important
  • Because I can’t access PIT and SRC data, GPAs will be NA for those students

Example: class grades

  • What will a right_join give us?
> right_join(class_grades,GPA,by = c("student_ID"="CMC_ID")) %>% as.tbl
# A tibble: 5 x 4
  student_ID grade college   gpa
       <int> <fct> <fct>   <dbl>
1          2 A+    CMC       2.7
2          3 B-    CMC       4  
3          4 B+    CMC       3.1
4         10 <NA>  <NA>      3.6
5         11 <NA>  <NA>      3.7
  • All the students that I have GPA information for, even if they are not in my class

Example: class grades

  • What does full_join do?
> full_join(class_grades,GPA,by = c("student_ID"="CMC_ID")) %>% as.tbl
# A tibble: 7 x 4
  student_ID grade college   gpa
       <int> <fct> <fct>   <dbl>
1          1 A     PIT      NA  
2          2 A+    CMC       2.7
3          3 B-    CMC       4  
4          4 B+    CMC       3.1
5          5 A-    SRC      NA  
6         10 <NA>  <NA>      3.6
7         11 <NA>  <NA>      3.7
  • All possible information

Example: class grades

  • What if I just want to know which rows have a match?
> semi_join(class_grades,GPA,by = c("student_ID"="CMC_ID")) %>% as.tbl
# A tibble: 3 x 3
  student_ID grade college
       <int> <fct> <fct>  
1          2 A+    CMC    
2          3 B-    CMC    
3          4 B+    CMC    
  • Only keeps variables from the left table (no GPA)
  • Only returns rows which have a match
    • Similar to inner_join but will never duplicate rows of left table

Example: class grades

  • What if I just want to know which rows don’t match
> anti_join(class_grades,GPA,by = c("student_ID"="CMC_ID")) %>% as.tbl
# A tibble: 2 x 3
  student_ID grade college
       <int> <fct> <fct>  
1          1 A     PIT    
2          5 A-    SRC    
  • Tells me which students don’t have a GPA match (BC they are not CMC students)
  • Sometimes identifying which records don’t match is the goal itself
    • For example, may tell you which years, states, students are missing so you can request that data