Day 7

Data Wrangling

  • Data manipulation is an extremely important data science skill
  • methods for filtering, transforming, restructuring and combining datasets should be
    • readable
    • reproducible
    • include checks of data quality (typos, code errors, etc)
  • Hadley Wickham’s dplyr package is designed for this type of wrangling

The dplyr package

  • package is designed for readable data wrangling
    • everything you can do with dplyr can be done without it (but maybe not as nicely/easily)
  • There are 5 key verbs (actions):
    • filter(): pick observations (rows) of interest (subset)
    • arrange(): sort obervations
    • select(): pick variables (columns)
    • mutate(): modify existing variables or create new variables
    • summarize(): collapse many values down into a single summary
  • group_by(): changes the scope, or unit of analysis, from individual cases to aggregated groups
    • use with either summarize or mutate

filter

  • Basic command: filter(data, condition)
  • filter(data, cond1, cond2): rows that meet for cond1 AND cond2
  • Let’s filter (subset) the loans data to get the 300 bad loans cases:
> loans <- read.csv("https://raw.githubusercontent.com/mgelman/data/master/CreditData.csv")
> badLoans  <- filter(loans, Good.Loan == "BadLoan")
> dim(badLoans)
[1] 300  21
> badLoans2 <- loans[loans$Good.Loan == "BadLoan",]
> dim(badLoans2)
[1] 300  21
  • What about bad loans that are costly?!
> costlyBadLoans <- filter(loans, Good.Loan == "BadLoan", Credit.amount > 10000)
> dim(costlyBadLoans)
[1] 24 21

select

  • Basic command:
    • select(data, var1, var2) gets vars 1 and 2 (add - to omit)
    • select(data, var3:var5) gets vars 3-5 (add - to omit)
    • select(data, starts_with("ab")) select variables starting with “ab”
    • select(data, ends_with("yz")) select variables ending with “yz”
    • select(data, contains("mn")) selects variables containing “mn” strings
  • Omit variables with a period . in the name:
> loansNoPeriods <- select(loans, -contains("."))
> names(loansNoPeriods)
[1] "Purpose"   "Property"  "Housing"   "Job"       "Telephone"

select

  • Get credit amount from loans
> credit <- select(loans, Credit.amount)
> str(credit)  # one vector is still a data frame! 
'data.frame':   1000 obs. of  1 variable:
 $ Credit.amount: int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
  • dplyr commands preserve data frame class
  • base-R subsetting does not always preserve class
> credit2 <- loans[, "Credit.amount"]
> str(credit2)  
 int [1:1000] 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...

Piping

  • piping %>% allows you to put the output of one function into the input of another
    • akin to function composition
    • x %>% f() %>% g() is equal to g(f(x))
    • x %>% f() %>% g() is read: take x, then apply function f, then apply function g
    • g(f(x)) is read the opposite direction: apply function g to function f applied to x
    • Think of it like an assembly line where you pass the completed task on to another worker

Piping

  • Let’s pull the credit and duration for bad loans
> badloans <- loans %>% 
+   filter(Good.Loan == "BadLoan") %>%
+   select(Credit.amount, Duration.in.month)

is the same as

> badLoans2  <- select(filter(loans, Good.Loan == "BadLoan"), Credit.amount, Duration.in.month)
  • Same but more readable (means less likely to make errors)
  • How does the filter command differ in the piping vs non-piping case?

mutate

  • Basic command: mutate(data, newvar1= fun1, newvar2= fun2)
  • Adds a newly created variable(s) to data
    • can use newvar1 to create newvar2 in the same command
  • Called a vectorized command because it applies a function to all individual variable entries (preserved data frame length)
> loans %>%
+             mutate(durationYears = Duration.in.month/12) %>%
+             select(Duration.in.month, durationYears)
     Duration.in.month durationYears
1                    6     0.5000000
2                   48     4.0000000
3                   12     1.0000000
4                   42     3.5000000
5                   24     2.0000000
6                   36     3.0000000
7                   24     2.0000000
8                   36     3.0000000
9                   12     1.0000000
10                  30     2.5000000
11                  12     1.0000000
12                  48     4.0000000
13                  12     1.0000000
14                  24     2.0000000
15                  15     1.2500000
16                  24     2.0000000
17                  24     2.0000000
18                  30     2.5000000
19                  24     2.0000000
20                  24     2.0000000
21                   9     0.7500000
22                   6     0.5000000
23                  10     0.8333333
24                  12     1.0000000
25                  10     0.8333333
26                   6     0.5000000
27                   6     0.5000000
28                  12     1.0000000
29                   7     0.5833333
30                  60     5.0000000
31                  18     1.5000000
32                  24     2.0000000
33                  18     1.5000000
34                  12     1.0000000
35                  12     1.0000000
36                  45     3.7500000
37                  48     4.0000000
38                  18     1.5000000
39                  10     0.8333333
40                   9     0.7500000
41                  30     2.5000000
42                  12     1.0000000
43                  18     1.5000000
44                  30     2.5000000
45                  48     4.0000000
46                  11     0.9166667
47                  36     3.0000000
48                   6     0.5000000
49                  11     0.9166667
50                  12     1.0000000
51                  24     2.0000000
52                  27     2.2500000
53                  12     1.0000000
54                  18     1.5000000
55                  36     3.0000000
56                   6     0.5000000
57                  12     1.0000000
58                  36     3.0000000
59                  18     1.5000000
60                  36     3.0000000
61                   9     0.7500000
62                  15     1.2500000
63                  36     3.0000000
64                  48     4.0000000
65                  24     2.0000000
66                  27     2.2500000
67                  12     1.0000000
68                  12     1.0000000
69                  36     3.0000000
70                  36     3.0000000
71                  36     3.0000000
72                   7     0.5833333
73                   8     0.6666667
74                  42     3.5000000
75                  36     3.0000000
76                  12     1.0000000
77                  42     3.5000000
78                  11     0.9166667
79                  54     4.5000000
80                  30     2.5000000
81                  24     2.0000000
82                  15     1.2500000
83                  18     1.5000000
84                  24     2.0000000
85                  10     0.8333333
86                  12     1.0000000
87                  18     1.5000000
88                  36     3.0000000
89                  18     1.5000000
90                  12     1.0000000
91                  12     1.0000000
92                  12     1.0000000
93                  12     1.0000000
94                  24     2.0000000
95                  12     1.0000000
96                  54     4.5000000
97                  12     1.0000000
98                  18     1.5000000
99                  36     3.0000000
100                 20     1.6666667
101                 24     2.0000000
102                 36     3.0000000
103                  6     0.5000000
104                  9     0.7500000
105                 12     1.0000000
106                 24     2.0000000
107                 18     1.5000000
108                 12     1.0000000
109                 24     2.0000000
110                 14     1.1666667
111                  6     0.5000000
112                 15     1.2500000
113                 18     1.5000000
114                 36     3.0000000
115                 12     1.0000000
116                 48     4.0000000
117                 42     3.5000000
118                 10     0.8333333
119                 33     2.7500000
120                 12     1.0000000
121                 21     1.7500000
122                 24     2.0000000
123                 12     1.0000000
124                 10     0.8333333
125                 18     1.5000000
126                 12     1.0000000
127                 12     1.0000000
128                 12     1.0000000
129                 12     1.0000000
130                 12     1.0000000
131                 48     4.0000000
132                 36     3.0000000
133                 15     1.2500000
134                 18     1.5000000
135                 60     5.0000000
136                 12     1.0000000
137                 27     2.2500000
138                 12     1.0000000
139                 15     1.2500000
140                 12     1.0000000
141                  6     0.5000000
142                 36     3.0000000
143                 27     2.2500000
144                 18     1.5000000
145                 21     1.7500000
146                 48     4.0000000
147                  6     0.5000000
148                 12     1.0000000
149                 36     3.0000000
150                 18     1.5000000
151                  6     0.5000000
152                 10     0.8333333
153                 36     3.0000000
154                 24     2.0000000
155                 24     2.0000000
156                 12     1.0000000
157                  9     0.7500000
158                 12     1.0000000
159                 24     2.0000000
160                  6     0.5000000
161                 24     2.0000000
162                 18     1.5000000
163                 15     1.2500000
164                 10     0.8333333
165                 36     3.0000000
166                  6     0.5000000
167                 18     1.5000000
168                 11     0.9166667
169                 24     2.0000000
170                 24     2.0000000
171                 15     1.2500000
172                 12     1.0000000
173                 24     2.0000000
174                  8     0.6666667
175                 21     1.7500000
176                 30     2.5000000
177                 12     1.0000000
178                  6     0.5000000
179                 12     1.0000000
180                 21     1.7500000
181                 36     3.0000000
182                 36     3.0000000
183                 21     1.7500000
184                 24     2.0000000
185                 18     1.5000000
186                 15     1.2500000
187                  9     0.7500000
188                 16     1.3333333
189                 12     1.0000000
190                 18     1.5000000
191                 24     2.0000000
192                 48     4.0000000
193                 27     2.2500000
194                  6     0.5000000
195                 45     3.7500000
196                  9     0.7500000
197                  6     0.5000000
198                 12     1.0000000
199                 24     2.0000000
200                 18     1.5000000
201                  9     0.7500000
202                 12     1.0000000
203                 27     2.2500000
204                 12     1.0000000
205                 12     1.0000000
206                 30     2.5000000
207                 12     1.0000000
208                 12     1.0000000
209                 24     2.0000000
210                 12     1.0000000
211                  9     0.7500000
212                 36     3.0000000
213                 27     2.2500000
214                 30     2.5000000
215                 36     3.0000000
216                  6     0.5000000
217                 18     1.5000000
218                 36     3.0000000
219                 24     2.0000000
220                 10     0.8333333
221                 12     1.0000000
222                 12     1.0000000
223                 12     1.0000000
224                 24     2.0000000
225                 15     1.2500000
226                 36     3.0000000
227                 48     4.0000000
228                 12     1.0000000
229                  9     0.7500000
230                 24     2.0000000
231                 36     3.0000000
232                  9     0.7500000
233                 12     1.0000000
234                 18     1.5000000
235                  4     0.3333333
236                 24     2.0000000
237                  6     0.5000000
238                 21     1.7500000
239                 12     1.0000000
240                 30     2.5000000
241                 24     2.0000000
242                  6     0.5000000
243                 48     4.0000000
244                 12     1.0000000
245                 12     1.0000000
246                 24     2.0000000
247                 12     1.0000000
248                  6     0.5000000
249                 24     2.0000000
250                 18     1.5000000
251                  6     0.5000000
252                 12     1.0000000
253                 30     2.5000000
254                 24     2.0000000
255                  9     0.7500000
256                 60     5.0000000
257                 24     2.0000000
258                 12     1.0000000
259                 15     1.2500000
260                 11     0.9166667
261                 12     1.0000000
262                 24     2.0000000
263                 18     1.5000000
264                 12     1.0000000
265                 10     0.8333333
266                 15     1.2500000
267                 36     3.0000000
268                 24     2.0000000
269                 14     1.1666667
270                 24     2.0000000
271                 18     1.5000000
272                 12     1.0000000
273                 48     4.0000000
274                 48     4.0000000
275                 30     2.5000000
276                  9     0.7500000
277                 18     1.5000000
278                 12     1.0000000
279                  6     0.5000000
280                 24     2.0000000
281                 15     1.2500000
282                 12     1.0000000
283                 18     1.5000000
284                 15     1.2500000
285                 24     2.0000000
286                 47     3.9166667
287                 48     4.0000000
288                 48     4.0000000
289                 12     1.0000000
290                 24     2.0000000
291                 12     1.0000000
292                 36     3.0000000
293                 24     2.0000000
294                 42     3.5000000
295                 48     4.0000000
296                 48     4.0000000
297                 12     1.0000000
298                 10     0.8333333
299                 18     1.5000000
300                 21     1.7500000
301                  6     0.5000000
302                 36     3.0000000
303                 24     2.0000000
304                 10     0.8333333
305                 48     4.0000000
306                  6     0.5000000
307                 30     2.5000000
308                 12     1.0000000
309                  8     0.6666667
310                  9     0.7500000
311                 48     4.0000000
312                 24     2.0000000
313                 24     2.0000000
314                 12     1.0000000
315                  4     0.3333333
316                 36     3.0000000
317                 12     1.0000000
318                 24     2.0000000
319                 12     1.0000000
320                 15     1.2500000
321                 30     2.5000000
322                 24     2.0000000
323                 24     2.0000000
324                 18     1.5000000
325                 18     1.5000000
326                  8     0.6666667
327                 12     1.0000000
328                 24     2.0000000
329                 36     3.0000000
330                  6     0.5000000
331                 24     2.0000000
332                 18     1.5000000
333                 60     5.0000000
334                 48     4.0000000
335                 24     2.0000000
336                  6     0.5000000
337                 13     1.0833333
338                 15     1.2500000
339                 24     2.0000000
340                 10     0.8333333
341                 24     2.0000000
342                 21     1.7500000
343                 18     1.5000000
344                 18     1.5000000
345                 10     0.8333333
346                 15     1.2500000
347                 13     1.0833333
348                 24     2.0000000
349                  6     0.5000000
350                  9     0.7500000
351                  9     0.7500000
352                  9     0.7500000
353                 18     1.5000000
354                 12     1.0000000
355                 10     0.8333333
356                 24     2.0000000
357                 12     1.0000000
358                 36     3.0000000
359                 12     1.0000000
360                 30     2.5000000
361                 18     1.5000000
362                 12     1.0000000
363                 12     1.0000000
364                  6     0.5000000
365                 18     1.5000000
366                 12     1.0000000
367                 18     1.5000000
368                 18     1.5000000
369                 36     3.0000000
370                 18     1.5000000
371                 36     3.0000000
372                 18     1.5000000
373                 10     0.8333333
374                 60     5.0000000
375                 60     5.0000000
376                 48     4.0000000
377                 18     1.5000000
378                  7     0.5833333
379                 36     3.0000000
380                  6     0.5000000
381                 20     1.6666667
382                 18     1.5000000
383                 22     1.8333333
384                 12     1.0000000
385                 30     2.5000000
386                 18     1.5000000
387                 18     1.5000000
388                 18     1.5000000
389                 15     1.2500000
390                  9     0.7500000
391                 18     1.5000000
392                 12     1.0000000
393                 36     3.0000000
394                  6     0.5000000
395                  9     0.7500000
396                 39     3.2500000
397                 12     1.0000000
398                 36     3.0000000
399                 12     1.0000000
400                 24     2.0000000
401                 18     1.5000000
402                 18     1.5000000
403                 24     2.0000000
404                 14     1.1666667
405                 18     1.5000000
406                 24     2.0000000
407                 24     2.0000000
408                 15     1.2500000
409                 24     2.0000000
410                 12     1.0000000
411                 24     2.0000000
412                 33     2.7500000
413                 12     1.0000000
414                 10     0.8333333
415                 24     2.0000000
416                 36     3.0000000
417                 12     1.0000000
418                 18     1.5000000
419                 21     1.7500000
420                 18     1.5000000
421                 15     1.2500000
422                 12     1.0000000
423                 12     1.0000000
424                 21     1.7500000
425                 12     1.0000000
426                 18     1.5000000
427                 28     2.3333333
428                 18     1.5000000
429                  9     0.7500000
430                 18     1.5000000
431                  5     0.4166667
432                 24     2.0000000
433                  6     0.5000000
434                 24     2.0000000
435                  9     0.7500000
436                 12     1.0000000
437                  6     0.5000000
438                 24     2.0000000
439                 42     3.5000000
440                 12     1.0000000
441                 12     1.0000000
442                 12     1.0000000
443                 20     1.6666667
444                 12     1.0000000
445                 48     4.0000000
446                  9     0.7500000
447                 36     3.0000000
448                  7     0.5833333
449                 12     1.0000000
450                 15     1.2500000
451                 36     3.0000000
452                  6     0.5000000
453                 12     1.0000000
454                 24     2.0000000
455                 24     2.0000000
456                 24     2.0000000
457                 11     0.9166667
458                 12     1.0000000
459                  6     0.5000000
460                 18     1.5000000
461                 36     3.0000000
462                 15     1.2500000
463                 12     1.0000000
464                 12     1.0000000
465                 18     1.5000000
466                 24     2.0000000
467                 24     2.0000000
468                 48     4.0000000
469                 33     2.7500000
470                 24     2.0000000
471                 24     2.0000000
472                  6     0.5000000
473                  9     0.7500000
474                  6     0.5000000
475                 18     1.5000000
476                 18     1.5000000
477                 39     3.2500000
478                 24     2.0000000
479                 12     1.0000000
480                 15     1.2500000
481                 12     1.0000000
482                 24     2.0000000
483                 30     2.5000000
484                 15     1.2500000
485                 12     1.0000000
486                  6     0.5000000
487                 12     1.0000000
488                 24     2.0000000
489                 10     0.8333333
490                  6     0.5000000
491                 12     1.0000000
492                 27     2.2500000
493                  6     0.5000000
494                  6     0.5000000
495                 12     1.0000000
496                 24     2.0000000
497                 36     3.0000000
498                 24     2.0000000
499                 18     1.5000000
500                  6     0.5000000
501                 24     2.0000000
502                 36     3.0000000
503                  9     0.7500000
504                 24     2.0000000
505                 24     2.0000000
506                 10     0.8333333
507                 15     1.2500000
508                 15     1.2500000
509                 24     2.0000000
510                 39     3.2500000
511                 12     1.0000000
512                 36     3.0000000
513                 15     1.2500000
514                 12     1.0000000
515                 24     2.0000000
516                  6     0.5000000
517                  6     0.5000000
518                 36     3.0000000
519                  6     0.5000000
520                  6     0.5000000
521                 24     2.0000000
522                 18     1.5000000
523                 48     4.0000000
524                 24     2.0000000
525                 18     1.5000000
526                 26     2.1666667
527                 15     1.2500000
528                  4     0.3333333
529                 36     3.0000000
530                  6     0.5000000
531                 36     3.0000000
532                 15     1.2500000
533                 12     1.0000000
534                 24     2.0000000
535                 24     2.0000000
536                 21     1.7500000
537                  6     0.5000000
538                 18     1.5000000
539                 48     4.0000000
540                 18     1.5000000
541                 12     1.0000000
542                 24     2.0000000
543                 30     2.5000000
544                 18     1.5000000
545                 12     1.0000000
546                 24     2.0000000
547                 24     2.0000000
548                 24     2.0000000
549                 12     1.0000000
550                 48     4.0000000
551                 12     1.0000000
552                  6     0.5000000
553                 48     4.0000000
554                 12     1.0000000
555                  9     0.7500000
556                 12     1.0000000
557                 18     1.5000000
558                 21     1.7500000
559                 24     2.0000000
560                 18     1.5000000
561                 24     2.0000000
562                 24     2.0000000
563                  6     0.5000000
564                 36     3.0000000
565                 24     2.0000000
566                 24     2.0000000
567                 12     1.0000000
568                 24     2.0000000
569                 48     4.0000000
570                 48     4.0000000
571                 24     2.0000000
572                 30     2.5000000
573                 24     2.0000000
574                 15     1.2500000
575                  9     0.7500000
576                 15     1.2500000
577                 12     1.0000000
578                 24     2.0000000
579                 36     3.0000000
580                 24     2.0000000
581                 18     1.5000000
582                 12     1.0000000
583                  9     0.7500000
584                 36     3.0000000
585                 12     1.0000000
586                 18     1.5000000
587                  9     0.7500000
588                 12     1.0000000
589                 18     1.5000000
590                 12     1.0000000
591                 12     1.0000000
592                 24     2.0000000
593                 21     1.7500000
594                 24     2.0000000
595                 24     2.0000000
596                  6     0.5000000
597                 24     2.0000000
598                 24     2.0000000
599                 18     1.5000000
600                 24     2.0000000
601                  7     0.5833333
602                  9     0.7500000
603                 24     2.0000000
604                 36     3.0000000
605                 10     0.8333333
606                 24     2.0000000
607                 24     2.0000000
608                 36     3.0000000
609                 18     1.5000000
610                 15     1.2500000
611                 12     1.0000000
612                 10     0.8333333
613                 21     1.7500000
614                 24     2.0000000
615                 18     1.5000000
616                 48     4.0000000
617                 60     5.0000000
618                  6     0.5000000
619                 30     2.5000000
620                 12     1.0000000
621                 21     1.7500000
622                 18     1.5000000
623                 48     4.0000000
624                 12     1.0000000
625                 18     1.5000000
626                 15     1.2500000
627                  6     0.5000000
628                  9     0.7500000
629                 42     3.5000000
630                  9     0.7500000
631                 24     2.0000000
632                 18     1.5000000
633                 15     1.2500000
634                  9     0.7500000
635                 24     2.0000000
636                 12     1.0000000
637                 24     2.0000000
638                 60     5.0000000
639                 12     1.0000000
640                 42     3.5000000
641                 18     1.5000000
642                 15     1.2500000
643                 15     1.2500000
644                 24     2.0000000
645                 18     1.5000000
646                 36     3.0000000
647                 30     2.5000000
648                 12     1.0000000
649                 24     2.0000000
650                 12     1.0000000
651                 48     4.0000000
652                 12     1.0000000
653                 24     2.0000000
654                 36     3.0000000
655                 24     2.0000000
656                 14     1.1666667
657                 12     1.0000000
658                 48     4.0000000
659                 30     2.5000000
660                 18     1.5000000
661                 12     1.0000000
662                 12     1.0000000
663                 21     1.7500000
664                  6     0.5000000
665                  6     0.5000000
666                 24     2.0000000
667                 30     2.5000000
668                 48     4.0000000
669                 12     1.0000000
670                 30     2.5000000
671                 24     2.0000000
672                 36     3.0000000
673                 60     5.0000000
674                  6     0.5000000
675                 21     1.7500000
676                 30     2.5000000
677                 24     2.0000000
678                 72     6.0000000
679                 24     2.0000000
680                 18     1.5000000
681                  6     0.5000000
682                 12     1.0000000
683                 15     1.2500000
684                 24     2.0000000
685                 36     3.0000000
686                 60     5.0000000
687                 10     0.8333333
688                 36     3.0000000
689                  9     0.7500000
690                 12     1.0000000
691                 15     1.2500000
692                 15     1.2500000
693                 24     2.0000000
694                  6     0.5000000
695                 24     2.0000000
696                  6     0.5000000
697                 12     1.0000000
698                 12     1.0000000
699                 18     1.5000000
700                 15     1.2500000
701                 12     1.0000000
702                 48     4.0000000
703                 24     2.0000000
704                 30     2.5000000
705                 27     2.2500000
706                 15     1.2500000
707                 48     4.0000000
708                 12     1.0000000
709                  9     0.7500000
710                  9     0.7500000
711                 18     1.5000000
712                  6     0.5000000
713                 21     1.7500000
714                  9     0.7500000
715                 60     5.0000000
716                 30     2.5000000
717                 30     2.5000000
718                 18     1.5000000
719                 24     2.0000000
720                 20     1.6666667
721                  9     0.7500000
722                  6     0.5000000
723                 12     1.0000000
724                  9     0.7500000
725                 27     2.2500000
726                  6     0.5000000
727                 15     1.2500000
728                 18     1.5000000
729                 48     4.0000000
730                 24     2.0000000
731                 24     2.0000000
732                 24     2.0000000
733                  8     0.6666667
734                 24     2.0000000
735                  4     0.3333333
736                 36     3.0000000
737                 24     2.0000000
738                 18     1.5000000
739                  6     0.5000000
740                 30     2.5000000
741                 24     2.0000000
742                 10     0.8333333
743                 21     1.7500000
744                 24     2.0000000
745                 39     3.2500000
746                 13     1.0833333
747                 15     1.2500000
748                 12     1.0000000
749                 21     1.7500000
750                 15     1.2500000
751                  6     0.5000000
752                 18     1.5000000
753                 12     1.0000000
754                 30     2.5000000
755                 12     1.0000000
756                 24     2.0000000
757                  6     0.5000000
758                 15     1.2500000
759                 24     2.0000000
760                 12     1.0000000
761                 15     1.2500000
762                 18     1.5000000
763                 12     1.0000000
764                 21     1.7500000
765                 24     2.0000000
766                 12     1.0000000
767                 30     2.5000000
768                 10     0.8333333
769                 12     1.0000000
770                 12     1.0000000
771                 24     2.0000000
772                 36     3.0000000
773                 21     1.7500000
774                 24     2.0000000
775                 12     1.0000000
776                 24     2.0000000
777                 36     3.0000000
778                 18     1.5000000
779                 36     3.0000000
780                 18     1.5000000
781                 39     3.2500000
782                 24     2.0000000
783                 12     1.0000000
784                 12     1.0000000
785                 20     1.6666667
786                 18     1.5000000
787                 22     1.8333333
788                 48     4.0000000
789                 48     4.0000000
790                 40     3.3333333
791                 21     1.7500000
792                 24     2.0000000
793                  6     0.5000000
794                 24     2.0000000
795                 24     2.0000000
796                  9     0.7500000
797                 18     1.5000000
798                 12     1.0000000
799                 24     2.0000000
800                  9     0.7500000
801                 24     2.0000000
802                 18     1.5000000
803                 20     1.6666667
804                 12     1.0000000
805                 12     1.0000000
806                 36     3.0000000
807                  6     0.5000000
808                 12     1.0000000
809                 42     3.5000000
810                 15     1.2500000
811                  8     0.6666667
812                  6     0.5000000
813                 36     3.0000000
814                 48     4.0000000
815                 48     4.0000000
816                 36     3.0000000
817                  6     0.5000000
818                  6     0.5000000
819                 36     3.0000000
820                 18     1.5000000
821                 12     1.0000000
822                 12     1.0000000
823                 36     3.0000000
824                  8     0.6666667
825                 18     1.5000000
826                 21     1.7500000
827                 18     1.5000000
828                 18     1.5000000
829                 36     3.0000000
830                 48     4.0000000
831                 24     2.0000000
832                 18     1.5000000
833                 45     3.7500000
834                 24     2.0000000
835                 15     1.2500000
836                 12     1.0000000
837                 12     1.0000000
838                  4     0.3333333
839                 24     2.0000000
840                 24     2.0000000
841                 36     3.0000000
842                 21     1.7500000
843                 18     1.5000000
844                 24     2.0000000
845                 18     1.5000000
846                 21     1.7500000
847                 18     1.5000000
848                 24     2.0000000
849                  9     0.7500000
850                 12     1.0000000
851                 20     1.6666667
852                 24     2.0000000
853                 15     1.2500000
854                 18     1.5000000
855                 36     3.0000000
856                 24     2.0000000
857                 10     0.8333333
858                 15     1.2500000
859                 15     1.2500000
860                  9     0.7500000
861                 24     2.0000000
862                 18     1.5000000
863                 24     2.0000000
864                 27     2.2500000
865                 10     0.8333333
866                 15     1.2500000
867                 18     1.5000000
868                 12     1.0000000
869                 36     3.0000000
870                 12     1.0000000
871                 36     3.0000000
872                  6     0.5000000
873                 24     2.0000000
874                 15     1.2500000
875                 12     1.0000000
876                 11     0.9166667
877                 18     1.5000000
878                 36     3.0000000
879                  9     0.7500000
880                 30     2.5000000
881                 24     2.0000000
882                 24     2.0000000
883                 30     2.5000000
884                 18     1.5000000
885                 24     2.0000000
886                 12     1.0000000
887                 24     2.0000000
888                 48     4.0000000
889                 36     3.0000000
890                 28     2.3333333
891                 27     2.2500000
892                 15     1.2500000
893                 12     1.0000000
894                 36     3.0000000
895                 18     1.5000000
896                 36     3.0000000
897                 21     1.7500000
898                 12     1.0000000
899                 15     1.2500000
900                 18     1.5000000
901                 16     1.3333333
902                 20     1.6666667
903                 36     3.0000000
904                 15     1.2500000
905                 24     2.0000000
906                 12     1.0000000
907                 21     1.7500000
908                 36     3.0000000
909                 15     1.2500000
910                  9     0.7500000
911                 36     3.0000000
912                 24     2.0000000
913                 30     2.5000000
914                 11     0.9166667
915                 24     2.0000000
916                 48     4.0000000
917                 10     0.8333333
918                  6     0.5000000
919                 24     2.0000000
920                 24     2.0000000
921                 18     1.5000000
922                 48     4.0000000
923                  9     0.7500000
924                 12     1.0000000
925                 24     2.0000000
926                 12     1.0000000
927                 18     1.5000000
928                 48     4.0000000
929                 30     2.5000000
930                 12     1.0000000
931                 24     2.0000000
932                  9     0.7500000
933                  9     0.7500000
934                 12     1.0000000
935                 12     1.0000000
936                 30     2.5000000
937                  9     0.7500000
938                  6     0.5000000
939                 60     5.0000000
940                 24     2.0000000
941                 12     1.0000000
942                 10     0.8333333
943                 24     2.0000000
944                  4     0.3333333
945                 15     1.2500000
946                 48     4.0000000
947                 24     2.0000000
948                 12     1.0000000
949                 18     1.5000000
950                 24     2.0000000
951                 18     1.5000000
952                 36     3.0000000
953                 24     2.0000000
954                 36     3.0000000
955                 12     1.0000000
956                 24     2.0000000
957                 30     2.5000000
958                  9     0.7500000
959                 28     2.3333333
960                 24     2.0000000
961                  6     0.5000000
962                 21     1.7500000
963                 15     1.2500000
964                 24     2.0000000
965                  6     0.5000000
966                 30     2.5000000
967                 27     2.2500000
968                 15     1.2500000
969                 42     3.5000000
970                 11     0.9166667
971                 15     1.2500000
972                 24     2.0000000
973                 24     2.0000000
974                 60     5.0000000
975                 30     2.5000000
976                 24     2.0000000
977                  6     0.5000000
978                 18     1.5000000
979                 24     2.0000000
980                 15     1.2500000
981                 30     2.5000000
982                 48     4.0000000
983                 21     1.7500000
984                 36     3.0000000
985                 24     2.0000000
986                 15     1.2500000
987                 42     3.5000000
988                 13     1.0833333
989                 24     2.0000000
990                 24     2.0000000
991                 12     1.0000000
992                 15     1.2500000
993                 18     1.5000000
994                 36     3.0000000
995                 12     1.0000000
996                 12     1.0000000
997                 30     2.5000000
998                 12     1.0000000
999                 45     3.7500000
1000                45     3.7500000

mutate

  • base-R: construct loan default predictions
> loans$pred.Def1 <- ifelse(loans$Duration.in.month > 24 & loans$Credit.amount > 10000,
+                           "predBad", "predGood")
> loans$pred.Def <- ifelse(loans$Duration.in.month <= 24 & loans$Credit.amount < 2200,
+                          "predBad", loans$pred.Def1)
> head(loans[, c("Duration.in.month","Credit.amount","pred.Def1","pred.Def","Good.Loan")], 6)
  Duration.in.month Credit.amount pred.Def1 pred.Def Good.Loan
1                 6          1169  predGood  predBad  GoodLoan
2                48          5951  predGood predGood   BadLoan
3                12          2096  predGood  predBad  GoodLoan
4                42          7882  predGood predGood  GoodLoan
5                24          4870  predGood predGood   BadLoan
6                36          9055  predGood predGood  GoodLoan

How can we do this using dplyr?

mutate

  • dplyr: construct loan default predictions
> loans2 <- loans %>%
+             mutate(pred.Def1 = ifelse(Duration.in.month > 24 & Credit.amount >10000,
+                                       "predBad", "predGood"), 
+                    pred.Def = ifelse(Duration.in.month <= 24 & Credit.amount < 2200,
+                                      "predBad", pred.Def1)) %>%
+             select(Duration.in.month, Credit.amount, pred.Def1, pred.Def, Good.Loan)
> head(loans2, 10)
   Duration.in.month Credit.amount pred.Def1 pred.Def Good.Loan
1                  6          1169  predGood  predBad  GoodLoan
2                 48          5951  predGood predGood   BadLoan
3                 12          2096  predGood  predBad  GoodLoan
4                 42          7882  predGood predGood  GoodLoan
5                 24          4870  predGood predGood   BadLoan
6                 36          9055  predGood predGood  GoodLoan
7                 24          2835  predGood predGood  GoodLoan
8                 36          6948  predGood predGood  GoodLoan
9                 12          3059  predGood predGood  GoodLoan
10                30          5234  predGood predGood   BadLoan

summarize

  • Basic command: summarize(data, newvar1= fun1(var), newvar2= fun2(var))
  • summarize applies the fun to variable(s) and returns one value
    • takes a data frame with n rows and returns a data frame with 1 row
  • Here are the mean and sd for credit amount
> loans %>% summarize(mn = mean(Credit.amount), sd = sd(Credit.amount))
        mn       sd
1 3271.258 2822.737

summarize

  • proportion of all loans that defaulted that are less or equal to 24 months duration
  • base-R
> mean(loans$Duration.in.month[loans$Good.Loan == "BadLoan"] <= 24)
[1] 0.66
  • dplyr
> loans %>% filter(Good.Loan == "BadLoan") %>%
+   mutate(less24 = Duration.in.month <= 24) %>%
+   summarize(prop = mean(less24))
  prop
1 0.66

group_by

  • the most awesome dplyr feature is the group_by command
  • group_by(groupVar1, groupVar2) changes the analysis unit from the individuals to the groups defined by groupVars
    • followed by summarize, gives stats by groups
    • followed by mutate, applies function by group
  • can ungroup() to return “focus” back to entire data frame (instead of groups)

group_by

  • credit amounts by loan status
    • n() gives us the number of observations
> loans.summarize <- loans %>% group_by(Good.Loan) %>%
+              summarize(mn = mean(Credit.amount), sd = sd(Credit.amount), n = n())
> loans.summarize
# A tibble: 2 x 4
  Good.Loan    mn    sd     n
  <fct>     <dbl> <dbl> <int>
1 BadLoan   3938. 3536.   300
2 GoodLoan  2985. 2401.   700

Tibbles

  • Hadley has also created a new type of “opinionated” data frame called a tibble.
> class(loans.summarize)
[1] "tbl_df"     "tbl"        "data.frame"
> glimpse(loans.summarize)
Observations: 2
Variables: 4
$ Good.Loan <fct> BadLoan, GoodLoan
$ mn        <dbl> 3938.127, 2985.457
$ sd        <dbl> 3535.819, 2401.472
$ n         <int> 300, 700
  • the group_by command will always convert a base-R data.frame to a tibble data frame

Tibbles

  • why use tibbles?
    • smart view by calling data frame (only shows a data snapshot rather than entire data frame)
    • preserves object type, subsetting always yields another tibble
> loans.summarize[1,1]
# A tibble: 1 x 1
  Good.Loan
  <fct>    
1 BadLoan  

group_by

  • can include more than one grouping variable
> loans2 %>% group_by(Good.Loan, pred.Def) %>% print(n=2)
# A tibble: 1,000 x 5
# Groups:   Good.Loan, pred.Def [4]
  Duration.in.month Credit.amount pred.Def1 pred.Def Good.Loan
              <int>         <int> <chr>     <chr>    <fct>    
1                 6          1169 predGood  predBad  GoodLoan 
2                48          5951 predGood  predGood BadLoan  
# ... with 998 more rows

group_by

  • Accuracy rate = % of all loans that were correctly predicted to default or not
  • How can we use group_by to calculate the accuracy rate?
> loans2 %>% group_by(Good.Loan, pred.Def) %>%
+   summarize(n = n(), prop = n/nrow(.))
# A tibble: 4 x 4
# Groups:   Good.Loan [2]
  Good.Loan pred.Def     n  prop
  <fct>     <chr>    <int> <dbl>
1 BadLoan   predBad    137 0.137
2 BadLoan   predGood   163 0.163
3 GoodLoan  predBad    355 0.355
4 GoodLoan  predGood   345 0.345
  • The period . inserts the data frame used in that step
  • Accuracy rate: 13.7 + 34.5 = 48.2%

group_by

  • What if we want false positive and false negative rate?
    • false positive rate: % of good loans that were predicted to default
    • false negative rate: % of bad loans that were predicted to not default
  • Instead of dividing by overall n we need to divide by n within Good.Loan
  • additional group_by will replace previous groupings
> loans2 %>% group_by(Good.Loan, pred.Def) %>%  summarize(n = n()) %>%
+   group_by(Good.Loan) %>%
+   mutate(condProp = n/sum(n))
# A tibble: 4 x 4
# Groups:   Good.Loan [2]
  Good.Loan pred.Def     n condProp
  <fct>     <chr>    <int>    <dbl>
1 BadLoan   predBad    137    0.457
2 BadLoan   predGood   163    0.543
3 GoodLoan  predBad    355    0.507
4 GoodLoan  predGood   345    0.493
  • false positive rate: 50.7% of good loans were predicted to default
  • false negative rate: 54.3% of bad loans were predicted to not default

weather data

  • The MonthlyWeather.csv data (from Team Project 1) was constructed from over 130k+ daily weather records from the 113 airports.
  • Our first look at how to craft one dataset from another
> weather <- read.csv("https://raw.githubusercontent.com/mgelman/data/master/histWeather.csv")
> str(weather)
'data.frame':   130457 obs. of  24 variables:
 $ Date                     : Factor w/ 1159 levels "2014-07-01","2014-07-02",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Max_TemperatureF         : int  78 77 78 77 66 80 73 78 77 80 ...
 $ Mean_TemperatureF        : int  70 70 70 70 60 68 65 70 70 71 ...
 $ Min_TemperatureF         : int  62 64 62 62 53 57 57 62 62 60 ...
 $ Max_Dew_PointF           : int  70 70 72 70 66 59 64 72 68 64 ...
 $ MeanDew_PointF           : int  65 66 66 66 57 57 61 65 65 60 ...
 $ Min_DewpointF            : int  63 63 61 61 52 52 55 61 63 54 ...
 $ Max_Humidity             : int  100 100 100 100 100 94 100 100 100 100 ...
 $ Mean_Humidity            : int  89 91 95 92 90 65 87 91 88 74 ...
 $ Min_Humidity             : int  69 78 78 78 64 39 69 78 69 39 ...
 $ Max_Sea_Level_PressureIn : num  29.9 29.9 30 30 29.8 ...
 $ Mean_Sea_Level_PressureIn: num  29.8 29.9 29.9 29.9 29.7 ...
 $ Min_Sea_Level_PressureIn : num  29.8 29.8 29.9 29.8 29.5 ...
 $ Max_VisibilityMiles      : int  10 10 10 10 10 10 10 10 10 10 ...
 $ Mean_VisibilityMiles     : int  10 10 7 7 9 10 9 8 7 10 ...
 $ Min_VisibilityMiles      : int  10 6 0 0 4 10 5 1 0 5 ...
 $ Max_Wind_SpeedMPH        : int  16 22 14 13 25 16 20 15 20 15 ...
 $ Mean_Wind_SpeedMPH       : int  7 11 7 6 12 7 7 8 13 5 ...
 $ Max_Gust_SpeedMPH        : int  20 28 NA 21 39 24 24 24 25 17 ...
 $ PrecipitationIn          : Factor w/ 447 levels "","0","0.00",..: 2 2 2 29 319 2 4 4 2 6 ...
 $ CloudCover               : int  0 2 6 7 5 0 1 3 3 1 ...
 $ Events                   : Factor w/ 30 levels "","Fog","Fog-Rain",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ WindDirDegrees           : int  201 198 193 203 284 235 201 198 200 233 ...
 $ AirPtCd                  : Factor w/ 113 levels "KAAO","KACY",..: 14 14 14 14 14 14 14 14 14 14 ...

weather data

  • Let’s try a couple tasks that should help with the Team Project

    1. What are the coldest temperatures in the dataset?
    2. What is the minimum temperature by month and location?
    3. How many sunny days are there in each month and location?

Coldest temp in dataset?

  • How can we sort values in a DF?
  • One last command: arrange(data, var1) sorts from low to high by var1
    • descending order uses desc(var1)
> weather %>% select(AirPtCd, Min_TemperatureF, Date) %>% 
+   arrange(Min_TemperatureF) %>% head()
  AirPtCd Min_TemperatureF       Date
1    KCNM             -115 2016-11-05
2    KART              -37 2016-02-14
3    KART              -36 2015-02-16
4    KART              -32 2015-01-08
5    KART              -32 2015-02-13
6    KBIS              -31 2016-12-17

Min temp for month and location

  • We have date but not month or year
  • What dplyr command can we use to add month and year?
  • Let’s extract month and year from the Date variable
    • using lubridate package
> weather <- weather %>%
+   mutate(month = lubridate::month(Date, label = TRUE), year = lubridate::year(Date)) 
> str(weather %>% select(Date, month, year))
'data.frame':   130457 obs. of  3 variables:
 $ Date : Factor w/ 1159 levels "2014-07-01","2014-07-02",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ month: Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 7 7 7 7 7 7 7 7 7 7 ...
 $ year : num  2014 2014 2014 2014 2014 ...

Min temp for month and location

  • What commands should we use to get min temp by month and location?
> weather %>%
+   group_by(year, month, AirPtCd) %>%
+   summarize(minTmpF = min(Min_TemperatureF)) 
# A tibble: 4,381 x 4
# Groups:   year, month [39]
    year month AirPtCd minTmpF
   <dbl> <ord> <fct>     <dbl>
 1  2014 Jul   KAAO         54
 2  2014 Jul   KACY         56
 3  2014 Jul   KAGC         52
 4  2014 Jul   KALB         53
 5  2014 Jul   KAMA         58
 6  2014 Jul   KANJ         43
 7  2014 Jul   KART         48
 8  2014 Jul   KATL         59
 9  2014 Jul   KAVP         51
10  2014 Jul   KBAD         64
# ... with 4,371 more rows

Number of sunny days

  • CloudCover values of 0,1,2 indicate a sunny or mostly sunny day.
    • There are bogus entries with negative values (always check)
> summary(weather$CloudCover)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
-440.000    1.000    3.000    3.484    6.000    8.000     2497 
  • How should we handle these bogus entries?
    • Replace with NA values
> weather <- weather %>%
+   mutate(CloudCover = ifelse(CloudCover <0, NA, CloudCover))
> summary(weather$CloudCover)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   1.000   3.000   3.487   6.000   8.000    2498 

Number of sunny days

  • How can we compute the number of days with sunny/mostly sunny conditions by month and location?
> weather2 <- weather %>%
+   group_by(year, month, AirPtCd) %>%
+   summarize(numSunDay = sum(CloudCover %in% c(0,1,2)), numDays=n() ) 
> glimpse(weather2)
Observations: 4,381
Variables: 5
Groups: year, month [39]
$ year      <dbl> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014...
$ month     <ord> Jul, Jul, Jul, Jul, Jul, Jul, Jul, Jul, Jul, Jul, Ju...
$ AirPtCd   <fct> KAAO, KACY, KAGC, KALB, KAMA, KANJ, KART, KATL, KAVP...
$ numSunDay <int> 25, 18, 9, 6, 16, 11, 11, 3, 9, 22, 15, 22, 3, 16, 9...
$ numDays   <int> 30, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, ...

Number of sunny days

  • We should always plot the data as a sanity check
> ggplot(weather2, aes(x=month, y=numSunDay)) + geom_boxplot() 

  • What do you think the issue is?

Number of sunny days

  • filter to help find too high readings.
> weather2 %>% filter(numSunDay > 31) %>% arrange(desc(numSunDay))
# A tibble: 6 x 5
# Groups:   year, month [6]
   year month AirPtCd numSunDay numDays
  <dbl> <ord> <fct>       <int>   <int>
1  2016 Aug   KLWS           58      62
2  2016 Jul   KLWS           46      62
3  2016 Jun   KLWS           44      60
4  2016 Sep   KLWS           42      60
5  2016 Apr   KLWS           32      60
6  2016 May   KLWS           32      62
  • What is wrong?

Number of sunny days

  • Seems like it might be a duplicates issue
> dim(weather) - dim(distinct(weather))
[1] 366   0
  • Is it a widespread issue or just limited to one month, year, location?

Number of sunny days

  • Create a subset with only the dupes so we can investigate
> weather_dupes <- weather %>% group_by(Date,AirPtCd) %>% summarize(n=n()) %>% 
+                                    arrange(desc(n)) %>% filter(n>=2)
> weather_dupes
# A tibble: 366 x 3
# Groups:   Date [366]
   Date       AirPtCd     n
   <fct>      <fct>   <int>
 1 2016-01-01 KLWS        2
 2 2016-01-02 KLWS        2
 3 2016-01-03 KLWS        2
 4 2016-01-04 KLWS        2
 5 2016-01-05 KLWS        2
 6 2016-01-06 KLWS        2
 7 2016-01-07 KLWS        2
 8 2016-01-08 KLWS        2
 9 2016-01-09 KLWS        2
10 2016-01-10 KLWS        2
# ... with 356 more rows

Number of sunny days

  • How many locations are affected?
> weather_dupes %>% group_by(AirPtCd) %>% 
+   summarize(n=n()) %>% 
+   arrange(desc(n))
# A tibble: 1 x 2
  AirPtCd     n
  <fct>   <int>
1 KLWS      366
  • Comforting to know it’s just that one location

Number of sunny days

  • Take out duplicates and look for our outliers
> weather2 <- weather %>% distinct() %>%
+   group_by(year, month, AirPtCd) %>%
+   summarize(numSunDay = sum(CloudCover %in% c(0,1,2)), numDays=n() ) 
> weather2 %>% filter(numSunDay > 31) %>% nrow(.)
[1] 0

Number of sunny days

  • How do the Max values look now?
> ggplot(weather2, aes(x=month, y=numSunDay)) + geom_boxplot()