Day 10

Data intake

  • Data from the real world may not always come in the form you expect
  • The process of importing data itself may already lead to false conclusions
    • Data intake in R makes assumptions about how to parse raw data
  • We should always view the raw data before importing
    • For smaller files, spreadsheets are useful
    • For larger files, unix commands like head or more are useful

Factor vs String:

  • Recall: factors (coded internally as integer levels) take less storage than strings
  • Sometimes mostly numeric variables contain a few character entries.
  • What happens when we coerce into a numeric variable?

String into numeric:

  • Character string variable
> (x <- c("4", "9","2.1","oops") )
[1] "4"    "9"    "2.1"  "oops"
> typeof(x)
[1] "character"
  • Coerce into numeric with as.numeric
> (y <- as.numeric(x))
Warning: NAs introduced by coercion
[1] 4.0 9.0 2.1  NA
> typeof(y)
[1] "double"
  • Coercion correctly converts to numeric vector with NA for non-numeric

Factor into numeric:

  • Factor variable
> (x <- factor(c("4", "9","2.1","oops")) )
[1] 4    9    2.1  oops
Levels: 2.1 4 9 oops
> typeof(x)
[1] "integer"
  • Coerce into numeric with as.numeric
> (y <- as.numeric(x))
[1] 2 3 1 4
> typeof(y)
[1] "double"
  • Coercion is incorrect: returns integer level value, not variable values

Factor into numeric:

  • Can get correct coercion with readr::parse_number
> y <- parse_number(as.character(x))  # vector with "problem" attribute
Warning: 1 parsing failure.
row col expected actual
  4  -- a number   oops
> typeof(y)
[1] "double"
> as.vector(y)  # plain vector
[1] 4.0 9.0 2.1  NA
  • Coercion is now correct

read_csv vs. read.csv:

  • the base R read.csv default setting makes all columns with characters a factor:
    • default: stringsAsFactors=TRUE
> energy1 <- read.csv("https://raw.githubusercontent.com/mgelman/data/master/EnergyData1516.csv")
> glimpse(energy1[,c("Timestamp","Musser_Hall")])
Observations: 35,129
Variables: 2
$ Timestamp   <fct> 2015-09-01 00:00:00, 2015-09-01 00:15:00, 2015-09-...
$ Musser_Hall <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
  • BAD: Default reads Timestamp is a factor
    • levels will be ordered numerically/alphabetically not chronologically
  • GOOD: Musser Hall energy values are dbl (double)

read_csv vs. read.csv:

  • the readr package read_csv default setting makes all columns with characters a character:
  • Default: it uses the first 1000 rows to guess variable types (guess_max)
> energy2 <- read_csv("https://raw.githubusercontent.com/mgelman/data/master/EnergyData1516.csv")
> glimpse(energy2[,c("Timestamp","Musser_Hall")])
Observations: 35,129
Variables: 2
$ Timestamp   <dttm> 2015-09-01 00:00:00, 2015-09-01 00:15:00, 2015-09...
$ Musser_Hall <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
  • GOOD: Default reads Timestamp is a date/time type called POSIXct
  • BAD: Musser Hall energy values are logical

read_csv vs. read.csv:

  • Summaries for energy1 and energy2 are not the same!
> summary(energy1$Musser_Hall)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   83.3   262.4   288.0   305.5   328.0 84953.2   22669 
> summary(energy2$Musser_Hall)
   Mode    NA's 
logical   35129 
  • The first 22,243 rows in energy1 are NA before they become numerical
> energy1[22238:22245,"Musser_Hall"]
[1]    NA    NA    NA    NA 420.9 398.0 370.2 327.4

read_csv vs. read.csv:

  • If read_csv guesses column type wrong you get a warning message:
> problems(energy2)
# A tibble: 12,460 x 5
     row col      expected      actual file                                
   <int> <chr>    <chr>         <chr>  <chr>                               
 1 22242 Musser_~ 1/0/T/F/TRUE~ 420.9  'https://raw.githubusercontent.com/~
 2 22243 Musser_~ 1/0/T/F/TRUE~ 398    'https://raw.githubusercontent.com/~
 3 22244 Musser_~ 1/0/T/F/TRUE~ 370.2  'https://raw.githubusercontent.com/~
 4 22245 Musser_~ 1/0/T/F/TRUE~ 327.4  'https://raw.githubusercontent.com/~
 5 22246 Musser_~ 1/0/T/F/TRUE~ 322.6  'https://raw.githubusercontent.com/~
 6 22247 Musser_~ 1/0/T/F/TRUE~ 357.4  'https://raw.githubusercontent.com/~
 7 22248 Musser_~ 1/0/T/F/TRUE~ 378.3  'https://raw.githubusercontent.com/~
 8 22249 Musser_~ 1/0/T/F/TRUE~ 379.8  'https://raw.githubusercontent.com/~
 9 22250 Musser_~ 1/0/T/F/TRUE~ 367.6  'https://raw.githubusercontent.com/~
10 22251 Musser_~ 1/0/T/F/TRUE~ 355.3  'https://raw.githubusercontent.com/~
# ... with 12,450 more rows

read_csv vs. read.csv:

  • What happened in row 22,242?
  • Value from read.csv:
> energy1$Musser_Hall[22242]
[1] 420.9
  • Value from read_csv:
> energy2$Musser_Hall[22242]
[1] NA

read_csv vs. read.csv:

  • read_csv guesses data type for each column
    • then uses parse functions to coerce the entire column
    • parse failures result in an NA entry
> energy1$Musser_Hall[22242]
[1] 420.9
> parse_logical(as.character(energy1$Musser_Hall[22242]))
Warning: 1 parsing failure.
row col           expected actual
  1  -- 1/0/T/F/TRUE/FALSE  420.9
[1] NA
attr(,"problems")
# A tibble: 1 x 4
    row   col expected           actual
  <int> <int> <chr>              <chr> 
1     1    NA 1/0/T/F/TRUE/FALSE 420.9 

read_csv vs. read.csv:

  • Solution:
    • check for warnings, define column types if needed
    • or increase guess_max value

read_csv vs. read.csv:

  • Timestamp is a datetime variable
  • dayWeek is a factor with ordered levels
  • all other read in as double
> energy <- read_csv("https://raw.githubusercontent.com/mgelman/data/master/EnergyData1516.csv", 
+                    col_type = cols(.default = col_double(), 
+                           Timestamp = col_datetime(format = ""),
+                           dayWeek = col_factor(
+                               levels=c("Mon","Tues","Wed","Thurs","Fri","Sat","Sun"))))

read_csv vs. read.csv:

  • No warning messages in problems
  • Check again:
> glimpse(energy[,c("Timestamp","Musser_Hall")])
Observations: 35,129
Variables: 2
$ Timestamp   <dttm> 2015-09-01 00:00:00, 2015-09-01 00:15:00, 2015-09...
$ Musser_Hall <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
  • Row 22,242 is correct:
> energy$Musser_Hall[22242]
[1] 420.9

Dates and Times:

  • R has built-in functions to handle dates and times
  • Stores dates, times, or date-times as either:
    • POSIXct type: time in seconds since start of 1970
    • POSIXlt type: list of vector of time/date units
  • The lubridate package makes dates and times manipulations a little easier

lubridate: extract info

  • Let’s look at the 5th Timestamp in the energy data:
> ( stamp5 <- energy$Timestamp[5] )
[1] "2015-09-01 01:00:00 UTC"
> str(stamp5)
 POSIXct[1:1], format: "2015-09-01 01:00:00"
  • What day of the week and day of the year was it?
> wday(stamp5, label=TRUE)
[1] Tue
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
> yday(stamp5)
[1] 244

lubridate: extract info

  • We can also extract date and time info:
> stamp5
[1] "2015-09-01 01:00:00 UTC"
> year(stamp5)
[1] 2015
> month(stamp5)
[1] 9
> day(stamp5)
[1] 1
> hour(stamp5)
[1] 1
> minute(stamp5)
[1] 0

lubridate: create dates and times

  • There are many variations of commands like:
> (mydate <- mdy("1/24/2018") )
[1] "2018-01-24"
> ymd("2018-01-24")
[1] "2018-01-24"
> ymd_hm("2018-01-24 13:32")
[1] "2018-01-24 13:32:00 UTC"
  • date part: y=year, m=month, d=day
  • time part: h=hour, m=minute, s=second

lubridate: create dates and times

  • Also make_date or make_datetime
> energy %>% mutate(datetime = make_datetime(year,month,dayOfMonth,timeHour,timeMinute)) %>%
+   select(Timestamp, datetime, year,month,dayOfMonth,timeHour,timeMinute)
# A tibble: 35,129 x 7
   Timestamp           datetime             year month dayOfMonth timeHour
   <dttm>              <dttm>              <dbl> <dbl>      <dbl>    <dbl>
 1 2015-09-01 00:00:00 2015-09-01 00:00:00  2015     9          1        0
 2 2015-09-01 00:15:00 2015-09-01 00:15:00  2015     9          1        0
 3 2015-09-01 00:30:00 2015-09-01 00:30:00  2015     9          1        0
 4 2015-09-01 00:45:00 2015-09-01 00:45:00  2015     9          1        0
 5 2015-09-01 01:00:00 2015-09-01 01:00:00  2015     9          1        1
 6 2015-09-01 01:15:00 2015-09-01 01:15:00  2015     9          1        1
 7 2015-09-01 01:30:00 2015-09-01 01:30:00  2015     9          1        1
 8 2015-09-01 01:45:00 2015-09-01 01:45:00  2015     9          1        1
 9 2015-09-01 02:00:00 2015-09-01 02:00:00  2015     9          1        2
10 2015-09-01 02:15:00 2015-09-01 02:15:00  2015     9          1        2
# ... with 35,119 more rows, and 1 more variable: timeMinute <dbl>

lubridate: updating dates and times

  • Can modify an existing date with update
> mydate
[1] "2018-01-24"
> update(mydate, year = 2020)
[1] "2020-01-24"
> update(mydate, hour=12, minute = 23)
[1] "2018-01-24 12:23:00 UTC"

lubridate: measuring time

  • Suppose we want to see how much time has elapsed since the start of the term:
> firstDay<- mdy("9/3/2019")
> firstDay
[1] "2019-09-03"
  • We can get a date/time with now() and a date with today()
> today()
[1] "2019-10-03"
> now()
[1] "2019-10-03 13:02:03 PDT"

lubridate: measuring time

  • interval creates an “interval” class object that defines an interval of time (tied to specific dates)
> interval(firstDay, today())  
[1] 2019-09-03 UTC--2019-10-03 UTC
  • duration specifies a length of time (not tied to specific dates)
> dminutes(2)
[1] "120s (~2 minutes)"
> dhours(1)
[1] "3600s (~1 hours)"
> ddays(1)
[1] "86400s (~1 days)"

lubridate: measuring time

  • We use an interval and duration to compute a length of time
  • How many days so far this term?
> interval(firstDay, today()) / ddays(1) 
[1] 30
  • How many weeks? minutes?
> interval(firstDay, today()) / dweeks(1)
[1] 4.285714
> interval(firstDay, today()) / dminutes(1)
[1] 43200

lubridate: measuring time

  • Can also use date/time variables:
  • Difference between first two time stamps:
> energy$Timestamp[1:2]
[1] "2015-09-01 00:00:00 UTC" "2015-09-01 00:15:00 UTC"
> interval(energy$Timestamp[1], energy$Timestamp[2])/dminutes(1)
[1] 15
  • 15 minutes between readings 1 and 2