Data intake and Dates

Day 10

Data intake

Data from the real world may not always come in the form you expect
The process of importing data itself may already lead to false conclusions
- Data intake in R makes assumptions about how to parse raw data
We should always view the raw data before importing
- For smaller files, spreadsheets are useful
- For larger files, unix commands like head or more are useful

Factor vs String:

Recall: factors (coded internally as integer levels) take less storage than strings
Sometimes mostly numeric variables contain a few character entries.
What happens when we coerce into a numeric variable?

String into numeric:

Character string variable

> (x <- c("4", "9","2.1","oops") )
[1] "4"    "9"    "2.1"  "oops"
> typeof(x)
[1] "character"

Coerce into numeric with as.numeric

> (y <- as.numeric(x))
Warning: NAs introduced by coercion
[1] 4.0 9.0 2.1  NA
> typeof(y)
[1] "double"

Coercion correctly converts to numeric vector with NA for non-numeric

Factor into numeric:

Factor variable

> (x <- factor(c("4", "9","2.1","oops")) )
[1] 4    9    2.1  oops
Levels: 2.1 4 9 oops
> typeof(x)
[1] "integer"

Coerce into numeric with as.numeric

> (y <- as.numeric(x))
[1] 2 3 1 4
> typeof(y)
[1] "double"

Coercion is incorrect: returns integer level value, not variable values

Factor into numeric:

Can get correct coercion with readr::parse_number

> y <- parse_number(as.character(x))  # vector with "problem" attribute
Warning: 1 parsing failure.
row col expected actual
  4  -- a number   oops
> typeof(y)
[1] "double"
> as.vector(y)  # plain vector
[1] 4.0 9.0 2.1  NA

Coercion is now correct

`read_csv` vs. `read.csv`:

the base R read.csv default setting makes all columns with characters a factor:
- default: stringsAsFactors=TRUE

> energy1 <- read.csv("https://raw.githubusercontent.com/mgelman/data/master/EnergyData1516.csv")
> glimpse(energy1[,c("Timestamp","Musser_Hall")])
Observations: 35,129
Variables: 2
$ Timestamp   <fct> 2015-09-01 00:00:00, 2015-09-01 00:15:00, 2015-09-...
$ Musser_Hall <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

BAD: Default reads Timestamp is a factor
- levels will be ordered numerically/alphabetically not chronologically
GOOD: Musser Hall energy values are dbl (double)

`read_csv` vs. `read.csv`:

the readr package read_csv default setting makes all columns with characters a character:
Default: it uses the first 1000 rows to guess variable types (guess_max)

> energy2 <- read_csv("https://raw.githubusercontent.com/mgelman/data/master/EnergyData1516.csv")
> glimpse(energy2[,c("Timestamp","Musser_Hall")])
Observations: 35,129
Variables: 2
$ Timestamp   <dttm> 2015-09-01 00:00:00, 2015-09-01 00:15:00, 2015-09...
$ Musser_Hall <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

GOOD: Default reads Timestamp is a date/time type called POSIXct
BAD: Musser Hall energy values are logical

`read_csv` vs. `read.csv`:

Summaries for energy1 and energy2 are not the same!

> summary(energy1$Musser_Hall)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   83.3   262.4   288.0   305.5   328.0 84953.2   22669 
> summary(energy2$Musser_Hall)
   Mode    NA's 
logical   35129

The first 22,243 rows in energy1 are NA before they become numerical

> energy1[22238:22245,"Musser_Hall"]
[1]    NA    NA    NA    NA 420.9 398.0 370.2 327.4

`read_csv` vs. `read.csv`:

If read_csv guesses column type wrong you get a warning message:

> problems(energy2)
# A tibble: 12,460 x 5
     row col      expected      actual file                                
   <int> <chr>    <chr>         <chr>  <chr>                               
 1 22242 Musser_~ 1/0/T/F/TRUE~ 420.9  'https://raw.githubusercontent.com/~
 2 22243 Musser_~ 1/0/T/F/TRUE~ 398    'https://raw.githubusercontent.com/~
 3 22244 Musser_~ 1/0/T/F/TRUE~ 370.2  'https://raw.githubusercontent.com/~
 4 22245 Musser_~ 1/0/T/F/TRUE~ 327.4  'https://raw.githubusercontent.com/~
 5 22246 Musser_~ 1/0/T/F/TRUE~ 322.6  'https://raw.githubusercontent.com/~
 6 22247 Musser_~ 1/0/T/F/TRUE~ 357.4  'https://raw.githubusercontent.com/~
 7 22248 Musser_~ 1/0/T/F/TRUE~ 378.3  'https://raw.githubusercontent.com/~
 8 22249 Musser_~ 1/0/T/F/TRUE~ 379.8  'https://raw.githubusercontent.com/~
 9 22250 Musser_~ 1/0/T/F/TRUE~ 367.6  'https://raw.githubusercontent.com/~
10 22251 Musser_~ 1/0/T/F/TRUE~ 355.3  'https://raw.githubusercontent.com/~
# ... with 12,450 more rows

`read_csv` vs. `read.csv`:

What happened in row 22,242?
Value from read.csv:

> energy1$Musser_Hall[22242]
[1] 420.9

Value from read_csv:

> energy2$Musser_Hall[22242]
[1] NA

`read_csv` vs. `read.csv`:

read_csv guesses data type for each column
- then uses parse functions to coerce the entire column
- parse failures result in an NA entry

> energy1$Musser_Hall[22242]
[1] 420.9
> parse_logical(as.character(energy1$Musser_Hall[22242]))
Warning: 1 parsing failure.
row col           expected actual
  1  -- 1/0/T/F/TRUE/FALSE  420.9
[1] NA
attr(,"problems")
# A tibble: 1 x 4
    row   col expected           actual
  <int> <int> <chr>              <chr> 
1     1    NA 1/0/T/F/TRUE/FALSE 420.9

`read_csv` vs. `read.csv`:

Solution:
- check for warnings, define column types if needed
- or increase guess_max value

`read_csv` vs. `read.csv`:

Timestamp is a datetime variable
dayWeek is a factor with ordered levels
all other read in as double

> energy <- read_csv("https://raw.githubusercontent.com/mgelman/data/master/EnergyData1516.csv", 
+                    col_type = cols(.default = col_double(), 
+                           Timestamp = col_datetime(format = ""),
+                           dayWeek = col_factor(
+                               levels=c("Mon","Tues","Wed","Thurs","Fri","Sat","Sun"))))

`read_csv` vs. `read.csv`:

No warning messages in problems
Check again:

> glimpse(energy[,c("Timestamp","Musser_Hall")])
Observations: 35,129
Variables: 2
$ Timestamp   <dttm> 2015-09-01 00:00:00, 2015-09-01 00:15:00, 2015-09...
$ Musser_Hall <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

Row 22,242 is correct:

> energy$Musser_Hall[22242]
[1] 420.9

Dates and Times:

R has built-in functions to handle dates and times
Stores dates, times, or date-times as either:
- POSIXct type: time in seconds since start of 1970
- POSIXlt type: list of vector of time/date units
The lubridate package makes dates and times manipulations a little easier

`lubridate`: extract info

Let’s look at the 5th Timestamp in the energy data:

> ( stamp5 <- energy$Timestamp[5] )
[1] "2015-09-01 01:00:00 UTC"
> str(stamp5)
 POSIXct[1:1], format: "2015-09-01 01:00:00"

What day of the week and day of the year was it?

> wday(stamp5, label=TRUE)
[1] Tue
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
> yday(stamp5)
[1] 244

`lubridate`: extract info

We can also extract date and time info:

> stamp5
[1] "2015-09-01 01:00:00 UTC"
> year(stamp5)
[1] 2015
> month(stamp5)
[1] 9
> day(stamp5)
[1] 1
> hour(stamp5)
[1] 1
> minute(stamp5)
[1] 0

`lubridate`: create dates and times

There are many variations of commands like:

> (mydate <- mdy("1/24/2018") )
[1] "2018-01-24"
> ymd("2018-01-24")
[1] "2018-01-24"
> ymd_hm("2018-01-24 13:32")
[1] "2018-01-24 13:32:00 UTC"

date part: y=year, m=month, d=day
time part: h=hour, m=minute, s=second

`lubridate`: create dates and times

Also make_date or make_datetime

> energy %>% mutate(datetime = make_datetime(year,month,dayOfMonth,timeHour,timeMinute)) %>%
+   select(Timestamp, datetime, year,month,dayOfMonth,timeHour,timeMinute)
# A tibble: 35,129 x 7
   Timestamp           datetime             year month dayOfMonth timeHour
   <dttm>              <dttm>              <dbl> <dbl>      <dbl>    <dbl>
 1 2015-09-01 00:00:00 2015-09-01 00:00:00  2015     9          1        0
 2 2015-09-01 00:15:00 2015-09-01 00:15:00  2015     9          1        0
 3 2015-09-01 00:30:00 2015-09-01 00:30:00  2015     9          1        0
 4 2015-09-01 00:45:00 2015-09-01 00:45:00  2015     9          1        0
 5 2015-09-01 01:00:00 2015-09-01 01:00:00  2015     9          1        1
 6 2015-09-01 01:15:00 2015-09-01 01:15:00  2015     9          1        1
 7 2015-09-01 01:30:00 2015-09-01 01:30:00  2015     9          1        1
 8 2015-09-01 01:45:00 2015-09-01 01:45:00  2015     9          1        1
 9 2015-09-01 02:00:00 2015-09-01 02:00:00  2015     9          1        2
10 2015-09-01 02:15:00 2015-09-01 02:15:00  2015     9          1        2
# ... with 35,119 more rows, and 1 more variable: timeMinute <dbl>

`lubridate`: updating dates and times

Can modify an existing date with update

> mydate
[1] "2018-01-24"
> update(mydate, year = 2020)
[1] "2020-01-24"
> update(mydate, hour=12, minute = 23)
[1] "2018-01-24 12:23:00 UTC"

`lubridate`: measuring time

Suppose we want to see how much time has elapsed since the start of the term:

> firstDay<- mdy("9/3/2019")
> firstDay
[1] "2019-09-03"

We can get a date/time with now() and a date with today()

> today()
[1] "2019-10-03"
> now()
[1] "2019-10-03 13:02:03 PDT"

`lubridate`: measuring time

interval creates an “interval” class object that defines an interval of time (tied to specific dates)

> interval(firstDay, today())  
[1] 2019-09-03 UTC--2019-10-03 UTC

duration specifies a length of time (not tied to specific dates)

> dminutes(2)
[1] "120s (~2 minutes)"
> dhours(1)
[1] "3600s (~1 hours)"
> ddays(1)
[1] "86400s (~1 days)"

`lubridate`: measuring time

We use an interval and duration to compute a length of time
How many days so far this term?

> interval(firstDay, today()) / ddays(1) 
[1] 30

How many weeks? minutes?

> interval(firstDay, today()) / dweeks(1)
[1] 4.285714
> interval(firstDay, today()) / dminutes(1)
[1] 43200

`lubridate`: measuring time

Can also use date/time variables:
Difference between first two time stamps:

> energy$Timestamp[1:2]
[1] "2015-09-01 00:00:00 UTC" "2015-09-01 00:15:00 UTC"
> interval(energy$Timestamp[1], energy$Timestamp[2])/dminutes(1)
[1] 15

15 minutes between readings 1 and 2

Data intake

Factor vs String:

String into numeric:

Factor into numeric:

Factor into numeric:

read_csv vs. read.csv:

read_csv vs. read.csv:

read_csv vs. read.csv:

read_csv vs. read.csv:

read_csv vs. read.csv:

read_csv vs. read.csv:

read_csv vs. read.csv:

read_csv vs. read.csv:

read_csv vs. read.csv:

Dates and Times:

lubridate: extract info

lubridate: extract info

lubridate: create dates and times

lubridate: create dates and times

lubridate: updating dates and times

lubridate: measuring time

lubridate: measuring time

lubridate: measuring time

lubridate: measuring time

`read_csv` vs. `read.csv`:

`read_csv` vs. `read.csv`:

`read_csv` vs. `read.csv`:

`read_csv` vs. `read.csv`:

`read_csv` vs. `read.csv`:

`read_csv` vs. `read.csv`:

`read_csv` vs. `read.csv`:

`read_csv` vs. `read.csv`:

`read_csv` vs. `read.csv`:

`lubridate`: extract info

`lubridate`: extract info

`lubridate`: create dates and times

`lubridate`: create dates and times

`lubridate`: updating dates and times

`lubridate`: measuring time

`lubridate`: measuring time

`lubridate`: measuring time

`lubridate`: measuring time