R: nuts and bolts

Day 3

Where do things live?

Working Directory

Directory where R first looks for files.
- If you run read.csv("mydata.csv") then mydata.csv should be in your working directory
Check current location (your location will be different):

> getwd()
[1] "C:/Users/mgelman/Dropbox/econ122-f19/docs"

R Markdown: When you knit a .Rmd file, the working directory for the compilation of the document is, by default, the directory where the .Rmd file is located!
- This may be different from the working directory of your current RStudio session.
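
A minimal sketch of the working-directory idea, assuming a throwaway file written to a temporary folder (the folder and the file contents are illustrative, not class materials): read.csv finds a bare file name only when the file sits in the working directory.

```r
old.wd <- getwd()                  # remember the current working directory
setwd(tempdir())                   # switch to a temporary folder (illustrative choice)

# Write a small file here, then read it back using only its name
write.csv(data.frame(a = 1:3), "mydata.csv", row.names = FALSE)
found <- file.exists("mydata.csv") # TRUE: the file is in the working directory
mydata <- read.csv("mydata.csv")

setwd(old.wd)                      # restore the original working directory
```

After the last line, read.csv("mydata.csv") would typically fail again, because R is no longer looking in the temporary folder.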

Where do things live?

Workspace

This is the “environment” where R objects live.
- See the Environment tab in RStudio
Check the contents of my (or your) environment:

> ls()
character(0)

(If it says character(0) then your workspace is empty!)

R Markdown: When you knit a .Rmd file, a completely new and separate workspace is created.
- Why do you think this is the case?
- An R chunk will throw an error if it references an object that exists in your interactive workspace but is never created or loaded by a chunk in the .Rmd file
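
A small sketch of what the empty knitting workspace means in practice. The object name my.vec is hypothetical, and rm() is used here only to mimic the fresh session that knitting starts with.

```r
my.vec <- c(8, 2, 1, 3)            # hypothetical object created in the console
in.before <- "my.vec" %in% ls()    # TRUE: the object is in the workspace
rm(my.vec)                         # remove it, as if starting from a fresh session
in.after <- "my.vec" %in% ls()     # FALSE: any code that uses my.vec would now fail
```

The fix in a .Rmd file is to create or load every object inside a chunk of that file.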

Where do things live?

RStudio Projects

A project sets your working directory to the folder the project lives in
- easy way to set your working directory
- can save and reload your workspace (environment)
Projects can be connected to GitHub using RStudio’s GUI
Highly recommended: create a project for this class in the folder you are using to store class work.
- see Hadley’s page on projects

Objects in R

Anything created or imported into R is called an object
- vectors, data frames, matrices, lists, functions, lm, …
We usually store objects in the workspace using the assignment operator <-

> x <- c(8,2,1,3)
> ls()
[1] "x"

The = operator also does assignment, but it is mainly used for argument specification inside a function.

> y <- rnorm(3, mean=10, sd=2)
> ls()
[1] "x" "y"

Please don’t use = for variable assignment!!!
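
A short sketch of why the two operators are not interchangeable inside a function call. The values are made up for illustration; mean's first argument really is named x.

```r
x <- c(8, 2, 1, 3)

# Inside a call, = only matches the argument named x; the workspace x is untouched
m1 <- mean(x = c(10, 20))
x.after.equals <- x                # still 8 2 1 3

# Inside a call, <- performs an assignment first, overwriting the workspace x
m2 <- mean(x <- c(10, 20))
x.after.arrow <- x                 # now 10 20

x <- c(8, 2, 1, 3)                 # restore x for the later examples
```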

Data structures and types

Shape and computer storage

Data types

A data type determines how R stores a piece of information in memory

Important data types
- Logical: TRUE and FALSE are the only values (besides the missing value NA)
- Numeric: integer and double
- Character: strings of text, written in quotes ("")

> typeof(x)  # type of storage mode 
[1] "double"
> class(x)   # object class is numeric
[1] "numeric"
> typeof("abc")
[1] "character"
> x == 1
[1] FALSE FALSE  TRUE FALSE
> typeof(x == 1)
[1] "logical"
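
To make the integer/double distinction concrete, here is a small sketch using only base R; the particular values are arbitrary.

```r
t.double  <- typeof(1)     # "double": plain numbers are stored as doubles
t.integer <- typeof(1L)    # "integer": the L suffix requests an integer
t.seq     <- typeof(1:3)   # "integer": the : operator produces integers here
both.numeric <- is.numeric(1) && is.numeric(1L)   # TRUE: both count as numeric
```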

Vectors

Shape of an object

R uses two types of vectors to store info
- atomic vectors: all entries have the same data type
- lists: entries can contain other objects that can differ in data type
All vectors have a length

> x
[1] 8 2 1 3
> length(x)
[1] 4
> x.list<- list(x,1,"a")
> length(x.list)
[1] 3
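
A quick sketch for checking which kind of vector you have, using the same x and x.list as above (redefined here so the block runs on its own).

```r
x <- c(8, 2, 1, 3)
x.list <- list(x, 1, "a")

x.is.atomic <- is.atomic(x)              # TRUE: every entry is a double
list.is.list <- is.list(x.list)          # TRUE
entry.types <- sapply(x.list, typeof)    # "double" "double" "character"
entry.lengths <- lengths(x.list)         # 4 1 1: the first entry is itself a vector
```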

Atomic Vectors: Matrices

You can add attributes, such as dimension, to vectors
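A matrix is a 2-dimensional vector containing entries of the same type (in R 4.0.0 and later, class() of a matrix returns both "matrix" and "array"; earlier versions return only "matrix")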

> x
[1] 8 2 1 3
> x.mat <- matrix(x, nrow=2, byrow=TRUE)
> x.mat
     [,1] [,2]
[1,]    8    2
[2,]    1    3
> class(x.mat)
[1] "matrix" "array"
> str(x.mat)
 num [1:2, 1:2] 8 1 2 3
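
A sketch of the "dimension is just an attribute" idea: assigning dim to a plain vector turns it into a matrix. The name v is illustrative. Note that the fill order differs from the byrow=TRUE example above.

```r
v <- c(8, 2, 1, 3)
dim(v) <- c(2, 2)          # add a dimension attribute: v is now a 2 x 2 matrix
v.is.matrix <- is.matrix(v)   # TRUE
v.entry <- v[1, 2]         # 1, because entries are filled column by column

by.row <- matrix(c(8, 2, 1, 3), nrow = 2, byrow = TRUE)
by.row.entry <- by.row[1, 2]  # 2, because this matrix was filled row by row
```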

Atomic Vectors: Matrices

or you can bind vectors of the same length to create columns or rows:

> x.mat2 <- cbind(x,2*x)
> x.mat2
     x   
[1,] 8 16
[2,] 2  4
[3,] 1  2
[4,] 3  6
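
As a companion sketch, rbind stacks the same vectors as rows, and naming an argument in cbind fills in the blank column name seen above. The name double.x is just a label chosen for illustration.

```r
x <- c(8, 2, 1, 3)

x.rows <- rbind(x, 2 * x)                # vectors become rows
rows.dim <- dim(x.rows)                  # 2 4

x.named <- cbind(x, double.x = 2 * x)    # name the second column explicitly
named.cols <- colnames(x.named)          # "x" "double.x"
```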

Lists: Data frames

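A data frame is a list of atomic vectors of the same length, but not necessarily of the same data type
The loans data frame has columns that are integer and factor types (in R 4.0.0 and later, read.csv keeps text columns as character unless you set stringsAsFactors = TRUE)

> loans <- read.csv("https://raw.githubusercontent.com/mgelman/data/master/CreditData.csv", stringsAsFactors = TRUE)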
> class(loans)
[1] "data.frame"
> typeof(loans)
[1] "list"
> str(loans)
'data.frame':   1000 obs. of  21 variables:
 $ Status.of.existing.checking.account                     : Factor w/ 4 levels "... < 0 DM","... >= 200 DM / salary assignments for at least 1 year",..: 1 3 4 1 1 4 4 3 4 3 ...
 $ Duration.in.month                                       : int  6 48 12 42 24 36 24 36 12 30 ...
 $ Credit.history                                          : Factor w/ 5 levels "all credits at this bank paid back duly",..: 2 4 2 4 3 4 4 4 4 2 ...
 $ Purpose                                                 : Factor w/ 10 levels "business","car (new)",..: 8 8 5 6 2 5 6 3 8 2 ...
 $ Credit.amount                                           : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ Savings.account.bonds                                   : Factor w/ 5 levels ".. >= 1000 DM",..: 5 2 2 2 2 5 4 2 1 2 ...
 $ Present.employment.since                                : Factor w/ 5 levels ".. >= 7 years",..: 1 3 4 4 3 3 1 3 4 5 ...
 $ Installment.rate.in.percentage.of.disposable.income     : int  4 2 2 2 3 2 3 2 2 4 ...
 $ Personal.status.and.sex                                 : Factor w/ 4 levels "female : divorced/separated/married",..: 4 1 4 4 4 4 4 4 2 3 ...
 $ Other.debtors.guarantors                                : Factor w/ 3 levels "co-applicant",..: 3 3 3 2 3 3 3 3 3 3 ...
 $ Present.residence.since                                 : int  4 2 3 4 4 4 4 2 4 2 ...
 $ Property                                                : Factor w/ 4 levels "if not A121 : building society savings agreement/life insurance",..: 3 3 3 1 4 4 1 2 3 2 ...
 $ Age.in.years                                            : int  67 22 49 45 53 35 53 35 61 28 ...
 $ Other.installment.plans                                 : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Housing                                                 : Factor w/ 3 levels "for free","own",..: 2 2 2 1 1 1 2 3 2 2 ...
 $ Number.of.existing.credits.at.this.bank                 : int  2 1 1 1 2 1 1 1 1 2 ...
 $ Job                                                     : Factor w/ 4 levels "management/ self-employed/highly qualified employee/ officer",..: 2 2 4 2 2 4 2 1 4 1 ...
 $ Number.of.people.being.liable.to.provide.maintenance.for: int  1 1 2 2 2 2 1 1 1 1 ...
 $ Telephone                                               : Factor w/ 2 levels "none","yes, registered under the customers name": 2 1 1 1 1 2 1 2 1 1 ...
 $ foreign.worker                                          : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ Good.Loan                                               : Factor w/ 2 levels "BadLoan","GoodLoan": 2 1 2 2 1 2 2 2 2 1 ...

Lists: Data frames

Data frame attributes include column names (names) and row names (row.names)
You can create a data frame with the data.frame command

> x.df <- data.frame(x=x,double.x=x*2)
> x.df
  x double.x
1 8       16
2 2        4
3 1        2
4 3        6
> attributes(x.df)  # note: attributes returns a list!
$names
[1] "x"        "double.x"

$class
[1] "data.frame"

$row.names
[1] 1 2 3 4

Data types: factors

A factor is a class for categorical data that is stored as integers
- The levels attribute is a character vector of the possible values
- Values are stored as integer codes (1=first level, 2=second level, etc.)

> class(loans$Good.Loan)
[1] "factor"
> typeof(loans$Good.Loan)
[1] "integer"
> levels(loans$Good.Loan)
[1] "BadLoan"  "GoodLoan"
> str(loans$Good.Loan)
 Factor w/ 2 levels "BadLoan","GoodLoan": 2 1 2 2 1 2 2 2 2 1 ...
> head(loans$Good.Loan)
[1] GoodLoan BadLoan  GoodLoan GoodLoan BadLoan  GoodLoan
Levels: BadLoan GoodLoan
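
A small self-contained sketch of the integer codes behind a factor. The three values are made up to resemble the Good.Loan column, not taken from the data.

```r
status <- factor(c("GoodLoan", "BadLoan", "GoodLoan"))   # made-up values

status.levels <- levels(status)       # "BadLoan" "GoodLoan": sorted by default
status.codes  <- as.integer(status)   # 2 1 2: position of each value in levels
status.labels <- as.character(status) # back to the original text
```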

Object Oriented Programming

In R, many functions are generic: what they do depends on the class (and type) of the object you give them
Ex: For a factor, what type of summary would be helpful?

> summary(loans$Good.Loan)
 BadLoan GoodLoan 
     300      700

What about for a numeric?

> summary(loans$Duration.in.month)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0    12.0    18.0    20.9    24.0    72.0

In your Console window, type ?summary then hit tab.
- see summary.default, summary.factor, …
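
The naming pattern summary.default, summary.factor, ... is how R picks a version of a function based on class. Here is a minimal sketch with a hypothetical generic called describe (not a base R function), defined only to illustrate the mechanism.

```r
# describe is a made-up generic; UseMethod picks a method by the object's class
describe <- function(obj, ...) UseMethod("describe")
describe.factor  <- function(obj, ...) paste("factor with", nlevels(obj), "levels")
describe.numeric <- function(obj, ...) paste("numeric vector of length", length(obj))
describe.default <- function(obj, ...) "some other kind of object"

d.factor  <- describe(factor(c("a", "b")))   # uses describe.factor
d.numeric <- describe(c(1, 2, 3))            # uses describe.numeric
d.other   <- describe("abc")                 # falls back to describe.default
```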

Coercion

Entries in atomic vectors must be the same data type
If more than one type is given, R coerces every entry to the most flexible type present (logical < integer < double < character)

> y <- c(1,2,"a")
> y
[1] "1" "2" "a"
> typeof(y)
[1] "character"
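
A short sketch that walks up the coercion ordering one step at a time, plus an explicit conversion back to numbers. The values are arbitrary.

```r
t1 <- typeof(c(TRUE, 1L))     # "integer": logical is coerced to integer
t2 <- typeof(c(1L, 2.5))      # "double": integer is coerced to double
t3 <- typeof(c(2.5, "a"))     # "character": double is coerced to character

back <- as.numeric(c("1", "2"))   # explicit coercion from text to numbers: 1 2
```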

Coercion

Logical values are coerced to 0 for FALSE and 1 for TRUE when combined with numbers

> z <- c(TRUE, FALSE, TRUE, 7)
> z   # TRUE = 1, FALSE = 0
[1] 1 0 1 7
> typeof(z)
[1] "double"

Logical vectors are also coerced into numeric when applying math functions

> x
[1] 8 2 1 3
> x >= 5  # which entries >= 5?
[1]  TRUE FALSE FALSE FALSE
> sum(x >= 5)  # how many >=5 ?
[1] 1
> mean(x >= 5) # proportion of entries >=5
[1] 0.25

Subsetting: Atomic Vector

subset with [] by referencing index value (from 1 to vector length):

> x
[1] 8 2 1 3
> x[c(4,2)]  # get 4th and 2nd entries
[1] 3 2

subset by omitting entries

> x[-c(4,2)]  # omit 4th and 2nd entries
[1] 8 1

subset with a logical vector

> x[c(TRUE,FALSE,TRUE,FALSE)]  # get 1st and 3rd entries
[1] 8 1
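
In practice the logical vector is usually produced by a condition rather than typed out. A brief sketch with the same x:

```r
x <- c(8, 2, 1, 3)

big <- x[x > 1]        # 8 2 3: keep entries where the condition is TRUE
idx <- which(x > 1)    # 1 2 4: the positions of those entries
```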

Subsetting: Matrices

access entries using subsetting [row,column]

> x.mat2
     x   
[1,] 8 16
[2,] 2  4
[3,] 1  2
[4,] 3  6
> x.mat2[,1] # first column
[1] 8 2 1 3
> x.mat2[1:2,1] # first 2 rows of first column
[1] 8 2

R doesn’t always preserve class:

> class(x.mat2[1,])  # one row (or col) is no longer a matrix (1D)
[1] "numeric"
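
If you need the result to stay a matrix, the drop argument of [ can be set to FALSE. A sketch using the same x.mat2 (rebuilt here so the block runs on its own):

```r
x <- c(8, 2, 1, 3)
x.mat2 <- cbind(x, 2 * x)

row1.vec <- x.mat2[1, ]                  # dimensions dropped: a plain vector
row1.mat <- x.mat2[1, , drop = FALSE]    # dimensions kept: a 1 x 2 matrix

vec.is.matrix <- is.matrix(row1.vec)     # FALSE
mat.dim <- dim(row1.mat)                 # 1 2
```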

Subsetting: Data frames

you can access entries like a matrix:

> x.df
  x double.x
1 8       16
2 2        4
3 1        2
4 3        6
> x.df[,1]  # first column, all rows
[1] 8 2 1 3
> class(x.df[,1])  # first column is no longer a data frame
[1] "numeric"

A single column extracted with [ , j] is returned as a vector, not a data frame

Subsetting: Data frames

but one row with 2 or more columns (variables) is still a data frame

> x.df[1, 1:2]  # 1 row, 2 columns
  x double.x
1 8       16
> class(x.df[1, 1:2])
[1] "data.frame"

Remember that data frames are built from lists of variables (vectors) so they are different objects than matrices
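
Because a data frame is a list, there are several ways to pull out one column, and they do not all return the same kind of object. A sketch with x.df rebuilt from the earlier slides:

```r
x.df <- data.frame(x = c(8, 2, 1, 3), double.x = c(16, 4, 2, 6))

col.vec  <- x.df[, 1]                 # matrix-style: numeric vector
col.df1  <- x.df[, 1, drop = FALSE]   # matrix-style, keeping the data frame
col.df2  <- x.df["x"]                 # list-style [ ]: a one-column data frame
col.vec2 <- x.df[["x"]]               # list-style [[ ]]: the vector itself
```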

Subsetting: Data frames

or access columns with $

> x.df$x  # get variable x column
[1] 8 2 1 3

you can also use column names to subset:

> loans[1:2,c("Good.Loan","Credit.amount")] # get 2 rows of Good.Loan and Credit.amount
  Good.Loan Credit.amount
1  GoodLoan          1169
2   BadLoan          5951
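
Rows can also be selected with a logical condition on a column. This sketch uses the small x.df rather than loans so that it runs without downloading anything.

```r
x.df <- data.frame(x = c(8, 2, 1, 3), double.x = c(16, 4, 2, 6))

keep.rows <- x.df[x.df$x > 1, ]             # rows 1, 2 and 4, all columns
keep.vals <- x.df[x.df$x > 1, "double.x"]   # 16 4 6: one column of those rows
```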

Subsetting: Lists

Recall: a list is a vector with entries that can be different object types

> my.list <- list(myVec=x, myDf=x.df, myString=c("hi","bye"))
> my.list
$myVec
[1] 8 2 1 3

$myDf
  x double.x
1 8       16
2 2        4
3 1        2
4 3        6

$myString
[1] "hi"  "bye"

Subsetting: Lists

The single [] operator selects the entry at the given location but returns it wrapped in a list
my.list[2] returns a list of length one with entry myDf

> my.list[2]
$myDf
  x double.x
1 8       16
2 2        4
3 1        2
4 3        6
> str(my.list[2])
List of 1
 $ myDf:'data.frame':   4 obs. of  2 variables:
  ..$ x       : num [1:4] 8 2 1 3
  ..$ double.x: num [1:4] 16 4 2 6

Subsetting: Lists

The double [[]] operator gives you the object stored at that location
- can enter location number or entry name
my.list[[2]] and my.list[["myDf"]] both return the data frame myDf

> my.list[[2]]
  x double.x
1 8       16
2 2        4
3 1        2
4 3        6
> str(my.list[[2]])
'data.frame':   4 obs. of  2 variables:
 $ x       : num  8 2 1 3
 $ double.x: num  16 4 2 6
> my.list[["myDf"]]
  x double.x
1 8       16
2 2        4
3 1        2
4 3        6

Subsetting: Lists

As with a data frame, you can use $ to access objects stored in a list
- equivalent to using [[]] with the entry name
my.list$myDf returns the data frame myDf

> my.list$myDf
  x double.x
1 8       16
2 2        4
3 1        2
4 3        6
> class(my.list$myDf)
[1] "data.frame"
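
The operators can be chained to reach inside nested objects. A sketch with my.list rebuilt from the earlier slides:

```r
x <- c(8, 2, 1, 3)
x.df <- data.frame(x = x, double.x = x * 2)
my.list <- list(myVec = x, myDf = x.df, myString = c("hi", "bye"))

inner.col <- my.list$myDf$double.x                 # 16 4 2 6: column inside the data frame
second.str <- my.list[["myString"]][2]             # "bye": 2nd entry of the character vector
sub.list <- my.list[c("myVec", "myString")]        # single [ ] with names: a list of 2
```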

Functions

A function takes in objects and arguments and produces a new object
Here are the mean methods available in R (the list depends on which packages are loaded):

> methods(mean)
[1] mean.Date     mean.default  mean.difftime mean.POSIXct  mean.POSIXlt 
see '?methods' for accessing help and source code

Functions

the default mean function code is here:

> mean.default
function (x, trim = 0, na.rm = FALSE, ...) 
{
    if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
        warning("argument is not numeric or logical: returning NA")
        return(NA_real_)
    }
    if (na.rm) 
        x <- x[!is.na(x)]
    if (!is.numeric(trim) || length(trim) != 1L) 
        stop("'trim' must be numeric of length one")
    n <- length(x)
    if (trim > 0 && n) {
        if (is.complex(x)) 
            stop("trimmed means are not defined for complex data")
        if (anyNA(x)) 
            return(NA_real_)
        if (trim >= 0.5) 
            return(stats::median(x, na.rm = FALSE))
        lo <- floor(n * trim) + 1
        hi <- n + 1 - lo
        x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]
    }
    .Internal(mean(x))
}
<bytecode: 0x0000000038d12b70>
<environment: namespace:base>

Functions

Function creation is not something we will emphasize in this Data Science course, but you should be able to create simple functions

my.function <- function(arguments)
{
  code that does the work
  
  return(list(objects that you are returning))
}

If you are returning only one object, you don’t need the list function in return
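
A minimal concrete instance of the template, returning a single object so no list is needed. RangeWidth is a hypothetical name chosen for illustration.

```r
# Hypothetical helper: distance between the largest and smallest value
RangeWidth <- function(x)
{
  width <- max(x) - min(x)
  return(width)
}

w <- RangeWidth(c(8, 2, 1, 3))   # 7
```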

Functions

This function returns the mean and SD of a vector and (optionally) creates a histogram
- input: variable x, plotting argument plot (default is FALSE)
- ...: optional arguments passed on to both mean and sd (for example na.rm = TRUE)
- output: list of mean and sd

> MeanSD <- function(x,plot=FALSE,...)
+ {
+   mean.x <- mean(x,...)
+   sd.x <- sd(x,...)
+   if (plot) 
+     hist(x)
+   return(list(Mean=mean.x,SD=sd.x))
+ }

Functions

A simple example:

> MeanSD(loans$Credit.amount)
$Mean
[1] 3271.258

$SD
[1] 2822.737

Function output is a list:

> str(MeanSD(loans$Credit.amount))
List of 2
 $ Mean: num 3271
 $ SD  : num 2823
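
Since the output is a list, its pieces can be pulled out with $ or [[]]. The sketch below repeats the MeanSD definition so it runs on its own, and uses a made-up vector with a missing value to show an argument travelling through the dots.

```r
MeanSD <- function(x, plot = FALSE, ...)
{
  mean.x <- mean(x, ...)
  sd.x <- sd(x, ...)
  if (plot)
    hist(x)
  return(list(Mean = mean.x, SD = sd.x))
}

v <- c(8, 2, NA, 3)                  # made-up vector with a missing value
out.na <- MeanSD(v)                  # Mean and SD are both NA
out.ok <- MeanSD(v, na.rm = TRUE)    # na.rm is passed through ... to mean and sd

mean.ok <- out.ok$Mean               # 13/3, about 4.33
sd.ok <- out.ok[["SD"]]              # same access using [[ ]]
```

Only arguments that both mean and sd accept are safe to pass this way, since the dots are forwarded to each of them.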

Functions

Add plotting argument:

> MeanSD(loans$Credit.amount, plot=TRUE)

$Mean
[1] 3271.258

$SD
[1] 2822.737