Reproducibility and R Markdown

Day 2

Quick Comment: Re-compiling Lecture Slides

While click-through slides are useful for presentations, they may not be the best for review
You can Knit the .Rmd file for the slides into a more convenient format
- You can turn incremental off
- You can output it to an md file
- You can output it as beamer_presentation

Quick Comment: Files on GitHub

Log onto your test-assignment GitHub repo

Last ditch homework submission:
- If you have problems pushing your work, upload it to the appropriate repo with the Upload Files button
“I don’t want to clone your repository but I want your code”:
- Click on test-activity.Rmd. GitHub recognizes basic Markdown formatting, but can’t run R code chunks
- To view .Rmd code: Click the Raw button on the right side
- To download: Right click, select “Save As”, and save it as a .Rmd (not .txt) file.
- OR, just copy and paste into a new R Markdown file.
Don’t edit submitted assignment files after submission!

Replicability vs. Reproducibility

Scientific findings should be replicable
- Ann repeats Bob’s lab experiment and gets different data but reaches the same conclusions as Bob
Statistical findings should be reproducible
- Ann takes Bob’s data and gets the exact same statistical results as Bob
Statistical findings should be easily reproducible
- Ann only needs to hit one button to reproduce Bob’s results.
- Ann only needs to hit one button to reproduce Bob’s analysis on a new data set

Email from your boss:

Hi [Insert your name]
Your report was very interesting, but you used an old version of the data.  
Our newest loan data has 1000 loan cases, not 700 like you used.  

I'm really sorry you did all that work on the wrong dataset, but I expect you will 
easily be able to rerun your analysis on the newest data and get me a new report shortly. 

LOL, 
Barb

Open up RStudio

Open your test-assignment RStudio Project
- Click on the Files tab (lower right window)
- Check the box by test-assignment.Rmd
- Select menu More > Copy and rename it Barb.Rmd
Make this change to your first R chunk then knit your .Rmd:

> loans <- read.csv("https://raw.githubusercontent.com/mgelman/data/master/CreditData.csv")

This loads the correct (1000 case) data set.
Do your written answers match the R results?!

Full disclosure to avoid plagiarism charges

Some of my ideas and comments in these slides were taken from Karl Broman’s talk on “Steps toward Reproducible Research” that he presented at JSM 2016 in Chicago.
And from Yihui Xie’s (creator of knitr) evil classroom assignment idea.

Making your work reproducible

You need a scriptable program (e.g. R, Python)
- forces you to record the linear sequence of events in an analysis
- should be able to retrace your steps
- avoid point-n-click!
- avoid any “by hand” actions (e.g. data cleaning in Excel)

But scriptable doesn’t always mean reproducible

You should make your workflow (“soup to nuts”) transparent and easily followed
- meaningful file and variable names
- don’t overly complicate code, use packages only when needed (the fewer dependencies the better)
- only relevant code included
- written description of your analysis process and results alongside your code

Why should we strive for reproducibility?

Karl Broman paraphrasing Mark Holder:
- Your closest collaborator is you six months ago, but you don’t reply to emails.
- You need to document your workflow for both yourself and current/future collaborators
Can you open one of your old stats homework scripts and understand
- the data used?
- the questions you were trying to answer?
- what your results mean?
Statistical Consulting
- from term to term we can continue, repeat or redo projects!

Reproducibility using R Markdown

R Markdown is a literate programming tool that integrates R code, results and write-up.
- literate = code and explanation sit together in one readable document, and it is easy to learn
There are other programs, like Sweave, that also integrate R code, results and write-up.
- not so readable and steeper learning curve

R Markdown: how it works

You create a .Rmd file
You click the Knit button in RStudio (or run the rmarkdown package command render("my.Rmd")).
A compiled html/pdf/word/… document magically appears

Behind the scenes:
- rmarkdown uses the knitr package to run the R chunks and create a .md (Markdown) file
- the document converter pandoc takes the .md file and creates an html/pdf/word… document!
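
The same two steps can be run by hand from the R console. The sketch below is only an illustration: it assumes the knitr and rmarkdown packages and pandoc are installed (RStudio bundles pandoc), and the tiny demo file it writes to a temporary folder is a hypothetical stand-in for your own .Rmd.

```r
library(knitr)
library(rmarkdown)

# Build a tiny demo .Rmd in a temporary folder (hypothetical stand-in file)
fence <- strrep("`", 3)  # three backticks, built in code to keep this listing tidy
rmd_file <- file.path(tempdir(), "my.Rmd")
writeLines(c(
  "---",
  "title: \"Demo\"",
  "output: html_document",
  "---",
  "",
  "Some text before the chunk.",
  "",
  paste0(fence, "{r}"),
  "sqrt(77)",
  fence
), rmd_file)

# Step 1 only: knitr runs the R chunk and writes a Markdown (.md) file
md_file <- knit(rmd_file, output = file.path(tempdir(), "my.md"), quiet = TRUE)

# Steps 1 and 2 together: render() knits, then calls pandoc to build the html file
html_file <- render(rmd_file, quiet = TRUE)

# Both functions return the path of the file they created
md_file
html_file
```

Opening my.md in a text editor shows the intermediate Markdown: the chunk's code and its printed result are already in place before pandoc does any formatting.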

Writing an R Markdown document

1. Formatting text

References: Hadley’s page, RStudio page, RStudio cheatsheet
Simple rules for
- section headers (#,##,etc)
- lists (need ~2 tabs to create sublists)
- formatting (bold **, italics *)
- tables
- R syntax (use a backtick `)
- web links [linked text](url)
- LaTeX math equations \(\beta_1 + \beta_2\)
- in HTML docs, you can use HTML commands (in pdf docs, LaTeX commands)

Writing an R Markdown document

2. R chunks

Begin with ```{r} and end with ```
Can run code within chunk in the R console
- as an entire chunk
- line by line

Writing an R Markdown document

2. R chunk options

many options control the output produced/shown by a chunk

```{r, options}
code stuff
```

echo=FALSE omits R code but not results
include=FALSE omits R code and results
eval=FALSE code chunk is not run, but code displayed
results='hide' or fig.show='hide' runs code but suppresses output
message=FALSE or warning=FALSE suppresses messages/warnings in output
error=TRUE knits doc even with a code error
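
One way to see what an option does is to knit the same small chunk under different settings and compare the Markdown that comes back. This is a minimal sketch, assuming the knitr package is installed; `knit_chunk()` is a hypothetical helper written for this illustration, not a knitr function.

```r
library(knitr)

fence <- strrep("`", 3)  # three backticks, built in code to keep this listing tidy

# Hypothetical helper: knit a two-line chunk with the given chunk options
# and return the resulting Markdown as a character string
knit_chunk <- function(options) {
  chunk <- c(
    paste0(fence, "{r, ", options, "}"),
    "x <- 1 + 1",
    "x",
    fence
  )
  knit(text = chunk, quiet = TRUE)
}

out_echo_off <- knit_chunk("echo=FALSE")     # result is shown, code is not
out_eval_off <- knit_chunk("eval=FALSE")     # code is shown, but it is never run
out_hidden   <- knit_chunk("include=FALSE")  # code is run, nothing is shown

cat(out_echo_off)
cat(out_eval_off)
```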

Writing an R Markdown document

2. R chunk global options

Set global document chunk options at the start of your file:

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse=TRUE, prompt=TRUE, comment=NULL, 
eval=TRUE, message=FALSE, include=TRUE)
```

eval=TRUE, include=TRUE, message=FALSE evaluate R chunks and include results, but suppress (package) messages
prompt=TRUE shows the > prompt in front of commands
comment=NULL removes the ## comment marker in front of output
collapse=TRUE merges the code and results of a chunk into a single block in the output

Writing an R Markdown document

2. R chunk names

Name your code chunks:

```{r ChunkName1} 
my fancy code chunk
```

named chunks appear in the chunk drop-down menu (lower left of the RStudio editor pane)
can reuse a named chunk by repeating its name on an empty chunk

```{r ChunkName1} 
```
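
Chunk reuse can be checked from the console by knitting a short document in which the second chunk is empty and carries the same name as the first. This is a sketch under the assumption that knitr is installed; the chunk contents are made up for illustration.

```r
library(knitr)

fence <- strrep("`", 3)  # three backticks, built in code to keep this listing tidy

doc <- c(
  paste0(fence, "{r ChunkName1}"),
  "total <- sum(1:10)",
  "total",
  fence,
  "",
  "Some text in between.",
  "",
  # Empty chunk with the same name: knitr fills in the code of ChunkName1
  paste0(fence, "{r ChunkName1}"),
  fence
)

out <- knit(text = doc, quiet = TRUE)
cat(out)  # the code and its result should appear before and after the text line
```

The reused chunk must stay empty. Giving two chunks the same name when both contain code makes knitr stop with a duplicate label error.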

Writing an R Markdown document

2. Inline code

You can embed one R code command in your text by wrapping it in `r followed by `
- .Rmd file: The square root of 77 is `r sqrt(77)`
- .md file: The square root of 77 is 8.7749644
Steps towards reproducibility
- Use inline commands when reporting results/stats in case your data changes…
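
The sketch below shows why inline code protects a write-up when the data changes. It assumes knitr is installed, and the three-row `loans` data frame with its `amount` column is a hypothetical stand-in, not the real loan file.

```r
library(knitr)

# Hypothetical stand-in for the loan data (made-up values and column name)
loans <- data.frame(amount = c(1200, 560, 3100))

# A sentence as it would be typed in the text of a .Rmd file
sentence <- "The data set has `r nrow(loans)` loan cases."

# knit() replaces the inline code with its current value
report_old <- knit(text = sentence, quiet = TRUE)

# New data arrives: the sentence is untouched, only the data changes
loans <- rbind(loans, data.frame(amount = c(800, 450)))
report_new <- knit(text = sentence, quiet = TRUE)

report_old
report_new
```

A count typed by hand would still describe the old data after the update, while the inline version is recomputed on every knit.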

Writing an R Markdown document

3. Header

The YAML header is always positioned at the top of a .Rmd, surrounded by ---
Basic elements are title, author and date.

Writing an R Markdown document

3. Header

output tells pandoc what form the final document should take
- html_document (can’t view in GitHub)
- pdf_document (need MiKTeX or MacTeX installed)
- github_document (creates a .md Markdown doc)
- ioslides_presentation, beamer_presentation
each output type can be further refined with formatting options
- e.g. for these slides I specify widescreen: true, incremental: true, and keep_md: false
- for documents, you might use fig_caption: true
for help: in RStudio select Help > Cheatsheets > R Markdown Reference Guide
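
Each output type is also an R function in the rmarkdown package, and its formatting options are that function's arguments. The sketch below, which assumes rmarkdown is installed, builds the slide format described above plus a review-friendly variant. The variable names and the file name my.Rmd are placeholders.

```r
library(rmarkdown)

# The slide options from the header, written as function arguments
slides_format <- ioslides_presentation(
  widescreen = TRUE,
  incremental = TRUE,
  keep_md = FALSE
)

# A variant for review: bullets are not revealed one click at a time,
# and the intermediate .md file is kept
review_format <- ioslides_presentation(incremental = FALSE, keep_md = TRUE)

# A document format with figure captions turned on
doc_format <- html_document(fig_caption = TRUE)

# Passing a format object to render() takes the place of the output field
# in the header ("my.Rmd" is a placeholder, so the call is left commented out)
# render("my.Rmd", output_format = review_format)
```

Typing ?ioslides_presentation or ?html_document in the console lists every option a format accepts.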

Final comments

Start working towards writing reproducible analyses
- doesn’t happen overnight
- using R Markdown is a good start!
- I don’t expect you to use inline code for all your assignments