Day 1

Course overview

  • See the syllabus for the complete breakdown

  • Typical class:
    • Short lecture on the concept we will focus on
    • In-class exercise to apply the concepts we just covered
  • Evaluation:
    • Class attendance/participation [5%]
    • Problem sets (0/1/2 scale) [15%]
    • Team projects [15%]
    • 2 midterm exams [40%]
    • Final project [25%]
  • Note that 40% of your grade comes from exams, where you have to answer questions and write code in real time

How to succeed

  • Stay up to date with the course outline (I will update regularly)
  • Come to class prepared and on time
  • Ask questions if you start getting lost
    • In class
    • Tutor OH
    • My OH

What the heck is “Data Science”?

Data Science: Google search and Google Trends.

How popular are “Data Science” jobs?

Data Science via the American Statistical Association

  • "While there is not yet a consensus on what precisely constitutes data science, three professional communities, all within computer science and/or statistics, are emerging as foundational to data science:
    • Database Management enables transformation, conglomeration, and organization of data resources;
    • Statistics and Machine Learning convert data into knowledge; and
    • Distributed and Parallel Systems provide the computational infrastructure to carry out data analysis."

Data Science via the American Statistical Association

  • “Statistics and machine learning play a central role in data science.”
  • “New problem-solving strategies are needed to develop ‘soup to nuts’ pipelines that start with managing raw data and end with user-friendly efficient implementations of principled statistical methods and the communication of substantive results.”

Data Science in Economics

  • Not yet heavily used, but adoption is increasing: Stanford Data Science Initiative in Economics

  • Research in economics typically focuses on causal questions
    • What is the impact of education on labor outcomes?
    • What is the impact of fiscal policy on GDP?
  • When can we use Data Science tools to help inform economics?
    • When working with big data
      • Ex: Using text analysis algorithms to better understand Yelp or Twitter data
    • When working with predictive problems
      • Ex: Using a random forest model to predict outcomes
    • When helping us acquire data to formulate our question
      • Ex: Using scraping and wrangling tools to acquire data on the web

Data Science in Economics

  • For those interested in academia:
    • Data science tools help us create datasets and analyze them in order to conduct economic research
  • For those interested in private sector jobs:
    • Most companies expect new graduates to have some ability to extract, transform, load, clean, analyze, model, and ‘tell the story’ of their results

Some warnings to heed

  • These days it’s almost TOO EASY to implement complicated statistical algorithms with one line of code.
    • This makes it easy to create fancy graphs and results without really understanding what you are doing
  • One advantage of learning data science in an educational institution is that you can better understand what is going on under the hood
    • Once we get to the statistical analysis portion, make sure you also understand the statistics well

This class

Focus on the “soup to nuts” approach to problem solving

  • Data wrangling
    • Reshaping, cleaning, gathering
  • Learning from data
    • Data visualization tools
    • Statistical learning methods
    • Network data, spatial data
  • Communication
    • Reproducibility
    • Effective visualization
  • Examples

Using R for data science

  • Many data science teams use multiple languages; R and Python are the most common
  • We will use R for this course
  • Hadley Wickham: best to master one tool at a time.

Using R Markdown for data science

  • You will use R Markdown for all work in this class
  • An R Markdown (.Rmd) file contains
    • R code
    • written answers, description of results, report, etc.
  • The R Markdown file is knit to generate an output document
    • pdf, html, word
    • presentations (html, beamer pdf)
    • dashboards, interactive graphics (html)
  • R Markdown is designed for reproducibility!
  • The slides I produce for this class use R Markdown’s ioslides format

Using GitHub and RStudio for data science

  • Class assignments and projects will be submitted using GitHub
  • Git is version control software that allows for easy collaboration on projects
  • GitHub is an online “hub” where git-controlled files are stored in repositories and (possibly) shared with others

Using GitHub and RStudio for data science

  • RStudio lets you create git-controlled projects
    • create a GitHub repo
    • make an RStudio project using your cloned repo
    • edit/create files (.Rmd, .R, .csv, …)
    • commit changes to your local repository using git
    • push changes to the GitHub repo (online)
    • pull changes made by others to your computer
  • What you need to do
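
The clone, edit, commit, push, and pull steps listed above can also be run from a terminal. The following is only a minimal sketch, not the required procedure for this class: so that it runs offline, it uses a local bare repository (remote.git) as a stand-in for a GitHub repo, and the user name, email, and commit message are placeholders. With a real GitHub repo you would clone its URL instead, and RStudio's Git pane performs the same commit, push, and pull steps.

```shell
#!/bin/sh
set -e

# Work in a scratch directory so nothing here touches real projects.
work=$(mktemp -d)
cd "$work"

# Stand-in for the GitHub repo (assumption): a local bare repository.
git init --quiet --bare remote.git

# Clone the repo; an RStudio project would be created from this folder.
git clone --quiet remote.git test-assignment
cd test-assignment

# Placeholder identity, set for this repository only.
git config user.name "Student Name"
git config user.email "student@example.com"

# Edit/create a file.
printf '# Test activity\n' > test-activity.Rmd

# Commit the change to the local repository.
git add test-activity.Rmd
git commit --quiet -m "Add test-activity.Rmd"

# Push the commit to the remote, then pull any changes made by others.
branch=$(git rev-parse --abbrev-ref HEAD)
git push --quiet origin "$branch"
git pull --quiet origin "$branch"

# Show the history now shared with the remote.
git log --oneline
```

Committing only records the change on your own computer; the work is not visible on the remote until it has been pushed.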

For the rest of class

  • Set up R, RStudio

  • Set up Git, GitHub

  • Work on the test-activity.Rmd file in the test-assignment repo
    • Ask me questions
    • By class time (2:45) Thursday, push your completed test-activity.Rmd and test-activity.md files to GitHub
    • This assignment will count as extra credit towards your problem set score