Speed-data-ing with R

Alex Vlasiuk

October 3, 2017

“The greatest value of a picture is when it forces us to notice what we never expected to see.”
— John Tukey

Benefits of plotting things

pretty pictures
reproducible conclusions
compact way of communicating information
informed choices

Miles per gallon data

library(ggplot2)  # provides the mpg dataset
str(mpg)

## Classes 'tbl_df', 'tbl' and 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4" "a4" ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr  "f" "f" "f" "f" ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...

colnames(mpg)

##  [1] "manufacturer" "model"        "displ"        "year"        
##  [5] "cyl"          "trans"        "drv"          "cty"         
##  [9] "hwy"          "fl"           "class"

Pairs of variables

Miles per gallon plotted

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +  
  xlab("Engine volume") + ylab("Highway miles per gallon")

MPG: color for class

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +  
  xlab("Engine volume") + ylab("Highway miles per gallon")

MPG: color for no. of cylinders

ggplot(mpg, aes(displ, hwy, color = factor(cyl))) + geom_point() +
   xlab("Engine volume") + ylab("Highway miles per gallon")

MPG: size for no. of cylinders

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy,  size = cyl), alpha = .4 ) +  
  xlab("Engine volume") + ylab("Highway miles per gallon")

MPG: Facetting

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~class)+
  xlab("Engine volume") + ylab("Highway miles per gallon")

MPG: Modelling

ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("Engine volume") + ylab("Highway miles per gallon")
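With color = factor(cyl) set in the global aesthetics, geom_smooth(method = "lm") fits one straight line per cylinder group. A rough base-R sketch of the same per-group fits follows; the name fits_by_cyl is illustrative only, and the confidence bands that geom_smooth draws are not reproduced here.

```r
library(ggplot2)  # only needed for the mpg data

# One linear model hwy ~ displ per cylinder count, mirroring the grouping
# that color = factor(cyl) induces in geom_smooth(method = "lm")
fits_by_cyl <- sapply(
  split(mpg, mpg$cyl),
  function(d) coef(lm(hwy ~ displ, data = d))
)

# Rows: intercept and slope; columns: one per value of cyl.
# A slope can come out as NA if a group has no variation in displ.
fits_by_cyl
```

Each column should correspond to one of the coloured lines in the plot.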

MPG: Smoothing

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "loess") +
  xlab("Engine volume") + ylab("Highway miles per gallon")

LOESS

?loess

Description 
Fit a polynomial surface determined by one or more numerical predictors,
using local fitting. 
Usage 
loess(formula, data, weights, subset, na.action, model = FALSE,
      span = 0.75, enp.target, degree = 2,
      parametric = FALSE, drop.square = FALSE, normalize = TRUE,
      family = c("gaussian", "symmetric"),
      method = c("loess", "model.frame"),
      control = loess.control(...), ...)
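As a minimal sketch of calling loess() directly rather than through geom_smooth(), the snippet below fits the highway mileage curve with the default span and degree from the usage above and evaluates it on a small grid. The names fit, grid and hwy_hat are illustrative assumptions, not part of the slides.

```r
library(ggplot2)  # only needed for the mpg data

# Local quadratic fit (degree = 2) using 75% of the points around each x
fit <- loess(hwy ~ displ, data = mpg, span = 0.75, degree = 2)

# Evaluate the fitted curve on a grid inside the observed range of displ
grid <- data.frame(displ = seq(2, 6, by = 0.5))
grid$hwy_hat <- predict(fit, newdata = grid)
grid
```

A smaller span follows the data more closely; a larger one gives a smoother curve.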

LOESS vs linear model

LOESS vs linear model: code

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(aes(colour = "loess"), method = "loess", se = FALSE) +
  geom_smooth(aes(colour = "lm"), method = "lm", se = FALSE) +
  labs(colour = "Method") +
  xlab("Engine volume") + ylab("Highway miles per gallon")

MPG: Smoothing & facetting

ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(method = "loess") +
  facet_wrap(~year) + xlab("Engine volume") + ylab("Highway miles per gallon")

Summary of R/ggplot2 workflow

Import and clean data
Visualize, check for dependencies
Build models, produce new data
Repeat

ggplot2 syntax

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION> +
  <VISUAL THEMING FUNCTION>
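One possible way to fill in the template, using the mpg data from the earlier slides. The particular choices (jittered points, facets by drive type, a minimal theme) are only an illustration; any geom, coordinate, facet, or theme function can take their place.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),  # <MAPPINGS>
    position = "jitter"                 # <POSITION>
  ) +
  coord_cartesian(ylim = c(10, 45)) +   # <COORDINATE_FUNCTION>
  facet_wrap(~drv) +                    # <FACET_FUNCTION>
  theme_minimal()                       # <VISUAL THEMING FUNCTION>

p  # printing the object draws the plot
```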

Politics and data

Using the built-in “economics” dataset, we can visualize the unemployment level:

ggplot(economics, aes(date, unemploy)) + 
  geom_line() + 
  xlab("Timeline") + ylab("Unemployment level, thousands")

Housing and data

The txhousing dataset in ggplot2: information about the Texas housing market, provided by the TAMU Real Estate Center,
https://recenter.tamu.edu/.
A data frame with 8602 observations and 9 variables:
city
  Name of MLS area
year, month, date
  Date
sales
  Number of sales
volume
  Total value of sales
median
  Median sale price
listings
  Total active listings
inventory
  "Months inventory": amount of time it would take to sell all current listings at the current pace of sales.

Grouped by city

ggplot(txhousing, aes(date, sales)) +
  geom_line(aes(group = city), alpha = 1/2)

Problems: a strong seasonal pattern within each year; big cities dwarf small ones on a linear scale.

Log-scale

ggplot(txhousing, aes(date, log(sales))) +
  geom_line(aes(group = city), alpha = 1/2)

Remove the trend: smaller city

We fit a linear model with month as a categorical predictor and keep the residuals, which removes the monthly (seasonal) pattern.

library(dplyr)  # for %>% and filter()
abilene <- txhousing %>% filter(city == "Abilene")
ggplot(abilene, aes(date, log(sales))) +
  geom_line()

Trend removed: smaller city

mod <- lm(log(sales) ~ factor(month), data = abilene)
abilene$rel_sales <- resid(mod)
ggplot(abilene, aes(date, rel_sales)) +
  geom_line()
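Because month is the only predictor and it enters as a factor, the fitted value for each observation is just the mean of log(sales) for that calendar month, so the residuals are log(sales) minus the monthly mean. A small check of this, with month_means and max_diff as illustrative names and rows with missing sales dropped as a precaution:

```r
library(ggplot2)  # provides txhousing

abilene <- subset(txhousing, city == "Abilene" & !is.na(sales))
mod <- lm(log(sales) ~ factor(month), data = abilene)

# Mean of log(sales) within each calendar month, repeated per row
month_means <- ave(log(abilene$sales), abilene$month)

# Residuals from the model vs. manual de-meaning: should agree up to rounding
max_diff <- max(abs(resid(mod) - (log(abilene$sales) - month_means)))
max_diff
```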

Processing

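# Fit the monthly model within each city; na.exclude keeps the residuals
# aligned with the rows, including those with missing sales
txhousing <- txhousing %>%
  group_by(city) %>%
  mutate(rel_sales = resid(lm(log(sales) ~ factor(month),
                              na.action = na.exclude)))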
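The na.action = na.exclude argument matters because mutate() needs one residual per row. A sketch on made-up toy data (the data frame d and its values are invented purely for illustration) shows the difference from the default handling of missing values:

```r
# Toy data with one missing sales value; not taken from txhousing
d <- data.frame(
  month = factor(c(1, 2, 1, 2, 1, 2)),
  sales = c(10, 20, NA, 25, 12, 22)
)

# Default (na.omit): the row with NA is dropped, so only 5 residuals return
r_default <- resid(lm(log(sales) ~ month, data = d))

# na.exclude: residuals are padded with NA back to the original 6 rows
r_exclude <- resid(lm(log(sales) ~ month, data = d, na.action = na.exclude))

length(r_default)
length(r_exclude)
```

With the default, the shorter vector could not be assigned back as a column of the same data frame.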

Trend removed

ggplot(txhousing, aes(date, rel_sales)) +
  geom_line(aes(group = city), alpha = 1/5) +
  geom_line(stat = "summary", fun = "mean", colour = "red")  # use fun.y in ggplot2 before 3.3.0

Data import and cleaning: speed dating dataset

sdating <- read.csv("Speed Dating Data.csv")

This dataset was compiled by Columbia Business School professors Ray Fisman and Sheena Iyengar for their paper “Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment”. It contains about 8400 records, one per participant per date, gathered at experimental speed dating events in 2002-2004. During the events, each attendee had a four-minute “first date” with every other participant of the opposite sex. At the end of the four minutes, participants were asked whether they would like to see their date again, and to rate their date on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests.

Age of participants

Age above 30

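# mymap is prepared elsewhere (not shown in these slides); it is a data frame
# with longitude, latitude, age and race columns.
# map_data() requires the maps package to be installed.
us <- map_data("state")
ggplot(mymap, aes(longitude, latitude)) +
  geom_polygon(data = us, aes(x = long, y = lat, group = group),
               color = "black", fill = NA, alpha = .5) +
  geom_point(aes(size = age), alpha = .7, color = "blue") +
  facet_wrap(~race) +
  coord_quickmap() +
  xlim(-100, -60) + ylim(25, 50)

# Same map, restricted to participants older than 30 (reuses us from above)
ggplot(filter(mymap, age > 30), aes(longitude, latitude)) +
  geom_polygon(data = us, aes(x = long, y = lat, group = group),
               color = "black", fill = NA, alpha = .5) +
  geom_point(aes(size = age), alpha = .7, color = "blue") +
  facet_wrap(~race) +
  coord_quickmap() +
  xlim(-100, -60) + ylim(25, 50)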

Importance of religion

Importance of religion >5

Caveat

too little data!
more useless comparisons

Pretty data visualizations

Possible uses of pretty data presentation

just about any type of publication: newspapers use it, so can you!
experimental input: searching for irregularities in research data (biology, nuclear physics, etc.)
experimental output: scientific authoring, publication-quality diagrams, plots
understanding public/government data
household uses: smart homes/fitbits/dishwashers/fridges/litter robots
server logs
email/document/family budget classification

More tools

D3 is JS-based and very cool https://github.com/d3/d3, examples https://bl.ocks.org/mbostock
Kaggle https://www.kaggle.com/ is where all kinds of datasets and data tools live
A plethora of publicly available data, see e.g. https://catalog.data.gov/dataset
R resources: Hadley Wickham’s publicly available books & code; “tidyverse” ecosystem
RStudio cheatsheets: Visualization

Thanks!

“The greatest value of a picture is when it forces us to notice what we never expected to see.”
— John Tukey