R Studio Conference 2019

This is a notebook summarizing what I learned at the R Studio Conference 2019. Links to all the talks and slide decks, including talks on how to use R in production at scale, are available at https://github.com/kbroman/RStudioConf2019Slides

Make sure the libraries being used in each section are installed.

Reproducible Examples with reprex

The reprex package allows you to create a minimal reproducible example that you can share when reporting an issue on GitHub or asking a question on Stack Overflow. Uncommenting and running the code chunk below will create a new web page with the code and its results that can be shared with others.

# reprex({
#   x <- 1:4
#   y <- 2:5
#   x + y
# })
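
By default reprex formats the output as GitHub-flavored markdown; the venue argument targets other formats (an aside beyond the talk; see the package documentation):

# reprex({
#   x <- 1:4
#   x + 1
# }, venue = "r")  #Renders a commented R script instead of markdown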

Categorical data in R

Factor variables in R can be idiosyncratic.

x <- c(20,20,10,40,20)
cat(x)
## 20 20 10 40 20

Converting this to a factor variable produces the following

xf <- factor(x)
xf
## [1] 20 20 10 40 20
## Levels: 10 20 40

Converting this factor back to numeric produces the following odd result, because as.numeric() returns the underlying integer level codes rather than the original values

x <- as.numeric(xf)
x
## [1] 2 2 1 3 2

The right way to carry out this conversion is as follows

x <- as.numeric(as.character(xf))
x
## [1] 20 20 10 40 20

One also needs to be careful when re-ordering the levels of a factor

ratings1 <- as.factor(c(rep('High',30),rep('Low',10),rep('Medium',20)))
ratings2 <- ratings1
ratings3 <- ratings1
table(ratings1)
## ratings1
##   High    Low Medium 
##     30     10     20
cat('Levels: \n')
## Levels:
cat(levels(ratings1))
## High Low Medium

We might want to change the order of the levels of this variable to Low, Medium, High.

The next couple of approaches do not yield the expected results: assigning to levels() relabels the existing levels in place rather than re-ordering them, so the counts end up attached to the wrong labels.

levels(ratings1) <- levels(ratings1)[c(2,1,3)]
table(ratings1)
## ratings1
##    Low   High Medium 
##     30     10     20
levels(ratings2) <- c('Low','Medium','High')
table(ratings2)
## ratings2
##    Low Medium   High 
##     30     10     20

The correct approach here would be as follows

ratings3 <- factor(ratings3,levels=c('Low','Medium','High'))
table(ratings3)
## ratings3
##    Low Medium   High 
##     10     20     30

The forcats package by Hadley Wickham makes working with factors much more straightforward.

library(forcats)
ratings1 <- as.factor(c(rep('High',30),rep('Low',10),rep('Medium',20)))
ratings1 <- fct_relevel(ratings1,c('Low','Medium','High'))
table(ratings1)
## ratings1
##    Low Medium   High 
##     10     20     30

Recoding the levels of a factor is also straightforward.

ratings1 <- fct_recode(ratings1,Poor='Low',Fair ='Medium',Good = 'High')
table(ratings1)
## ratings1
## Poor Fair Good 
##   10   20   30

Plenty of other useful functions are available in the forcats package.
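
For example, two commonly used helpers (a quick sketch):

fct_infreq(ratings1)       #Re-order levels by decreasing frequency
fct_lump(ratings1, n = 2)  #Lump the least frequent levels into "Other"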

Defensive Coding

The testthat and assertthat packages allow us to code defensively. This is a standard best practice in software engineering and makes debugging much easier.

For instance, if a function is designed to accept a scalar (i.e. a vector of length 1), you may want the function to throw an error when a longer vector is passed in, so that you can adjust the call appropriately.

library(assertthat)

is_odd <- function(x) {
  assert_that(is.numeric(x), length(x) == 1)
  x %% 2 == 1
}


print(is_odd(3))
## [1] TRUE
print(is_odd(2))
## [1] FALSE
#is_odd(c(1,4)) ##Throws an error##

The see_if function returns a logical value with the error message attached as an attribute, which allows execution to continue.

x <- c(1,2)
y <- 'a'
see_if(is.numeric(x), length(x) == 1)
## [1] FALSE
## attr(,"msg")
## [1] "length(x) not equal to 1"
see_if(is.numeric(y), length(y) == 1)
## [1] FALSE
## attr(,"msg")
## [1] "y is not a numeric or integer vector"

The testthat package provides functionality that is essential to software testing. This is useful if you are building packages or writing code that is meant to be used in production.

library(testthat)

#Check for equality within numerical tolerance
expect_that(10,equals(10+1e-7))
###Check for exact equality - Throws an error
#expect_that(2*5, is_identical_to(10 + 1e-7))
model <- lm(mpg~wt,data=mtcars)
expect_that(model,is_a("lm"))
###Throws an error
#expect_that(model,is_a("glm"))
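
Newer testthat code tends to call the expectation functions directly; the passing checks above could equivalently be written as follows (a sketch using the current testthat API):

expect_equal(10, 10 + 1e-7)    #Equality within numerical tolerance
expect_s3_class(model, "lm")   #Check the class of the fitted model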

If you are writing a function, you can use this to ensure that it produces warnings when expected.

#  The two expectations below pass
expect_that(log(-1), gives_warning())
expect_that(log(-1),
gives_warning("NaNs produced"))

# This one fails if run
#expect_that(log(0), gives_warning())

New Features in R Studio 1.2

R Studio 1.2 allows you to incorporate SQL, Python, Rcpp, Stan, etc. seamlessly into your workflow. The reticulate package allows you to use Python within R Studio.

Python

Note that the section below requires R Studio 1.2 and an Anaconda installation of Python. You might also encounter the issue described here, which can be fixed by adding the appropriate path to the environment variable.

library(reticulate)
use_condaenv("r-reticulate")

# The chunk below is Python, run via reticulate
import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/Users/learningmachine/AppData/Local/Continuum/anaconda3/Library/plugins/platforms'
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
x = np.arange(0.0,2,0.01)
y = np.sin(2* np.pi*x)
plt.plot(x,y)
plt.grid(True)
plt.show()

data = pd.DataFrame({'x':x,'y':y})

A pandas dataframe created in a Python chunk can also be used within R, via the py object.

library(ggplot2)
ggplot(py$data,aes(x,y))+ geom_line(col='blue')

Similarly, you can embed SQL or Stan code chunks in your R Markdown document.

Parsnip - A tidy model interface

library(parsnip)
library(tidymodels)
library(glmnet)
library(randomForest)

Parsnip is a tidy counterpart to the popular caret package and is being developed by Max Kuhn (who also developed caret). This package creates a unified interface to models, organizes them by model type (e.g. logistic regression, splines, etc.) and generalizes how to fit them.

First create a model specification

reg_model <- linear_reg(penalty = 0.01)
reg_model
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.01

The computation engine can be one of lm, glmnet, stan, spark or keras. Parsnip can translate this general syntax into the engine-specific arguments.

reg_model %>%
  set_engine("glmnet")%>%
    translate()
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.01
## 
## Computational engine: glmnet 
## 
## Model fit template:
## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
##     lambda = 0.01, family = "gaussian")

Note that data is not required for the model specification. Further, although glmnet requires the predictors and target to be provided as x, y matrices, parsnip allows you to use a formula.

reg_model %>%
  set_engine("glmnet")%>%
    fit(mpg ~ .,data = mtcars)
## parsnip model object
## 
## 
## Call:  glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian",      lambda = ~0.01) 
## 
##      Df   %Dev Lambda
## [1,] 10 0.8687   0.01

The whole sequence of steps from specification to prediction can be done as follows

linear_reg(penalty =0.01) %>%
  set_engine("glmnet") %>%
    fit(mpg~.,data = mtcars %>% slice(1:29)) %>% # train set
      predict(new_data = mtcars %>% slice(30:32)) #test set
## # A tibble: 3 x 1
##   .pred
##   <dbl>
## 1  17.9
## 2  10.9
## 3  25.6

If we want to fit the model using multiple penalty values:

preds <- 
  linear_reg() %>%
    set_engine("glmnet")%>%
      fit(mpg ~. , data = mtcars %>% slice(1:29))%>%
        multi_predict(new_data = mtcars %>% slice(30:32))


print(preds)
## # A tibble: 3 x 1
##   .pred            
##   <list>           
## 1 <tibble [80 x 2]>
## 2 <tibble [80 x 2]>
## 3 <tibble [80 x 2]>

This produces a list-column of dataframes, where each element contains predictions for all the penalty values considered. To obtain the first 5 predictions for the second data point:

preds %>% pull(.pred) %>% pluck(2) %>% slice(1:5)
## # A tibble: 5 x 2
##   penalty .pred
##     <dbl> <dbl>
## 1 0.00346  10.9
## 2 0.00380  10.9
## 3 0.00417  10.9
## 4 0.00457  10.9
## 5 0.00502  10.9

For models like random forests, where data-dependent parameters like mtry need to be specified, the model can be specified using a data descriptor such as .preds():

mod <- rand_forest(trees= 1000, mtry =floor(.preds() * 0.75)) %>%
        set_engine("randomForest")

mod %>% translate()
## Random Forest Model Specification (unknown)
## 
## Main Arguments:
##   mtry = floor(.preds() * 0.75)
##   trees = 1000
## 
## Computational engine: randomForest 
## 
## Model fit template:
## randomForest::randomForest(x = missing_arg(), y = missing_arg(), 
##     mtry = floor(.preds() * 0.75), ntree = 1000)

Fit the model.

mod %>% fit(mpg ~ ., data = mtcars)
## parsnip model object
## 
## 
## Call:
##  randomForest(x = as.data.frame(x), y = y, ntree = ~1000, mtry = ~floor(.preds() *      0.75)) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 7
## 
##           Mean of squared residuals: 5.352952
##                     % Var explained: 84.79

As you can see, the number of predictors tried at each split is 75% of the number of predictors in the data.

floor((ncol(mtcars) - 1) *0.75)
## [1] 7

Time Series in the Tidyverse

Tidy and transform the data using the tsibble package, and perform modelling/forecasting using fable. A tsibble is a time series version of a tibble/dataframe.

library(tidyr)
library(tsibble)
library(lubridate)

We can use the EuStockMarkets data available in base R, which is a ts object.

data("EuStockMarkets")

as_tsibble can easily convert this into a tsibble in long format. The index represents time, and the key identifies the variable that defines each series. Here each stock index has a unique time stamp.

eu_ts <- as_tsibble(EuStockMarkets)
print(eu_ts)
## # A tsibble: 7,440 x 3 [1s] <UTC>
## # Key:       key [4]
##    index               key   value
##    <dttm>              <chr> <dbl>
##  1 1991-07-01 02:18:28 DAX   1629.
##  2 1991-07-02 12:00:00 DAX   1614.
##  3 1991-07-03 21:41:32 DAX   1607.
##  4 1991-07-05 07:23:05 DAX   1621.
##  5 1991-07-06 17:04:37 DAX   1618.
##  6 1991-07-08 02:46:09 DAX   1611.
##  7 1991-07-09 12:27:42 DAX   1631.
##  8 1991-07-10 22:09:14 DAX   1640.
##  9 1991-07-12 07:50:46 DAX   1635.
## 10 1991-07-13 17:32:18 DAX   1646.
## # ... with 7,430 more rows

Since the index is in seconds, the tsibble expects an entry for each second, so the index should be converted to a daily one.

eu_ts2 <-  eu_ts %>%
            mutate(index = as.Date(index))
print(eu_ts2)
## # A tsibble: 7,440 x 3 [1s]
## # Key:       key [4]
##    index      key   value
##    <date>     <chr> <dbl>
##  1 1991-07-01 DAX   1629.
##  2 1991-07-02 DAX   1614.
##  3 1991-07-03 DAX   1607.
##  4 1991-07-05 DAX   1621.
##  5 1991-07-06 DAX   1618.
##  6 1991-07-08 DAX   1611.
##  7 1991-07-09 DAX   1631.
##  8 1991-07-10 DAX   1640.
##  9 1991-07-12 DAX   1635.
## 10 1991-07-13 DAX   1646.
## # ... with 7,430 more rows

One can also calculate the weekly or monthly average price as follows.

eu_ts %>%
    index_by(Month = floor_date(index,'month'))%>%
        group_by(key)%>%
            summarize(Price =  mean(value))%>% head()
## # A tsibble: 6 x 3 [?] <UTC>
## # Key:       key [1]
##   key   Month               Price
##   <chr> <dttm>              <dbl>
## 1 CAC   1991-07-01 00:00:00 1753.
## 2 CAC   1991-08-01 00:00:00 1800.
## 3 CAC   1991-09-01 00:00:00 1871.
## 4 CAC   1991-10-01 00:00:00 1851.
## 5 CAC   1991-11-01 00:00:00 1814.
## 6 CAC   1991-12-01 00:00:00 1693.

We can also check if there are gaps in the time series as follows.

has_gaps(eu_ts2)
## # A tibble: 4 x 2
##   key   .gaps
##   <chr> <lgl>
## 1 CAC   TRUE 
## 2 DAX   TRUE 
## 3 FTSE  TRUE 
## 4 SMI   TRUE
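
Implicit gaps can be made explicit with fill_gaps(), which inserts the missing rows (a sketch; the zero fill value is just for illustration):

eu_ts2 %>% fill_gaps()            #Insert NA rows for the missing days
eu_ts2 %>% fill_gaps(value = 0)   #Or supply an explicit fill value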

You can get a 30-day rolling average of the prices as follows. Note that the first 29 observations of the rolling mean will be NA.

eu_ts3 <-  eu_ts2 %>%
            group_by(key) %>%
              mutate(avg_30day = slide_dbl(value,~mean(.,na.rm = TRUE),.size=30))

print(eu_ts3 %>% filter_index("1991-08-01"~"1991-09-01") )
## # A tsibble: 88 x 4 [1D]
## # Key:       key [4]
## # Groups:    key [4]
##    index      key   value avg_30day
##    <date>     <chr> <dbl>     <dbl>
##  1 1991-08-02 DAX   1620.       NA 
##  2 1991-08-03 DAX   1620.       NA 
##  3 1991-08-05 DAX   1623.       NA 
##  4 1991-08-06 DAX   1614.       NA 
##  5 1991-08-08 DAX   1632.       NA 
##  6 1991-08-09 DAX   1630.       NA 
##  7 1991-08-10 DAX   1633.     1624.
##  8 1991-08-12 DAX   1627.     1624.
##  9 1991-08-13 DAX   1650.     1625.
## 10 1991-08-15 DAX   1650.     1627.
## # ... with 78 more rows

Tsibble works well with ggplot2.

library(ggplot2)
ggplot(eu_ts2,aes(x=index,y=value)) + geom_line()+
  facet_wrap(.~key,ncol=1)

Forecasting can be done using the fable package, which is a tidy implementation of the classic forecast package by Rob Hyndman. The package is being developed by Hyndman's student Earo Wang and is currently not available on CRAN.
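
A minimal sketch of what a fable workflow looks like (this follows the released fable interface, which may differ in detail from the development version demoed at the conference):

library(fable)
eu_ts2 %>%
  model(arima = ARIMA(value)) %>%   #Fit one ARIMA model per key
  forecast(h = 30)                  #Forecast 30 steps ahead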

Model representation using broom

Use tidy() to summarize information about models, glance() to report goodness-of-fit metrics, and augment() to add information about individual observations.

All of these produce their output as a simple tibble.

The code below shows what we typically get when building a model.

attach(mtcars)
library(broom)
ols_fit <- lm(hp~ mpg + cyl, mtcars)
summary(ols_fit)
## 
## Call:
## lm(formula = hp ~ mpg + cyl, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -53.72 -22.18 -10.13  14.47 130.73 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   54.067     86.093   0.628  0.53492   
## mpg           -2.775      2.177  -1.275  0.21253   
## cyl           23.979      7.346   3.264  0.00281 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.22 on 29 degrees of freedom
## Multiple R-squared:  0.7093,	Adjusted R-squared:  0.6892 
## F-statistic: 35.37 on 2 and 29 DF,  p-value: 1.663e-08

Using tidy from the broom package we get:

tidy(ols_fit)
## # A tibble: 3 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    54.1      86.1      0.628 0.535  
## 2 mpg            -2.77      2.18    -1.27  0.213  
## 3 cyl            24.0       7.35     3.26  0.00281

Using glance

glance(ols_fit)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.709         0.689  38.2      35.4 1.66e-8     3  -160.  329.  335.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>

Using augment gives information about each observation in the training data.

augment(ols_fit)
## # A tibble: 32 x 11
##    .rownames    hp   mpg   cyl .fitted .se.fit .resid   .hat .sigma .cooksd
##    <chr>     <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
##  1 Mazda RX4   110  21       6   140.     6.84 -29.7  0.0320   38.5 0.00687
##  2 Mazda RX~   110  21       6   140.     6.84 -29.7  0.0320   38.5 0.00687
##  3 Datsun 7~    93  22.8     4    86.7   13.3    6.28 0.121    38.9 0.00141
##  4 Hornet 4~   110  21.4     6   139.     7.00 -28.6  0.0335   38.5 0.00668
##  5 Hornet S~   175  18.7     8   194.    12.8  -19.0  0.112    38.7 0.0117 
##  6 Valiant     105  18.1     6   148.     8.75 -42.7  0.0524   38.0 0.0243 
##  7 Duster 3~   245  14.3     8   206.     9.79  38.8  0.0656   38.2 0.0258 
##  8 Merc 240D    62  24.4     4    82.3   11.6  -20.3  0.0924   38.7 0.0105 
##  9 Merc 230     95  22.8     4    86.7   13.3    8.28 0.121    38.9 0.00246
## 10 Merc 280    123  19.2     6   145.     7.47 -21.7  0.0382   38.7 0.00443
## # ... with 22 more rows, and 1 more variable: .std.resid <dbl>

To compare multiple models on goodness of fit:

library(purrr)

fits <- list( fit1 = lm(hp~cyl,mtcars),
              fit2 = lm(hp ~ cyl +mpg, mtcars),
              fit3 = lm(hp~. , mtcars))

map_df(fits,glance,.id = "model") %>% arrange(AIC)
## # A tibble: 3 x 12
##   model r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##   <chr>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1 fit3      0.903         0.857  26.0      19.5 1.90e-8    11  -143.  310.  327.
## 2 fit1      0.693         0.683  38.6      67.7 3.48e-9     2  -161.  329.  333.
## 3 fit2      0.709         0.689  38.2      35.4 1.66e-8     3  -160.  329.  335.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>

A more realistic application of broom is bootstrapping the parameter estimates of a model.

Consider the model $$ mpg = \frac{k}{wt} + b + \epsilon, \quad \epsilon \sim \mathrm{Normal}(0,\sigma^2) $$

To obtain the bootstrapped estimates, you can do the following.

library(rsample)
boots <- bootstraps(mtcars,times=100)

Define a function to fit non-linear least squares on each bootstrap sample.

fit_nls_on_bootstrap <- function(split) {
  nls(
    mpg ~ k/wt + b,
    analysis(split),
    start = list(k=1,b=0)
  )
}

Fit the model to each bootstrap sample as follows:

boot_fits <- boots %>%
              mutate(fit = map(splits,fit_nls_on_bootstrap),
                     coef_info = map(fit,tidy))

head(boot_fits)
## # A tibble: 6 x 4
##   splits          id           fit       coef_info       
## * <list>          <chr>        <list>    <list>          
## 1 <split [32/13]> Bootstrap001 <S3: nls> <tibble [2 x 5]>
## 2 <split [32/12]> Bootstrap002 <S3: nls> <tibble [2 x 5]>
## 3 <split [32/11]> Bootstrap003 <S3: nls> <tibble [2 x 5]>
## 4 <split [32/13]> Bootstrap004 <S3: nls> <tibble [2 x 5]>
## 5 <split [32/11]> Bootstrap005 <S3: nls> <tibble [2 x 5]>
## 6 <split [32/11]> Bootstrap006 <S3: nls> <tibble [2 x 5]>

Get bootstrapped coefficients

library(tidyr)

boot_coefs <- boot_fits %>%
                unnest(coef_info)

head(boot_coefs)
## # A tibble: 6 x 6
##   id           term  estimate std.error statistic  p.value
##   <chr>        <chr>    <dbl>     <dbl>     <dbl>    <dbl>
## 1 Bootstrap001 k        41.2       3.38     12.2  3.86e-13
## 2 Bootstrap001 b         5.56      1.23      4.53 8.86e- 5
## 3 Bootstrap002 k        38.8       2.82     13.8  1.66e-14
## 4 Bootstrap002 b         6.52      1.03      6.36 5.15e- 7
## 5 Bootstrap003 k        51.7       4.80     10.8  8.12e-12
## 6 Bootstrap003 b         2.61      1.70      1.53 1.36e- 1

Plot bootstrapped estimates as follows

p <- ggplot(boot_coefs,aes(estimate))+
      geom_histogram(binwidth = 2) + facet_wrap(~ term,scales='free')+
        labs(title = "Sampling distribution of k and b",
             y = "Count",
             x = "Value" )



p

We can also plot the bootstrapped predictions as follows

boot_aug <- boot_fits %>%
              mutate(augmented = map(fit,augment))%>%
                unnest(augmented)


ggplot(boot_aug,aes(wt,mpg))+
  geom_point()+
    geom_line(aes(y = .fitted,group =id),alpha =0.2)

gganimate

A package for animated graphics which extends ggplot2. Additional dependencies will have to be installed.

library(gapminder)
library(gganimate)


ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, colour = country)) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  scale_colour_manual(values = country_colors) +
  scale_size(range = c(2, 12)) +
  scale_x_log10() +
  # Here come the gganimate-specific bits
  labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'life expectancy') +
  transition_time(year) +
  ease_aes('linear')
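
The rendered animation can be written to disk with anim_save(), which by default saves the most recently rendered animation (a minimal sketch):

# anim_save("gapminder.gif")   #Saves the last rendered animation as a gif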

Hypothetical Outcome Plots

Hypothetical outcome plots provide an alternative way to visualize uncertainty by animating a finite set of individual draws rather than producing a static representation of distributions.

Load required packages.

library(rsample)
library(purrr)
library(ggplot2)
library(gganimate)
library(transformr)

The plot below uses bootstrapped samples to visualize uncertainty in the trend line.

attach(mtcars)


#Draw bootstrap samples

mtcars_bs <- bootstraps(mtcars,times=20)

mtcars_bs_df <- map_df(mtcars_bs$splits, function(x){
                        return(as.data.frame(x))
                        })

mtcars_bs_df$id <- rep(c(1:20),each=nrow(mtcars))


mtcars %>% 
  ggplot(aes(disp,mpg))+
    geom_point()+ 
      geom_smooth(
        data = mtcars_bs_df,
        aes(group =id),
        se = FALSE,
        alpha = 0.6
      )

The same can be represented using a hypothetical outcome plot as follows

mtcars_bs_df %>%
  ggplot(aes(disp,mpg))+
    geom_smooth(aes(group =id),se=FALSE,alpha=0.6)+
      geom_point(data=mtcars)+
        transition_states(id)

Datapasta

If you want to quickly copy some sample data from Excel or SQL into R, you often have to convert it into a csv or Excel file and then import it, or type it in manually. Datapasta is an R Studio add-in that provides a much easier alternative.

After installing the 'datapasta' package, simply copy the table from an Excel file, go to Addins and select the appropriate option.

Here, I simply copied a table from Excel using 'Ctrl+C' and then selected 'Paste as data.frame' from the Addins dropdown.

data.frame(stringsAsFactors=FALSE,
        Name = c("A", "B", "C", "D", "E"),
       Marks = c(87L, 35L, 76L, 34L, 86L)
)
##   Name Marks
## 1    A    87
## 2    B    35
## 3    C    76
## 4    D    34
## 5    E    86
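
The same paste operations are also exposed as plain functions that can be called from the console (a sketch; see the datapasta documentation for the full set):

# datapasta::df_paste()      #Paste clipboard contents as a data.frame definition
# datapasta::tribble_paste() #Paste clipboard contents as a tribble definition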

vctrs

This is another package from Hadley Wickham designed to make sure that outputs from functions are predictable.

Sometimes, R does not behave predictably. For instance when combining two factors..

c(factor("a"),factor("b"))
## [1] 1 1

Also..

today <- as.Date('2019-02-23')
tomorrow <- as.POSIXct("2019-02-23 12:00",tz = "America/Chicago")

c(today,tomorrow)
## [1] "2019-02-23"    "4248312-08-21"

Also..

c(NULL,tomorrow)
## [1] 1550944800

The vctrs package formalizes the relationships between various datatypes and requires the user to be more explicit when combining different datatypes.

To summarize the coercion rules: Logical -> Integer -> Double; Date -> Date-time; Factor -> Character; Ordered -> Character.

library(vctrs)
## 
## Attaching package: 'vctrs'
## The following object is masked from 'package:lubridate':
## 
##     new_duration
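
For example, combining a logical, an integer and a double follows the Logical -> Integer -> Double chain:

vec_c(TRUE, 1L, 2.5)
## [1] 1.0 1.0 2.5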

When coercing to character, it is stricter than base R. The commented vec_c call below would throw an error.

c(1.5,"a")
## [1] "1.5" "a"
#vec_c(1.5,"a")

Instead you have to be explicit..

vec_c(1.5,"x",.ptype = character())
## [1] "1.5" "x"

Unlike in base R, the following works..

vec_c(factor("a"),factor("b"))
## [1] a b
## Levels: a b

Note that the order affects the levels..

vec_c(factor("b"),factor("a"))
## [1] b a
## Levels: b a

Similarly when concatenating a factor and an ordered factor…

c(factor("a"),ordered("b"))
## [1] 1 1

The first call below would throw an error..

#vec_c(factor("a"),ordered("b"))
vec_c(factor("a"),ordered("b"),.ptype = character())
## [1] "a" "b"

The date problem does not occur here..

vec_c(today,tomorrow)
## [1] "2019-02-23 00:00:00 CST" "2019-02-23 12:00:00 CST"

Disparate datatypes can be combined into a list

vec_c(today,factor("a"),.ptype=list())
## [[1]]
## [1] "2019-02-23"
## 
## [[2]]
## [1] a
## Levels: a

Tidy Evaluation

A tool for metaprogramming. This becomes important when you do not know the names of objects in advance and you want to build a tidyverse pipeline that operates on them. This is often the case when you have to operate on user inputs, say in a shiny application where you have to make indirect references to column names stored in variables or passed as function arguments.

It is tidy evaluation that allows you to use unquoted column names, as in the example below.

attach(starwars)

starwars %>%
  filter(homeworld == "Tatooine")%>%
    arrange(height)%>%
      select(name,ends_with("color"))
## # A tibble: 10 x 4
##    name               hair_color  skin_color eye_color
##    <chr>              <chr>       <chr>      <chr>    
##  1 R5-D4              <NA>        white, red red      
##  2 Shmi Skywalker     black       fair       brown    
##  3 Beru Whitesun lars brown       light      blue     
##  4 C-3PO              <NA>        gold       yellow   
##  5 Luke Skywalker     blond       fair       blue     
##  6 Owen Lars          brown, grey light      blue     
##  7 Biggs Darklighter  black       light      brown    
##  8 Cliegg Lars        brown       fair       blue     
##  9 Anakin Skywalker   blond       fair       blue     
## 10 Darth Vader        none        white      yellow

To evaluate an expression, the interpreter searches the environment for name-value bindings. In tidy eval, either the expression or the chain of searched environments is modified. This is what allows you to use unquoted variable names and work inside a dataframe. This is also referred to as non-standard evaluation.
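
As a tiny illustration of the quote-then-evaluate idea (a sketch using rlang directly):

library(rlang)
e <- expr(mpg * 2)            #Capture the expression without evaluating it
eval_tidy(e, data = mtcars)   #Evaluate it with mtcars as the data mask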

Tidy eval can be implemented in the following ways..

  1. Using '...'

attach(mtcars)

measure_hp <- function(df,...){
  df %>%
    group_by(...)%>%
      summarize(avg_hp = mean(hp,na.rm=TRUE))}

measure_hp(mtcars,gear)
## # A tibble: 3 x 2
##    gear avg_hp
##   <dbl>  <dbl>
## 1     3  176. 
## 2     4   89.5
## 3     5  196.
  2. Using enquo and !! (called the 'bang-bang' operator)

measure<- function(df,group_var,summary_var){
  group_var <- enquo(group_var) #Effectively quotes the variable
  summary_var <- enquo(summary_var)
  
  df %>%
    group_by(!!group_var)%>%
      summarize(avg_hp = mean(!!summary_var,na.rm=TRUE))}

measure(mtcars,gear,hp)
## # A tibble: 3 x 2
##    gear avg_hp
##   <dbl>  <dbl>
## 1     3  176. 
## 2     4   89.5
## 3     5  196.

If you need to manipulate user input that is passed as character strings, you can do the following..

library(rlang)
group_var <- 'gear'
summary_var <- 'hp'

measure(mtcars,!!rlang::sym(group_var),!!rlang::sym(summary_var))
## # A tibble: 3 x 2
##    gear avg_hp
##   <dbl>  <dbl>
## 1     3  176. 
## 2     4   89.5
## 3     5  196.

Action and Selection Verbs

dplyr verbs come in two flavors - action and selection. select is a selection verb, while verbs like mutate, group_by and transmute are action verbs.

Action verbs create new vectors and modify the dataframe. Selection verbs look up the positions of columns in the dataframe and re-organize it.

Selection verbs are context-aware and know about the current variables.

starwars %>% select(c(1:height))
## # A tibble: 87 x 2
##    name               height
##    <chr>               <int>
##  1 Luke Skywalker        172
##  2 C-3PO                 167
##  3 R2-D2                  96
##  4 Darth Vader           202
##  5 Leia Organa           150
##  6 Owen Lars             178
##  7 Beru Whitesun lars    165
##  8 R5-D4                  97
##  9 Biggs Darklighter     183
## 10 Obi-Wan Kenobi        182
## # ... with 77 more rows
starwars %>% select(1:height)
## # A tibble: 87 x 2
##    name               height
##    <chr>               <int>
##  1 Luke Skywalker        172
##  2 C-3PO                 167
##  3 R2-D2                  96
##  4 Darth Vader           202
##  5 Leia Organa           150
##  6 Owen Lars             178
##  7 Beru Whitesun lars    165
##  8 R5-D4                  97
##  9 Biggs Darklighter     183
## 10 Obi-Wan Kenobi        182
## # ... with 77 more rows
starwars %>% select(-1,-height)
## # A tibble: 87 x 11
##     mass hair_color skin_color eye_color birth_year gender homeworld species
##    <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>  
##  1    77 blond      fair       blue            19   male   Tatooine  Human  
##  2    75 <NA>       gold       yellow         112   <NA>   Tatooine  Droid  
##  3    32 <NA>       white, bl~ red             33   <NA>   Naboo     Droid  
##  4   136 none       white      yellow          41.9 male   Tatooine  Human  
##  5    49 brown      light      brown           19   female Alderaan  Human  
##  6   120 brown, gr~ light      blue            52   male   Tatooine  Human  
##  7    75 brown      light      blue            47   female Tatooine  Human  
##  8    32 <NA>       white, red red             NA   <NA>   Tatooine  Droid  
##  9    84 black      light      brown           24   male   Tatooine  Human  
## 10    77 auburn, w~ fair       blue-gray       57   male   Stewjon   Human  
## # ... with 77 more rows, and 3 more variables: films <list>, vehicles <list>,
## #   starships <list>

You can also use selection helpers as follows..

starwars %>% select(ends_with("color")) %>% head()
## # A tibble: 6 x 3
##   hair_color  skin_color  eye_color
##   <chr>       <chr>       <chr>    
## 1 blond       fair        blue     
## 2 <NA>        gold        yellow   
## 3 <NA>        white, blue red      
## 4 none        white       yellow   
## 5 brown       light       brown    
## 6 brown, grey light       blue
starwars %>% select(dplyr::matches("^[nm]a")) %>% head()
## # A tibble: 6 x 2
##   name            mass
##   <chr>          <dbl>
## 1 Luke Skywalker    77
## 2 C-3PO             75
## 3 R2-D2             32
## 4 Darth Vader      136
## 5 Leia Organa       49
## 6 Owen Lars        120
starwars %>% select(10,everything()) %>% head()
## # A tibble: 6 x 13
##   species name  height  mass hair_color skin_color eye_color birth_year gender
##   <chr>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
## 1 Human   Luke~    172    77 blond      fair       blue            19   male  
## 2 Droid   C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
## 3 Droid   R2-D2     96    32 <NA>       white, bl~ red             33   <NA>  
## 4 Human   Dart~    202   136 none       white      yellow          41.9 male  
## 5 Human   Leia~    150    49 brown      light      brown           19   female
## 6 Human   Owen~    178   120 brown, gr~ light      blue            52   male  
## # ... with 4 more variables: homeworld <chr>, films <list>, vehicles <list>,
## #   starships <list>

group_by is an action verb, so it does not work with selection helpers. The commented code below would not work.

#starwars %>% group_by(ends_with("color"))

Instead you can use..

starwars %>% group_by_at(vars(ends_with("color"))) %>% summarise(mean_height=mean(height))
## # A tibble: 67 x 4
## # Groups:   hair_color, skin_color [50]
##    hair_color    skin_color       eye_color mean_height
##    <chr>         <chr>            <chr>           <dbl>
##  1 <NA>          gold             yellow            167
##  2 <NA>          green            black             173
##  3 <NA>          green-tan, brown orange            175
##  4 <NA>          white, blue      red                96
##  5 <NA>          white, red       red                97
##  6 auburn        fair             blue              150
##  7 auburn, grey  fair             blue              180
##  8 auburn, white fair             blue-gray         182
##  9 black         blue, grey       yellow            137
## 10 black         brown            brown             171
## # ... with 57 more rows

profvis

profvis can be used to profile your code and identify the parts that slow down the program so that you can take corrective action.

library(profvis)
#profvis::profvis(mtcars %>% lm(mpg ~ wt +cyl,.))
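
A minimal sketch of how to invoke it; wrap the code to be profiled in a block (the repeated model fit below is just an assumed workload so the profile has something to show):

# profvis({
#   for (i in 1:200) fit <- lm(mpg ~ wt + cyl, data = mtcars)
# })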

Using Spark to scale out your jobs

Spark lets you scale out by distributing computations across multiple machines. The sparklyr package allows you to leverage the power of Spark through R.

Although we connect to Spark running on the local machine below, we can just as easily connect to a cluster from a cloud provider. Make sure Java is installed in a directory whose path has no spaces or special characters. This link may also be useful.

library(DBI)
library(dplyr)
library(sparklyr)                                   #R Interface to Apache Spark
## 
## Attaching package: 'sparklyr'
## The following object is masked from 'package:rlang':
## 
##     invoke
## The following object is masked from 'package:purrr':
## 
##     invoke
#spark_install()                                    #Install Apache Spark

#Add path to java to environment variables
Sys.setenv(JAVA_HOME = 'C:/Users/learningmachine/Java/jre1.8.0_201')


sc <- spark_connect(master = 'local')               #Connect to local cluster
##Make sure you have the mtcars.csv file in your working directory
cars_tbl <- spark_read_csv(sc,"cars","mtcars.csv")     #Read data into spark

summarize(cars_tbl, n =n())                         #Count records with dplyr
## # Source: spark<?> [?? x 1]
##       n
##   <dbl>
## 1    32
dbGetQuery(sc,"SELECT count(*) FROM cars")          # Count records with DBI
##   count(1)
## 1       32

Perform linear regression

ml_linear_regression(cars_tbl,mpg~wt+cyl)
## Formula: mpg ~ wt + cyl
## 
## Coefficients:
## (Intercept)          wt         cyl 
##   39.686261   -3.190972   -1.507795

You can also define a pipeline, run it on a Spark cluster, and save it to disk.

pipeline <- ml_pipeline(sc) %>%                #Define Spark pipeline
              ft_r_formula(mpg~wt + cyl)%>%    #Add formula translation
                ml_linear_regression()         #Add model to pipeline

fitted <- ml_fit(pipeline,cars_tbl)
fitted
## PipelineModel (Transformer) with 2 stages
## <pipeline_d2021587785> 
##   Stages 
##   |--1 RFormulaModel (Transformer)
##   |    <r_formula_d201fa95296> 
##   |     (Parameters -- Column Names)
##   |      features_col: features
##   |      label_col: label
##   |     (Transformer Info)
##   |      formula:  chr "mpg ~ wt + cyl" 
##   |--2 LinearRegressionModel (Transformer)
##   |    <linear_regression_d201098f3a> 
##   |     (Parameters -- Column Names)
##   |      features_col: features
##   |      label_col: label
##   |      prediction_col: prediction
##   |     (Transformer Info)
##   |      coefficients:  num [1:2] -3.19 -1.51 
##   |      intercept:  num 39.7 
##   |      num_features:  int 2 
##   |      scale:  num 1
#ml_save(fitted,"mtcars_model",overwrite = TRUE)

The saved Spark ML Pipeline object is serialized as Scala code, which means the pipeline can be added to scheduled Spark ML jobs without any dependency on R.
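
To reuse a saved pipeline later, it can be reloaded and applied to a Spark DataFrame (a sketch; this assumes the model was saved with the commented ml_save call above):

# reloaded <- ml_load(sc, "mtcars_model")    #Reload the saved pipeline model
# preds <- ml_transform(reloaded, cars_tbl)  #Apply it to a Spark DataFrame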

Shiny

Some Shiny resources worth exploring, included in the link shared at the beginning of this document, are:

  1. Modules: How to compartmentalize your shiny application into modules for easier maintenance and concurrent development

  2. Reactlog 2.0: A library built in JavaScript that allows for easier debugging by visualizing the reactive dependencies in the app. The go-to tool for reactivity debugging

Miscellaneous

Sunburst Charts

Pretty cool visualization for representing hierarchical data using concentric circles.

library(sunburstR)
sequences <- read.csv(
  system.file("examples/visit-sequences.csv",package="sunburstR")
  ,header=F
  ,stringsAsFactors = FALSE
)

sunburst(sequences)

rayshader

This is a really cool package. According to the documentation:

rayshader uses a combination of raytracing, spherical texture mapping, lambertian reflectance, and ambient occlusion to produce hillshades of elevation matrices. Includes water detection and layering functions, programmable color palette generation, several built-in textures, 2D and 3D plotting options, and the ability to export 3D maps to a 3D printable format.

Try running the code below to see some really cool 3d visualizations.

library(rayshader)

montereybay %>%
  sphere_shade(texture = "imhof2") %>%
  plot_3d(montereybay, zscale = 50, water = TRUE, watercolor = "imhof2",
          waterlinecolor = "white", waterlinealpha = 0.5)

Other useful/interesting resources

  1. Best practices for naming files: link

  2. mlflow for tracking ML experiments, sharing and deploying: link

  3. All the risk calculators at http://riskcalc.org/ were built by Cleveland Clinic using Shiny
