rstanmulticore: A cross-platform R package to automatically run RStan MCMC chains in parallel

*** This work has been supported by a grant from the Spencer Foundation (#201400002). The views expressed are those of the author and do not necessarily reflect those of the Spencer Foundation. ***

It seems that the heir to WinBUGS is Stan. With Stan, reasonably complex Bayesian models can be expressed in a compact way and easily estimated. It is good software, and it is under active development to further improve it.

I have a small quibble about RStan, the R interface to Stan. RStan would be much improved if its default behavior was to run one MCMC chain per core. For software that prides itself on speed - Stan goes to the trouble translating the Stan modeling specification into a stand-alone C++ program for execution - it seems a little odd that the extra cores present on nearly all modern machines would not be put to use by default.

Currently, running chains in parallel is possible, but only with platform dependent boilerplate code. For example, the RStan Quick Start Guide gives an mclapply example for Mac and Linux users and a parLapply example for Windows users. The boilerplate nature of the code makes it cumbersome to fit models several times, and the platform dependent nature of the examples makes it difficult to share code between platforms.

To address this issue, I have implemented the boilerplate code from the Quick Start Guide in a cross-platform R package: rstanmulticore. The syntax is easy. Simply replace a call to stan

fit.serial   <- stan( model_code = schools_code, data = schools_dat, 
                      iter = 1000, chains = 4)

with a call to pstan

fit.parallel <- pstan(model_code = schools_code, data = schools_dat, 
                      iter = 1000, chains = 4)

The pstan version will compile the model, distribute the compiled models to separate cores for a parallel run, and then recombine the results as if the code had execute serially. Since I used parLapply to distribute the work to multiple cores, pstan will run on Windows, Linux, and Mac.

At least as far as I have pushed pstan, it works well for me. Your needs may differ. I would appreciate feedback and suggestions on how to improve it. You can access it via GitHub here. Installation instructions and a brief usage example are below.

Installation from GitHub

Step 0.A : If you do not already have rstan installed, install it using the instructions here.

Step 0.B: If you do not already have devtools installed, install it using the instructions here.

Step 1: Install rstanmulticore directly from my GitHub repository using install_github('nathanvan/rstanmulticore').

> library(devtools)
> install_github('nathanvan/rstanmulticore')
Downloading github repo nathanvan/rstanmulticore@master
Installing rstanmulticore
"C:/PROGRA~1/R/R-31~1.3/bin/x64/R" --vanilla CMD INSTALL  \
  "C:/Users/vanhoudnos-nathan/AppData/Local/Temp/RtmpQBcRKa/devtools924351029d0/nathanvan-rstanmulticore-c7f9d4e"  \
  --library="C:/Users/vanhoudnos-nathan/Documents/R/win-library/3.1" --install-tests 

* installing *source* package 'rstanmulticore' ...
** R
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (rstanmulticore)

A usage example

We begin with the default "Eight Schools" example from the Quick Start Guide using the default stan function:

library(rstan) 
## Loading required package: Rcpp
## Loading required package: inline
## 
## Attaching package: 'inline'
## 
## The following object is masked from 'package:Rcpp':
## 
##     registerPlugin
## 
## rstan (Version 2.6.0, packaged: 2015-02-06 21:02:34 UTC, GitRev: 198082f07a60)

## The data to analyze (Yes, it is very little!)
schools_dat <- list(
  J = 8, y = c(28,  8, -3,  7, -1,  1, 18, 12),
  sigma = c(15, 10, 16, 11,  9, 11, 10, 18))

## The Stan model for the data, stored as a string
schools_code <- 'data {
  int J; // number of schools 
  real y[J]; // estimated treatment effects
  real sigma[J]; // s.e. of effect estimates 
}
parameters {
  real mu; 
  real tau;
  real eta[J];
}
transformed parameters {
  real theta[J];
  for (j in 1:J)
    theta[j] <- mu + tau * eta[j];
}
model {
  eta ~ normal(0, 1);
  y ~ normal(theta, sigma);
}'
      
## The data to analyze (Yes, it is very little!)
schools_dat <- list(
  J = 8, y = c(28,  8, -3,  7, -1,  1, 18, 12),
  sigma = c(15, 10, 16, 11,  9, 11, 10, 18))

## The Stan model for the data, stored as a string
schools_code <- 'data {
  int J; // number of schools 
  real y[J]; // estimated treatment effects
  real sigma[J]; // s.e. of effect estimates 
}
parameters {
  real mu; 
  real tau;
  real eta[J];
}
transformed parameters {
  real theta[J];
  for (j in 1:J)
    theta[j] <- mu + tau * eta[j];
}
model {
  eta ~ normal(0, 1);
  y ~ normal(theta, sigma);
}'

## Estimating the model 
fit.serial   <- stan( model_code = schools_code, data = schools_dat, 
                      iter = 1000, chains = 4, seed = 1)
## 
## TRANSLATING MODEL 'schools_code' FROM Stan CODE TO C++ CODE NOW.
## COMPILING THE C++ CODE FOR MODEL 'schools_code' NOW.
## cygwin warning:
##   MS-DOS style path detected: C:/PROGRA~1/R/R-31~1.3/etc/x64/Makeconf
##   Preferred POSIX equivalent is: /cygdrive/c/PROGRA~1/R/R-31~1.3/etc/x64/Makeconf
##   CYGWIN environment variable option "nodosfilewarning" turns off this warning.
##   Consult the user's guide for more details about POSIX paths:
##     http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
## 
## SAMPLING FOR MODEL 'schools_code' NOW (CHAIN 1).
##
##  ... < snip > ...  
## 
## SAMPLING FOR MODEL 'schools_code' NOW (CHAIN 2).
## 
##  ... < snip > ... 
## 
## SAMPLING FOR MODEL 'schools_code' NOW (CHAIN 3).
## 
##  ... < snip > ... 
## 
## SAMPLING FOR MODEL 'schools_code' NOW (CHAIN 4).
## 
##  ... < snip > ... 

Note that stan is pretty verbose.

I chose to make pstan less verbose. By default, pstan reports sparse progress information to the R console and the more detailed information is redirected to a file, stan-debug-*, that is created in the current working directory. (If you wish to see the detailed info in real time, use tail -f in your shell.)

Usage is as follows:

library(rstanmulticore)
## Loading required package: parallel

fit.parallel <- pstan( model_code = schools_code, data = schools_dat, 
                       iter = 1000, chains = 4, seed = 1)
## *** Parallel Stan run ***
## Working directory:
##  C:/Users/vanhoudnos-nathan/workspace/norc/spencer-5866.01.62/software/tmp
##  + Compiling the Stan model.
##  + Attempting  4  chains on 4 cores.
##    ... Creating the cluster.
##    ... Log file: stan-debug.2015-05-01-12.38.21.txt
##    ... Loading rstan on all workers.
##    ... Exporting the fitted model and data to all workers.
##    ... Running parallel chains.
##    ... Finished!

If, in the unlikely case, you want no console output and no file redirection, you can pass pdebug = FALSE to pstan. See help(pstan) for details.

Note that, as promised, the output -- the actual samples drawn from the posterior -- of pstan is identical to that of stan

all.equal( fit.serial@sim$samples, fit.parallel@sim$samples )
## [1] TRUE

A request for help

As mentioned, rstanmulticore works well for my needs, but it may not work for you. If it does not work for you, please let me know and I'll do my best to accommodate you. Pull requests and additional test cases are most welcome!

19 thoughts on “rstanmulticore: A cross-platform R package to automatically run RStan MCMC chains in parallel

  1. Pingback: Distilled News | Data Analytics & R

  2. Bob Carpenter

    Thanks for sharing --- I like the way the parallel version mirrors the basic stan() function.

    Ben Goodrich has a generic cross-platform parallelism wrapper in the works for RStan itself in the near future.

    Reply
    1. nmv Post author

      Thanks for the comment. More importantly, thanks for all the work you do on Stan!

      I look forward to an implementation in RStan itself. I'd be happy to help if you (or Ben Goodrich) have need of me.

      Reply
  3. Anders Gorm Pedersen

    This is very useful, and just what I needed - thanks for making this available! I like the clean approach where a call to stan is simply replaced by a call to pstan!

    However, in that context I have a few minor problems concerning options to pstan:

    (1) It appears that pstan requires the model to be specified using the "model_code" option, and that it will not accept "file" or "fit"? (at least I could not get this to work). Assuming that I did not do something wrong: It would be awesome to have those options available also I think, so hereby requested!

    (2) For some of the option parameters it appears that they have to be given in the form of actual values in the call to pstan, while they are not accepted if given as variable names. This appears to be the case for "iter", "warmup", and "thin", but not for "data", "pars", and "chains".

    As an example, the following call runs just fine:

    stanfit = pstan(model_code=mymod, data=data1, pars=params, iter=1000)

    While the following does not run:

    nit <- 1000
    stanfit = pstan(model_code=mymod, data=data1, pars=params, iter=nit)

    The latter gives the following error message in the debug:

    rstan (Version 2.6.0, packaged: 2015-02-06 21:02:34 UTC, GitRev: 198082f07a60)
    Error in config_argss(chains = chains, iter = iter, warmup = warmup, thin = thin, :
    object 'nit' not found

    (I can provide full code if needed).

    Reply
    1. nmv Post author

      Thanks for the feedback. I have added the request for 'file' support and the bug fix for passing in variables to the GitHub issue tracker for the project: https://github.com/nathanvan/rstanmulticore/issues

      The fit argument, however, should be currently supported. Continuing the example in the post:

      fit.parallel.2 < - pstan( fit = fit.serial, data = schools_dat, 
                             iter = 1000, chains = 4, seed = 1)
      ## *** Parallel Stan run ***
      ## Working directory:
      ##  C:/Users/vanhoudnos-nathan/workspace/norc/spencer-5866.01.62/software/tmp
      ##  + Compiled Stan model supplied.
      ##  + Attempting  4  chains on 4 cores.
      ##    ... Creating the cluster.
      ##    ... Log file: stan-debug.2015-05-13-07.49.55.txt
      ##    ... Loading rstan on all workers.
      ##    ... Exporting the fitted model and data to all workers.
      ##    ... Running parallel chains.
      ##    ... Finished!
      

      Could you let me know what was not working with supplying the fit?

      Reply
      1. Anders Gorm Pedersen

        Excellent - thanks! Everything works for me now. Really a very nice addition to the workflow!

        The "fit" (compiled model) option also works for me now, so I suspect my previous problem must have been caused by something else...

        Reply
  4. Harry Hiemstra

    Thanks for making this program available.

    I have troubles in getting rstanmulticore working when providing different sets of initial values for the chains. I get the following error message:

    Error in combining the results of parallel workers:
    The following elements of 'sflist' do not contain samples: 1, 2, 3.

    Could you provide an example

    Reply
    1. nmv Post author

      I fixed it in version 0.2.0. You can update with devtools::install_github('nathanvan/rstanmulticore').

      As requested, I added a usage example to the (new!) vignette. You can access it with vignette('rstanmulticore-examples'). I have also reproduced it below:

      ## Set initial values
      chain1 < - list(mu = -100, tau = 1, eta = rep(100, 8))
      chain2 <- list(mu = 100, tau = 1, eta = rep(200, 8))
      chain3 <- list(mu = 1000, tau = 100, eta = rep(300.5, 8))
      chain4 <- list(mu = -1000, tau = 100, eta = rep(400, 8))
      
      ## Check that it works with stan
      fit.serial.init   <-  stan( fit = fit.serial, data = schools_dat,
                                  init = list( chain1, chain2, chain3, chain4),
                                  iter = 10000, chains = 4, seed = 1)
      
      ## Check that it works with pstan
      fit.parallel.init <- pstan( fit = fit.serial, data = schools_dat,
                                  init = list( chain1, chain2, chain3, chain4),
                                  iter = 10000, chains = 4, seed = 1)
      
      ## Check that the simulations are the same
      all.equal( fit.serial.init@sim$samples, fit.parallel.init@sim$samples )
      
      Reply
  5. Harry Hiemstra

    Yes, it work for me now. Many thanks for fixing this so quickly and for making rstanmulicore available.

    Reply
  6. Jake Johnson

    Suppose 3 of 4 chains have finished while one is completely stuck. Is it possible to terminate pstan() early and still preserve those 3 chains?

    Reply
    1. nmv Post author

      I do not believe so.

      If you kill the stuck process manually (via `kill -9` or the Windows Task Manager), I believe that parLapply will throw an error instead of attempting to return the results of the other three processes. You can give it a try though.

      You do, however, raise a difficult issue that I have not thought about: how to checkpoint partial execution of the parallel job. I will have to look into this for the future.

      Reply
      1. Jake Johnson

        Thanks for the quick reply. I should have mentioned at the outset that this is a really great package and has worked flawlessly on both my PC and Mac.

        Reply

Leave a Reply