1

I have a data set with many missing observations and I used the Amelia package to create imputed data sets. I'd like to know if it's possible to run the same model in parallel with a different data set per chain and combine the results into a single Stan object.

# Load packages
library(Amelia)
library(rstan)

# Load built-in data
data(freetrade)

# Create 2 imputed data sets (polity is an ordinal variable)
df.imp <- amelia(freetrade, m = 2, ords = "polity")

# Check the first data set
head(df.imp$imputations[[1]])

# Run the model in Stan
code <- '
    data {
int<lower=0> N;          
vector[N] tariff;     
vector[N] polity;      
}
    parameters {
real b0;                  
real b1;           
real<lower=0> sigma;       
}
    model {
b0 ~ normal(0,100);  
b1 ~ normal(0,100); 
tariff ~ normal(b0 + b1 * polity, sigma);    
}
'

# Create a list from the first and second data sets
df1 <- list(N = nrow(df.imp$imputations[[1]]),
            tariff = df.imp$imputations[[1]]$tariff,
            polity = df.imp$imputations[[1]]$polity)

df2 <- list(N = nrow(df.imp$imputations[[2]]),
            tariff = df.imp$imputations[[2]]$tariff,
            polity = df.imp$imputations[[2]]$polity)

# Run the model
m1 <- stan(model_code = code, data = df1, chains = 1, iter = 1000) 

My question is how to run the last line of code on both data sets at the same time, running 2 chains and combining the output with the same stan() function. Any suggestions?

danilofreire
  • 503
  • 1
  • 5
  • 18

2 Answers2

5

You can run the models separately, and then combine them using sflist2stanfit().

E.g.

seed <- 12345
s1 <- stan_model(model_code = code) # compile the model

m1 <- sampling(object = s1, data = df1, chains = 1,
               seed = seed, chain_id = 1, iter = 1000) 
m2 <- sampling(object = s1, data = df2, chains = 1,
               seed = seed, chain_id = 2, iter = 1000)

f12 <- sflist2stanfit(list(m1, m2))
danilofreire
  • 503
  • 1
  • 5
  • 18
tiffany
  • 503
  • 2
  • 9
  • Thanks a lot for your answer. It works great here and it's very intuitive. I didn't know about `sflist2stanfit()`, and it does the job pretty well. It's just not very efficient because it translates the code into C++ twice and cannot be run in parallel, but it's something I'll keep in mind. Thanks again. – danilofreire Jan 22 '15 at 00:42
  • 3
    It is only necessary to compile the model once and then different datasets can be passed to it, whether in parallel or serially. For example, `m <- stan_model(model_code = code)` and then pass `m` to the `fit` argument of `stan()`. – Ben Goodrich Jan 22 '15 at 00:56
  • Thanks for clarifying that, Ben. I can also pass it to `sampling(object = m, ...)`, correct? It seems to work here. – danilofreire Jan 22 '15 at 02:07
2

You will have to use one of the packages for Parallel computing in R. According to this post, it should then work: Will RStan run on a supercomputer?

Here is an example that may work (I use this code with JAGS, will test it with Stan later):

library( doParallel )
cl <- makeCluster( 2 ) # for 2 processes
registerDoParallel( cl )

library(rstan)

# make a function to combine the results
stan.combine <- function(...) { return( sflist2stanfit( list(...) )  ) }

mydatalist <- list(df1 , df2)    
myseeds <- c(123, 456)

# now start the chains
nchains <- 2
m_both <- foreach(i=1:nchains , 
              .packages = c( 'rstan' ), 
              .combine = "stan.combine") %dopar% {
             result <- stan(model_code = code, 
                   data = mydatalist[[i]], # use the right dataset
                   seed=myseeds[i],        # use different seeds
                   chains = 1, iter = 1000) 
             return(result) }

Let me know whether it works with Stan. As I said, I haven't tested it yet.

Community
  • 1
  • 1
dwcoder
  • 478
  • 2
  • 8
  • Thanks for the answer, dwcoder. However, the code doesn't work here. I copied and pasted it and it gives some errors: `Error: unexpected 'in' in "m_both <- foreach(i in"`, then `Error: unexpected ',' in " .packages = c( 'rstan' ),"`, and finally `Error: unexpected ')' in " )"` – danilofreire Jan 22 '15 at 00:39
  • Let me have a look at it, I have stan set up now. – dwcoder Jan 22 '15 at 08:57
  • 1
    I fixed it. There was a problem: `foreach( 1 in 1:nchains )` should have been foreach`( 1=1:nchains)`. – dwcoder Jan 22 '15 at 09:29
  • 1
    I need to point out that @user2980360's code won't run in parallel, it will run the two chains one after the other. If you want to use two cores, you need to use a package like `doParallel`. – dwcoder Jan 22 '15 at 09:31