I'm trying to move from a serial to a parallel approach for some multivariate time series analysis tasks on a large data.table. The table contains data for many different groups, and I'm trying to move from a for loop to a foreach loop using the doParallel package to take advantage of the multicore processor installed.
The problem I am experiencing relates to memory: the new R processes seem to consume large quantities of it. I think what is happening is that the large data.table containing ALL the data is copied into each new process, so I run out of RAM and Windows starts swapping to disk.
I've created a simplified reproducible example which replicates my problem, but with less data and less analysis inside the loop. It would be ideal if a solution existed which farmed the data out to the worker processes only on demand, or which shared the already-allocated memory between cores. Alternatively, some kind of solution may already exist to split the big data into four chunks and pass one chunk to each core, so that each worker has only a subset to work with.
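To make the chunking idea concrete, here is a rough sketch of what I have in mind (untested against my real data; the toy table with columns x1 and y stands in for dt.all, and I'm assuming split.data.table's by/keep.by arguments behave as documented):

```r
library(data.table)
library(doParallel)

set.seed(1)
dt.all = data.table(grp = rep(1:8, each = 25),
                    x1  = rnorm(200),
                    y   = rnorm(200))

# tag each group with one of 4 chunks, then split the table by chunk,
# so whole groups stay together inside a chunk
dt.all[, chunk := as.integer(factor(grp)) %% 4L]
chunks = split(dt.all, by = "chunk", keep.by = FALSE)

registerDoParallel(4)
res = foreach(ch = chunks, .packages = "data.table",
              .combine = "rbind") %dopar% {
  # each worker is handed only its chunk; fit one model per group in it
  ch[, as.list(coef(lm(y ~ x1, data = .SD))), by = grp]
}
stopImplicitCluster()
```

Since the loop body references only ch and never dt.all, only the chunk should be serialised to each worker rather than the full table.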
A similar question has previously been posted here on Stack Overflow, but I cannot make use of the bigmemory solution offered there, as my data contains a character field. I will look further into the iterators package; in the meantime I'd appreciate any suggestions from members with practical experience of this problem.
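From what I can tell so far, iterators::isplit might let me stream one group's rows to each task instead of exporting the whole table. A rough sketch of what I mean (untested; toy columns x1 and y stand in for my real data, and I'm assuming isplit's data.frame method applies to a data.table):

```r
library(data.table)
library(doParallel)
library(iterators)

set.seed(1)
dt.all = data.table(grp = rep(1:4, each = 50),
                    x1  = rnorm(200),
                    y   = rnorm(200))

registerDoParallel(2)
res = foreach(d = isplit(dt.all, dt.all$grp),
              .packages = "data.table", .combine = "rbind") %dopar% {
  # d$value holds only this group's rows; because dt.all itself is never
  # referenced inside the body, foreach should not export it to the workers
  dt.sub = as.data.table(d$value)
  data.table(grp = d$key[[1]],
             t(coef(lm(y ~ x1, data = dt.sub))))
}
stopImplicitCluster()
```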
rm(list=ls())
library(data.table)
num.series = 40 # can customise the size of the problem (x10 eats my RAM)
num.periods = 200 # can customise the size of the problem (x10 eats my RAM)
dt.all = data.table(
  grp = rep(1:num.series, each = num.periods),
  pd  = rep(1:num.periods, num.series),
  y   = rnorm(num.series * num.periods),
  x1  = rnorm(num.series * num.periods),
  x2  = rnorm(num.series * num.periods)
)
dt.all[, y_lag := shift(y), by = grp]  # lag within each group
f_lm = function(dt.sub, grp) {
  my.model = lm(y ~ y_lag + x1 + x2, data = dt.sub)
  coef = summary(my.model)$coefficients
  data.table(grp, variable = rownames(coef), coef)
}
library(doParallel)
registerDoParallel(4)
foreach(g = unique(dt.all$grp), .packages = "data.table", .combine = "rbind") %dopar% {
  # loop variable must not share the column's name: inside dt.all[grp == grp]
  # both sides resolve to the column, selecting every row
  dt.sub = dt.all[grp == g]
  f_lm(dt.sub, g)
}
stopImplicitCluster()  # detach(package:doParallel) alone does not stop the workers