
I have a large time series data set that normally takes about 4 hours to process sequentially through its 1800 series. I'm looking for a way to use several cores to reduce this time, because I have a number of these data sets to get through on a regular basis.

The R code I am using for the sequential processing is below. There are 4 files containing different data sets, and each file contains over 1800 series. I have been trying to use doParallel to analyze each time series independently and concatenate the results into a single file; even a CSV file would do.

# ets() and forecast() come from the forecast package;
# createWorkbook(), addDataFrame() and saveWorkbook() from xlsx
library(forecast)
library(xlsx)

# load the datasets
files <- c("3MH Clean", "3MH", "6MH", "12MH")
for (j in 1:4)
{
  title <- paste("\n\n\n Evaluation of", files[j], " - Started at", date(), "\n\n\n")
  cat(title)

  History <- read.csv(paste(files[j],"csv", sep="."))

  # output forecast to XLSX
  outwb <- createWorkbook()
  sheet <- createSheet(outwb, sheetName = paste(files[j], " - ETS"))                                      
  Item <- unique(History$Item)   # History$Item is already a vector

  for (i in seq_along(Item))
  {
    title <- paste("Evaluation of item ", Item[i], "-", i, "of", length(Item),"\n")
    cat(title)
    # subset() would resolve Item[i] inside History itself, so index directly
    data <- History[History$Item == Item[i], ]
    dates <- unique(unlist(data$Date))
    d <- as.Date(dates, format = "%d/%m/%Y")
    data.ts <- ts(data$Volume, frequency=12, start=c(as.numeric(format(d[1],"%Y")), as.numeric(format(d[1],"%m"))))
    try(data.ets <- ets(data.ts))
    # don't assign to "forecast.ets": that masks the function of the
    # same name from the forecast package on the next iteration
    try(fc.ets <- forecast(data.ets, h=24))
    ets.df <- data.frame(fc.ets)
    ets.df$Item <- rep(Item[i], 24)   # one item label per forecast row
    r <- 24*(i-1)+2
    addDataFrame(ets.df, sheet, col.names=FALSE, startRow=r)
  }

  title <- paste("\n\n\n Evaluation of", files[j], " - Completed at", date(), "\n\n\n")
  cat(title)
  saveWorkbook(outwb, paste(files[j],"xlsx",sep='.'))
}
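
For reference, a generic doParallel skeleton for this kind of per-series loop (a sketch only - it reuses History and Item from the code above and, for brevity, drops the ts() start date) would look something like:

library(doParallel)
registerDoParallel(cores=4)

# each iteration fits and forecasts one series and returns a data
# frame; .combine=rbind stacks them so the result can go to one CSV
results <- foreach(i=seq_along(Item), .combine=rbind,
                   .packages="forecast") %dopar% {
  data.i  <- History[History$Item == Item[i], ]
  data.ts <- ts(data.i$Volume, frequency=12)
  fc      <- forecast(ets(data.ts), h=24)
  cbind(Item=Item[i], data.frame(fc))
}
write.csv(results, "forecasts.csv", row.names=FALSE)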
    I'm guessing that instead of parallelizing your double for loop, it would be much better to try to vectorize your analysis as much as possible, but without a small reproducible example, it's hard to follow your analysis. See [how to make a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – talat Aug 08 '14 at 13:18

1 Answer


I mirror the sentiment in the comment that the process should be vectorized as much as possible. In the absence of a reproducible example, I have to make some assumptions about your data: I assume the series are stacked one on top of the other, with a variable indicating which series each row belongs to. If instead the series sit in separate columns, you can melt the data frame with reshape2 and then use the code below.
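
For illustration, a minimal sketch of that melt step, using hypothetical column names (a wide frame holding one column of volumes per month):

require(reshape2)

# hypothetical wide layout: one row per item, one column per month
wide <- data.frame(Item=c("A","B"), M01=c(10,20), M02=c(12,18))

# melt() stacks the month columns into (Item, Date, Volume) rows
long <- melt(wide, id.vars="Item",
             variable.name="Date", value.name="Volume")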

If you are on a Linux or Mac box, you can use the parallel package and mclapply once you manage to vectorize your code a bit more. I'm partial to data.table; if you are unfamiliar with it, this may be a steep climb.

require(forecast)
require(data.table)
require(parallel)
require(zoo)        # for as.yearmon()

# create 10 sample series - not 1800, but you get the idea
data <- data.table(series=rep(1:10, each=60),
                   date=seq(as.Date("01/08/2009", format="%d/%m/%Y"),
                            length.out=60, by="1 month"),
                   Volume=rep(rnorm(10), each=60) + rnorm(600))
# define some functions to get the job done
getModel <- function(s) {
  # fit an ETS model to series s, ordered chronologically
  data[order(date)][series==s][, ets(ts(Volume, start=as.yearmon(min(date)), frequency=12))]
}
getForecast <- function(s,forward=24) {
  model <- getModel(s)
  fc <- forecast(model,h=forward)
  return(data.frame(fc))
}
# fit and forecast the series m at a time, where m is the number of cores
Forecasts <- mclapply(1:10,getForecast)
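
Note that mclapply relies on forking, so on Windows it will run sequentially. A rough socket-cluster equivalent from the same parallel package (a sketch, untested on your data) would be:

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, { library(forecast); library(data.table); library(zoo) })
clusterExport(cl, c("data", "getModel", "getForecast"))
Forecasts <- parLapply(cl, 1:10, getForecast)
stopCluster(cl)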

With your list of data frames, you can do something like:

# the workbook writes should stay sequential: the Java-backed xlsx
# objects cannot safely be modified from forked workers, and each
# 24-row block needs its own start row, as in your original loop
for (i in seq_along(Forecasts)) {
  addDataFrame(Forecasts[[i]], sheet, col.names=FALSE, startRow=24*(i-1)+2)
}
mgriebe
  • Sorry for the slow response @mgriebe. This is very helpful. I am struggling with one other aspect. The historical data comes from an ERP system in the wide form:

    Factor_1  Factor_2  Factor_3  Date_1  Date_2  Date_3  ...  Date_n
    F1        F2        F3        Qty_1   Qty_2   Qty_3   ...  Qty_n

    Any suggestions on how to transform this into the following long form? Your help will be greatly appreciated.

    Factor_1  Factor_2  Factor_3  Date  Volume
    F1        F2        F3        D_1   Qty_1
    F1        F2        F3        D_2   Qty_2
    F1        F2        F3        D_n   Qty_n

    – user1354798 Nov 16 '14 at 17:32