I am processing a large global precipitation dataset (with very fine spatial resolution) to calculate the Standardized Precipitation Index (SPI) using the SPEI package in R.
In general, my question concerns optimizing the processing of very large data. I found some discussions in other posts (here, here, and here, for instance), but nothing seems similar to my case.
My input is a matrix with more than 20 years of monthly observations (>20*12 rows) of precipitation time series for >1,000,000 points (columns). The SPI calculation performs a series of steps for each time series and expresses the index as a standardized deviation from the median. The output is a list whose results matrix ($fitted) has the same size as the input matrix.
Here is an example of the code:
require(SPEI)

# generate a random-values matrix: 240 monthly observations for 200 points
data <- replicate(200, rnorm(240))
# set negative values to zero (precipitation cannot be negative)
data[data <= 0] <- 0

spi6 <- spi(data, 6, kernel = list(type = 'rectangular', shift = 0),
            distribution = 'PearsonIII', fit = 'ub-pwm', na.rm = FALSE,
            ref.start = NULL, ref.end = NULL, x = FALSE, params = NULL)

# check the results for one point
plot(spi6$fitted[, 67])

# extract the results
results <- t(spi6$fitted)
This script works perfectly, but if I increase the number of points (columns, in this case), the processing time grows very rapidly, until I run into a memory allocation problem:
Warning messages:
1: In std[ff, s] <- qnorm(cdfpe3(acu.pred[ff], p3par)) :
Reached total allocation of 16253Mb: see help(memory.size)
2: In std[ff, s] <- qnorm(cdfpe3(acu.pred[ff], p3par)) :
Reached total allocation of 16253Mb: see help(memory.size)
3: In NextMethod("[<-") :
Reached total allocation of 16253Mb: see help(memory.size)
4: In NextMethod("[<-") :
Reached total allocation of 16253Mb: see help(memory.size)
5: In std[ff, s] <- qnorm(pze + (1 - pze) * pnorm(std[ff, s])) :
Reached total allocation of 16253Mb: see help(memory.size)
6: In std[ff, s] <- qnorm(pze + (1 - pze) * pnorm(std[ff, s])) :
Reached total allocation of 16253Mb: see help(memory.size)
7: In NextMethod("[<-") :
Reached total allocation of 16253Mb: see help(memory.size)
8: In NextMethod("[<-") :
Reached total allocation of 16253Mb: see help(memory.size)
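For reference, the memory pattern itself can be decoupled from SPI. The following is only a sketch, with a simple standardization standing in for spi() so that it runs with base R alone: preallocating the result matrix once and filling it column by column keeps just one series in flight at a time, at the cost of running sequentially.

```r
# toy input in the same shape as the real data (rows = months, columns = points)
data <- replicate(200, rnorm(240))
data[data <= 0] <- 0

# preallocate the full result once, then fill it column by column;
# peak memory is roughly the input plus the output, with no large temporaries
results <- matrix(NA_real_, nrow = nrow(data), ncol = ncol(data))
for (j in seq_len(ncol(data))) {
  x <- data[, j]
  # placeholder transform: in the real run this would be the SPI step for one series
  results[, j] <- (x - mean(x)) / sd(x)
}
```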
How can I split my input matrix (or split the procedure over the input matrix) so that groups of column vectors (each one being the full time series of a single point) can be processed in parallel, without losing information (or scrambling my data)? Thanks.
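The pattern I have in mind is something like the sketch below: split the column indices into chunks, hand each chunk's sub-matrix to a separate worker via parallel::parLapply, and cbind the pieces back in their original order. To keep the sketch self-contained with base R alone, scale() stands in as a placeholder for the spi(...)$fitted step, so each worker only ever holds its own sub-matrix.

```r
library(parallel)

# toy input: 240 monthly observations for 50 points (columns)
data <- replicate(50, rnorm(240))
data[data <= 0] <- 0

# split the column indices into one contiguous chunk per worker
n_cores <- 2  # in practice: detectCores() - 1
col_chunks <- split(seq_len(ncol(data)),
                    cut(seq_len(ncol(data)), n_cores, labels = FALSE))
# pre-slice the data so each worker receives only its own columns
chunk_list <- lapply(col_chunks, function(idx) data[, idx, drop = FALSE])

cl <- makeCluster(n_cores)
fitted_chunks <- parLapply(cl, chunk_list, function(m) {
  # placeholder: in the real run this would be the SPI step for the sub-matrix
  scale(m)
})
stopCluster(cl)

# chunks come back in the order they were sent, so cbind
# restores the original column layout
fitted <- do.call(cbind, fitted_chunks)
```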