I am processing a large global precipitation dataset (with very fine spatial resolution) to calculate the Standardized Precipitation Index (SPI) using the SPEI package in R.
In general, my question concerns optimizing the processing of very large data. I found some discussions in other posts (here, here, and here, for instance), but nothing seems similar to my case.
My input is a matrix with more than 20 years of monthly observations (>20*12 rows) of precipitation time series for >1,000,000 points (columns). The SPI calculation performs a series of steps for each time series and expresses the index as a standardized deviation from the median. The output is a list whose results matrix ($fitted) has the same size as the input matrix.
Here is an example of the code:
require(SPEI)

# generate a random-values matrix: 240 monthly observations for 200 points
data <- replicate(200, rnorm(240))
# set negative values to zero (precipitation cannot be negative)
data[data <= 0] <- 0

spi6 <- spi(data, 6, kernel = list(type = 'rectangular', shift = 0),
            distribution = 'PearsonIII', fit = 'ub-pwm', na.rm = FALSE,
            ref.start = NULL, ref.end = NULL, x = FALSE, params = NULL)

# check the results for one point
plot(spi6$fitted[, 67])

# extract the results
results <- t(spi6$fitted)
This script works perfectly, but if I increase the number of points (columns, in this case), the processing time grows very rapidly, until I run into a memory allocation problem:
Warning messages:
1: In std[ff, s] <- qnorm(cdfpe3(acu.pred[ff], p3par)) :
Reached total allocation of 16253Mb: see help(memory.size)
2: In std[ff, s] <- qnorm(cdfpe3(acu.pred[ff], p3par)) :
Reached total allocation of 16253Mb: see help(memory.size)
3: In NextMethod("[<-") :
Reached total allocation of 16253Mb: see help(memory.size)
4: In NextMethod("[<-") :
Reached total allocation of 16253Mb: see help(memory.size)
5: In std[ff, s] <- qnorm(pze + (1 - pze) * pnorm(std[ff, s])) :
Reached total allocation of 16253Mb: see help(memory.size)
6: In std[ff, s] <- qnorm(pze + (1 - pze) * pnorm(std[ff, s])) :
Reached total allocation of 16253Mb: see help(memory.size)
7: In NextMethod("[<-") :
Reached total allocation of 16253Mb: see help(memory.size)
8: In NextMethod("[<-") :
Reached total allocation of 16253Mb: see help(memory.size)
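For reference, the memory pattern itself can be decoupled from SPI. The following is only a sketch, with a simple standardization standing in for spi() so that it runs with base R alone: preallocating the result matrix once and filling it column by column keeps just one series in flight at a time, at the cost of running sequentially.

```r
# toy input in the same shape as the real data (rows = months, columns = points)
data <- replicate(200, rnorm(240))
data[data <= 0] <- 0

# preallocate the full result once, then fill it column by column;
# peak memory is roughly the input plus the output, with no large temporaries
results <- matrix(NA_real_, nrow = nrow(data), ncol = ncol(data))
for (j in seq_len(ncol(data))) {
  x <- data[, j]
  # placeholder transform: in the real run this would be the SPI step for one series
  results[, j] <- (x - mean(x)) / sd(x)
}
```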
How can I split my input matrix (or split the procedure over the input matrix) so that groups of column vectors (each one being the full time series of a single point) can be processed in parallel, without losing information (or scrambling my data)? Thanks.
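The pattern I have in mind is something like the sketch below: split the column indices into chunks, hand each chunk's sub-matrix to a separate worker via parallel::parLapply, and cbind the pieces back in their original order. To keep the sketch self-contained with base R alone, scale() stands in as a placeholder for the spi(...)$fitted step, so each worker only ever holds its own sub-matrix.

```r
library(parallel)

# toy input: 240 monthly observations for 50 points (columns)
data <- replicate(50, rnorm(240))
data[data <= 0] <- 0

# split the column indices into one contiguous chunk per worker
n_cores <- 2  # in practice: detectCores() - 1
col_chunks <- split(seq_len(ncol(data)),
                    cut(seq_len(ncol(data)), n_cores, labels = FALSE))
# pre-slice the data so each worker receives only its own columns
chunk_list <- lapply(col_chunks, function(idx) data[, idx, drop = FALSE])

cl <- makeCluster(n_cores)
fitted_chunks <- parLapply(cl, chunk_list, function(m) {
  # placeholder: in the real run this would be the SPI step for the sub-matrix
  scale(m)
})
stopCluster(cl)

# chunks come back in the order they were sent, so cbind
# restores the original column layout
fitted <- do.call(cbind, fitted_chunks)
```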