I need to multi-thread my R application: it takes 5 minutes to run and uses only 15% of the computer's available CPU.

An example of a process which takes a while to run is calculating the mean of a very large raster stack containing n layers:

layer_mean = cellStats(raster_layers[[n]], stat='mean', na.rm=TRUE)

Using the parallel library, I can create a new cluster and pass a function to it:

cl <- makeCluster(8, type = "SOCK")
parLapply(cl, raster_layers[[1]], mean_function)
stopCluster(cl)

where mean_function is:

mean_function <- function(raster_object)
{
  result = cellStats(raster_object, stat='mean', na.rm=TRUE)
  return(result)
}

This method works fine except that the worker nodes can't see the 'raster' package, which is required to use cellStats, so it fails saying there is no function cellStats. I have tried loading the library inside the function, but this doesn't help.

The raster package comes with its own cluster function, and it CAN see cellStats. However, as far as I can tell, that function must be passed a single raster object and must return a raster object, which isn't flexible enough for me. I need to pass a list of objects and return a numeric value, which I can do with normal clustering from the parallel library, if only the nodes could see the raster package functions.

So, does anybody know how I can make a package available on each node when multi-threading in R? Or how I can return a single value from the raster cluster function?

Single Entity

1 Answer

The solution came from Ben Barnes, thank you.

The following code works fine:

mean_function <- function(variable)
{
  result = cellStats(variable, stat='mean', na.rm=TRUE)
  return(result)
}

cl <- makeCluster(procs, type = "SOCK")
clusterEvalQ(cl, library(raster))
result = parLapply(cl, a_list, mean_function)
stopCluster(cl)

Where procs is the number of worker processes you wish to start. In my case it matched the length of the list being passed (here called a_list), but parLapply does not require that: it splits the list across however many workers the cluster has.

a_list is simply a list of rasters, each of which cellStats can operate on to calculate the mean. The clusterEvalQ call is what loads the raster package on every node, so that cellStats can be found there.
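
For anyone who wants to try this without my data, here is a minimal self-contained sketch of the same approach. The small random rasters, the make_raster helper, and the worker count of 2 are made up for illustration and are not from my application. It uses type = "PSOCK", which is the socket cluster type built into the parallel package (type = "SOCK" relies on the snow package being installed).

```r
library(parallel)
library(raster)

# Hypothetical stand-in data: six small in-memory rasters with random values.
make_raster <- function(seed) {
  set.seed(seed)
  r <- raster(nrows = 10, ncols = 10)
  setValues(r, runif(ncell(r)))
}
a_list <- lapply(1:6, make_raster)

mean_function <- function(raster_object) {
  cellStats(raster_object, stat = 'mean', na.rm = TRUE)
}

procs <- 2  # assumed worker count, deliberately smaller than length(a_list)
cl <- makeCluster(procs, type = "PSOCK")
clusterEvalQ(cl, library(raster))  # attach raster on every worker
result <- parLapply(cl, a_list, mean_function)
stopCluster(cl)

# The same calculation run serially in the main session, for comparison.
serial <- lapply(a_list, mean_function)
```

result is a list with one mean per raster, in the same order as a_list, and it should agree with serial. Whether the parallel version is actually faster depends on how large the rasters are relative to the cost of starting the workers and shipping the data to them.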

Single Entity
  • I am trying to reproduce your example, but there are some variables that I don't have access to. For example what are number_of_bands and a_list? – thiagoveloso May 06 '15 at 02:58
  • Specifically, I would like to use more cores to run the cellStats command on this code:
    library(raster)
    # make up some data
    cmip <- brick(nc=150, nr=114, nl=1872)
    cmip <- setValues(cmip, matrix(rep(1:17100, 1872), nc=1872))
    # get mean values (area average) as data frames
    cmip.mean <- as.data.frame(cellStats(cmip, mean, na.rm=T))
    – thiagoveloso May 06 '15 at 03:52
  • I have edited my answer a little to make it more intuitive – Single Entity May 06 '15 at 08:13
  • Thanks for the edit. In my case, instead of rasters I have raster bricks (a raster with several layers). Can your code still be applied in my case? I didn't get any success in my attempts. – thiagoveloso May 06 '15 at 08:27
  • Yes, I would imagine so. Populate a_list with each layer of your raster brick, making sure that your procs value has the same number as the number of layers you put into the list. – Single Entity May 06 '15 at 09:01
  • It's not worth it in my case. Making a cluster with my 1872 layers takes much more time than the calculation itself. Thanks for your tips, though. – thiagoveloso May 06 '15 at 09:14
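
The brick suggestion from the comments can be sketched roughly as follows. This is only an illustration under assumptions: the tiny 4-layer brick and the worker count of 2 are made up, and unstack is used here as one way to turn a brick into a list of single-layer rasters.

```r
library(parallel)
library(raster)

# Hypothetical small brick standing in for a real multi-layer dataset.
b <- brick(nrows = 10, ncols = 10, nl = 4)
b <- setValues(b, matrix(runif(ncell(b) * 4), ncol = 4))

# Split the brick into a list with one RasterLayer per layer.
layer_list <- unstack(b)

cl <- makeCluster(2, type = "PSOCK")  # assumed worker count
clusterEvalQ(cl, library(raster))     # attach raster on every worker
layer_means <- parSapply(cl, layer_list, function(r) {
  cellStats(r, stat = 'mean', na.rm = TRUE)
})
stopCluster(cl)

# cellStats on the whole brick already returns one value per layer,
# so this serial call gives the reference result.
brick_means <- cellStats(b, stat = 'mean', na.rm = TRUE)
```

As the last comment reports, for many small layers the cost of starting the cluster and sending every layer to a worker can outweigh the calculation itself, so it is worth timing the plain cellStats call on the brick before reaching for a cluster.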