
A few days ago I asked a question about calling a custom-made function within a loop, which was nicely resolved by a combination of

 eval(parse(text = function_text))

Here is the link: Automatic creation and use of custom made function in R. This allowed me to use a for loop and automatically call the function I need, building it from a data frame that stores the body of each function.
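For context, a minimal sketch of that technique (the names function_text and my_fun are illustrative, not from the original question):

# build the body of a function as a string, then evaluate it into a real function
function_text <- "x + 1"
eval(parse(text = paste0("my_fun <- function(x){", function_text, "}")))
my_fun(2) # returns 3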

Now I would like to bring the question to the next level. My problem is computation time. I need to evaluate something like 52 indices from a hyperspectral image. This means that in R my hyperspectral image is loaded as a 3D array of 512x512x204 bands.
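For scale, a back-of-the-envelope sketch of the memory footprint (assuming the image is stored as doubles, 8 bytes per element):

# one 512 x 512 x 204 array of doubles is roughly 0.4 GB;
# every parallel worker that receives its own copy pays this cost again
512 * 512 * 204 * 8 / 1024^3 # ~0.4 GB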

What I would like to do is run the evaluation of the indices in parallel to reduce the computation time. Here is a dummy example of what I would like to emulate, for now without parallel computing.

# create a fake array representing my hyperspectral image
HYPR_IMG=array(NA,dim=c(5,3,4))
HYPR_IMG[,,1]=1
HYPR_IMG[,,2]=2
HYPR_IMG[,,3]=3
HYPR_IMG[,,4]=4

library(fields) # image.plot() comes from the fields package
image.plot(HYPR_IMG[,,1], zlim=c(0,20))
image.plot(HYPR_IMG[,,2], zlim=c(0,20))
image.plot(HYPR_IMG[,,3], zlim=c(0,20))
image.plot(HYPR_IMG[,,4], zlim=c(0,20))




# create a fake DF simulating the indices stored in my data frame
IDXname=c("IDX1","IDX2","IDX3","IDX4")
IDXFunc=c("HYPR_IMG[,,1] + 3*HYPR_IMG[,,2]",
          "HYPR_IMG[,,3] + HYPR_IMG[,,2]",
          "HYPR_IMG[,,4] + HYPR_IMG[,,2] - HYPR_IMG[,,3]",
          "HYPR_IMG[,,1] + HYPR_IMG[,,4] + 4*HYPR_IMG[,,2] + HYPR_IMG[,,3]")
IDX_DF=data.frame(IDXname,IDXFunc,stringsAsFactors=FALSE) # keep the function strings as character, not factor


# that was what I did before
Store_DF=data.frame(NA)
for (i in 1:length(IDX_DF$IDXname)) {
  IDX_ID=IDX_DF$IDXname[i]
  IDX_Fun_tmp=IDX_DF$IDXFunc[which(IDX_DF$IDXname==IDX_ID)] # extra care to select the right function
  IDXFunc_call=paste("IDXfun_tmp=function(HYPR_IMG){",IDX_Fun_tmp,"}",sep="")
  eval(parse(text = IDXFunc_call))
  IDX_VAL=IDXfun_tmp(HYPR_IMG)
  image.plot(IDX_VAL,zlim=c(0,20)); title(main=IDX_ID) 
  temp_DF=as.vector(IDX_VAL)
  Store_DF=cbind(Store_DF,temp_DF)
  names(Store_DF)[i+1] <- as.vector(IDX_ID)
}

My final goal is to have the very same Store_DF, storing all the index values. Here I have a for loop, but using a foreach loop things should speed up. If it matters, I am working on Windows 8 or later.

Is it really possible? Will I be able, in the end, to reduce the overall computation time and still get the same Store_DF data frame, or something similar like a matrix with column names?

Thanks a lot!!!

Francesco

1 Answer


For this specific example, either the built-in parallelization of a package like data.table or a parallel apply might be more beneficial. Below is a minimal example of how to achieve the result using parApply from the parallel package. Note that the output is a matrix, which actually yields slightly better performance in base R (not necessarily the case in the tidyverse or data.table). In case the data.frame structure is vital, you will have to convert it with data.frame (a sketch follows the code).

cl <- parallel::makeCluster( parallel::detectCores() )
result <- parallel::parApply(cl = cl, X = IDX_DF, MARGIN = 1, FUN = function(x, IMAGES){
  IDX_ID <- x[["IDXname"]]
  eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", x[["IDXFunc"]], "}")))
  IDX_VAL <- as.vector(IDXfun_tmp(IMAGES))
  names(IDX_VAL) <- IDX_ID
  IDX_VAL
}, IMAGES = HYPR_IMG)
colnames(result) = IDXname
parallel::stopCluster(cl)
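In case the data.frame structure is needed, the conversion is a one-liner; this sketch assumes the result matrix produced above:

# result is a 15 x 4 matrix (one column per index) in the dummy example;
# as.data.frame() gives the Store_DF layout minus the leading NA column
Store_DF <- as.data.frame(result)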

Please note the stopCluster(cl), which is important for shutting down any loose R sessions. Benchmark results (4 tiny cores):

Unit: milliseconds
     expr      min       lq      mean   median       uq      max neval
     Loop 8.420432 9.027583 10.426565 9.272444 9.943783 26.58623   100
 Parallel 1.382324 1.491634  2.038024 1.554690 1.907728 18.23942   100

For replication of the benchmarks, the code is provided below:

cl <- parallel::makeCluster( parallel::detectCores() )
microbenchmark::microbenchmark(
  Loop = {
    Store_DF=data.frame(NA)
    for (i in 1:length(IDX_DF$IDXname)) {
      IDX_ID = IDX_DF$IDXname[i]
      IDX_Fun_tmp = IDX_DF$IDXFunc[which(IDX_DF$IDXname == IDX_ID)] # extra care to select the right function
      eval(parse(text = paste0("IDXfun_tmp = function(HYPR_IMG){", IDX_Fun_tmp, "}")))
      IDX_VAL = IDXfun_tmp(HYPR_IMG)
      #Plotting in parallel is not a good idea: most often it will not work, and it might make the R session crash or slow down significantly (at best the latter, at worst the former)
      #image.plot(IDX_VAL, zlim = c(0,20)); title(main = IDX_ID) 
      temp_DF = as.vector(IDX_VAL)
      Store_DF = cbind(Store_DF,temp_DF)
      names(Store_DF)[i+1] <- as.vector(IDX_ID)
    }
    rm(Store_DF)
  },
  Parallel = {
    result <- parallel::parApply(cl = cl, X = IDX_DF, MARGIN = 1, FUN = function(x, IMAGES){
      IDX_ID <- x[["IDXname"]]
      eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", x[["IDXFunc"]], "}")))
      IDX_VAL <- as.vector(IDXfun_tmp(IMAGES))
      names(IDX_VAL) <- IDX_ID
      IDX_VAL
    }, IMAGES = HYPR_IMG)
    colnames(result) = IDXname
    rm(result)
  }
)
parallel::stopCluster(cl)

Edit: Using the foreach package

After a few comments about performance issues (maybe due to memory), I decided to illustrate how one could obtain the same result using the foreach package. A few notes:

  1. The foreach package uses iterators. By default it can be used like a for loop, where it will iterate over each column of a data.frame.
  2. Like other parallel implementations in R, if you are on Windows you will often have to export the data used in the calculations. This can sometimes be avoided with some fiddling, and foreach will sometimes let you skip exporting data; exactly when this is possible is unclear from the documentation.
  3. The output of the foreach will be combined either into a list or as defined by the .combine argument, which can be rbind, cbind or any other function (a tiny illustration follows this list).
  4. There are a lot of comments, making the code seem a lot longer than it actually is. Removing comments and blank lines, it is 9 lines longer.
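A tiny illustration of the .combine behavior (a standalone sketch, unrelated to the image data):

library(foreach)
# with .combine = c the three iteration results are concatenated into one vector
foreach(i = 1:3, .combine = c) %do% i^2 # returns c(1, 4, 9)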

Below is the code, which will yield the same output as above. Note that I have used the data.table package; for more information about this package I suggest their wiki on GitHub.

cl <- parallel::makeCluster( parallel::detectCores() )
#foreach uses doParallel for the parallelization
doParallel::registerDoParallel(cl)
#To iterate over the rows, we need to use iterators
# if foreach is given a matrix, it will be converted to a column iterator
rowIterator <- iterators::iter(IDX_DF, by = "row")
library(foreach)
result <- 
  foreach(
        #Supply the iterator
        row = rowIterator, 

        #Specify whether the calculations need to be in order. If not, we can get better performance by not requiring it
        .inorder = FALSE, 

        #In most foreach loops you will have to export the data needed for the calculations;
        # it worked without doing so for me, which is faster when the exported object is large
        #.export = c("HYPR_IMG"), 

        #We need to say how the output should be merged. If nothing is given, it will be output as a list
        #data.table's rbindlist is faster than rbind (and returns a data.table)

        .combine = function(...) data.table::rbindlist(list(...)),
        #otherwise we could've used:
        #.combine = rbind

        #if we don't use rbind or cbind (I used data.table::rbindlist for speed)
        # we have to say whether the combine function can take more than 1 argument
        .multicombine = TRUE

        ) %dopar% # %do% = sequential loop, %dopar% = parallel loop, %:% = nested foreach loops
{

  IDX_ID <- row[["IDXname"]]
  eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", row[["IDXFunc"]], "}")))
  IDX_VAL <- as.vector(IDXfun_tmp(HYPR_IMG))
  data.frame(ID = IDX_ID, IDX_VAL)
}
#output is saved in result
result
result_reformatted <- data.table::dcast(result[, indx := 1:.N, by = ID],
                                        indx ~ ID,
                                        value.var = "IDX_VAL")
#if we don't want to use data.table, we could use unstack instead
unstack(result, IDX_VAL ~ ID)
Oliver
  • Thanks, the script is working but I have a few questions if you don't mind: 1. I tried to measure the computation using `Sys.time()` before and after the `for` and `parApply` loops: what I get is that the `for` loop took _Time difference of 0.4963591 secs_ while `parApply` took _Time difference of 7.676425 secs_. Why is this happening? Thanks – Francesco Feb 10 '19 at 09:13
  • Hello Francesco, if an answer gives you what you are looking for, remember to give it an upvote. I suggest using system.time( expression ) for timing instead, or microbenchmark (from the microbenchmark package) if time allows. Have you tried copying my benchmarking example above (the last R code box) and checking whether you get the same result? Microbenchmark will run the code 100 times, giving a better estimate of the run time. – Oliver Feb 10 '19 at 09:26
  • Hello Oliver, I tried your benchmarking example and the results are more or less the same as you posted. The issue is that when I run the overall script, the total time of the parallel script is 8 hours while the for loop takes only 4 hours. In both cases it is too much, because I have to process 10 times the data I put in the test script. Could it be a RAM problem? Or could it be that passing the information to the single cores slows down the overall computation? Thanks – Francesco Feb 11 '19 at 08:12
  • Hello Francesco. That does seem odd, and without knowing the code itself it is hard for me to provide a certain answer. If you are on a mac/linux device, conflicts when trying to grab data might cause some issues. On Windows, running out of memory is punished by the session stopping, resulting in a "cannot allocate memory of size" error. If this is the cause, splitting the problem into smaller sub-problems (e.g. iterating over 1000 lines at a time) could alleviate it to some degree. – Oliver Feb 11 '19 at 13:02
  • You could try running your script on a small subset of the test data that is unlikely to cause memory issues, say 1/4 or 1/8 of the data size; this might indicate whether size is the actual problem. If the problem persists at the smaller size, memory is unlikely to be the issue. – Oliver Feb 11 '19 at 13:12
  • Thanks Oliver, I'll try. Do you think that putting effort into changing the command from `parApply` to `foreach` would do any better? – Francesco Feb 11 '19 at 15:54
  • As a general guideline: the foreach package cannot be faster than an implementation through the parallel package, given the same constraining factors. The foreach package utilizes the iterators package to carry information to each iteration, but these have some overhead, resulting in lower performance. That said, it seems the team has used these to reduce memory usage, in case the iterator itself is the memory problem. In short: maybe, but it will take some fiddling. I might edit the answer to include an example of the foreach implementation. – Oliver Feb 11 '19 at 16:19
  • If you are willing to do it, I will test it. In any case, thanks a lot – Francesco Feb 12 '19 at 15:19
  • I have added an example using foreach. **Note** the code is a lot slower! It seems odd, but it actually ends up being slower than the for loop. I believe this is due to the iterator, which will be helpful if you are iterating over huge amounts of data **and** performing a large number of calculations in each iteration. It might simply be that my `foreach` code is bad, however. I work more directly with the parallel, futures, promises and ipc packages, the latter 3 being more complicated. – Oliver Feb 12 '19 at 18:29