
Due to memory constraints in a previous script, I modified it following the advice in a similar issue to mine (do not give workers more data than they need - reading global variables using foreach in R). Unfortunately, now I'm struggling with missing results.

The script iterates over a 1.9M-column matrix, processes each column and returns a one-row data frame (the `rbind` combine function in `foreach` stacks the rows). However, when it prints the results, there are fewer rows (results) than the number of columns, and the count changes on every run. Seemingly there is no error in the function inside the `foreach` loop, as it ran smoothly in the previous script, and no error or warning message pops up.
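A stripped-down toy version of the pattern, with a small random matrix standing in for my real data, behaves as expected (one result row per column):

```r
library(foreach)
library(doParallel)
library(iterators)
registerDoParallel(cores = 2)

m <- matrix(rnorm(20), nrow = 4)   # 5 columns stand in for the 1.9M
res <- foreach(col = iter(m, by = "column"), .combine = "rbind") %dopar% {
  # each iteration gets one column and returns a one-row data frame
  data.frame(mean = mean(col), sd = sd(col))
}
nrow(res)   # 5, one row per column
```

With the real data, `nrow(tableRes)` comes back smaller than the number of columns.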

New Script:

if(!require(R.utils)) { install.packages("R.utils"); require(R.utils)}
if(!require(foreach)) { install.packages("foreach"); require(foreach)}
if(!require(doParallel)) { install.packages("doParallel"); require(doParallel)}
if(!require(data.table)) { install.packages("data.table"); require(data.table)}
registerDoParallel(cores=6)

out.file = "data.result.167_6_inside.out"
out.file2 = "data.result.167_6_outside.out"

data1 = fread("data.txt",sep = "auto", header=FALSE, stringsAsFactors=FALSE,na.strings = "NA")
data2 = transpose(data1)
rm(data1)
data3 = data2[, 3:ncol(data2)]
levels2 = data2[-1, 1:2]
rm(data2)

colClasses=c(ID="character",Col1="character",Col2="character",Col3="character",Col4="character",Col5="character",Col6="character") 
res_table = dataFrame(colClasses,nrow=0)

write.table(res_table , file=out.file, append = T, col.names=TRUE, row.names=FALSE, quote=FALSE)
write.table(res_table, file=out.file2, append = T, col.names=TRUE, row.names=FALSE, quote=FALSE)

tableRes =  foreach(col1=data3, .combine="rbind") %dopar% {

    id1 = col1[1]
    df2function = data.frame(levels2[,1,drop=F],levels2[,2,drop=F],as.numeric(col1[-1]))
    mode(df2function[,1])="numeric"
    mode(df2function[,2])="numeric"
    values1 <- try(genericFunction(df2function), TRUE)
        if (is.numeric(try (values1$F, TRUE))) 
        {
            res_table [1,1] = id1
            res_table [1,2] = values1$F[1,1] 
            res_table [1,3] = values1$F[1,2] 
            res_table [1,4] = values1$F[1,3] 
            res_table [1,5] = values1$F[2,2] 
            res_table [1,6] = values1$F[2,3] 
            res_table [1,7] = values1$F[3,3] 
        } else 
        { 
            res_table[1,1] = id1 
            res_table[1,2] = NA 
            res_table[1,3] = NA 
            res_table[1,4] = NA 
            res_table[1,5] = NA 
            res_table[1,6] = NA 
            res_table[1,7] = NA 
        }
write.table(res_table, file=out.file, append = T, col.names=FALSE, row.names=FALSE, quote=FALSE)
return(res_table[1,]) 
}
write.table(tableRes, file=out.file2, append = T, col.names=FALSE, row.names=FALSE, quote=FALSE)

In the previous script, the foreach call looked like this:

tableRes =  foreach(i=1:length(data3), iter=icount(), .combine="rbind") %dopar% { (same code as above) }

Thus, I would like to know the possible causes of this behaviour.
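For reference, this is a safer shape for the loop body that I may try: each iteration returns its row (with `tryCatch` supplying the NA row on error rather than inspecting the result of `try`), and nothing is written until `foreach` has combined everything. `genericFunction` and the `values1$F` matrix are placeholders for my real function and its output:

```r
tableRes <- foreach(col1 = data3, .combine = "rbind") %dopar% {
  id1 <- col1[1]
  df2function <- data.frame(as.numeric(levels2[[1]]),
                            as.numeric(levels2[[2]]),
                            as.numeric(col1[-1]))
  tryCatch({
    values1 <- genericFunction(df2function)          # placeholder name
    data.frame(ID = id1,
               Col1 = values1$F[1, 1], Col2 = values1$F[1, 2],
               Col3 = values1$F[1, 3], Col4 = values1$F[2, 2],
               Col5 = values1$F[2, 3], Col6 = values1$F[3, 3])
  }, error = function(e) {
    # variance could not be estimated for this column: keep the ID, fill NAs
    data.frame(ID = id1, Col1 = NA, Col2 = NA, Col3 = NA,
               Col4 = NA, Col5 = NA, Col6 = NA)
  })
}
# single write, after the parallel loop has finished
write.table(tableRes, file = out.file, col.names = TRUE,
            row.names = FALSE, quote = FALSE)
```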

I'm running this script on a cluster, requesting 80 GB of memory (and 6 cores in this example). This is the largest amount of RAM I can request on a single node, to be sure the script will not fail for lack of memory. (Each node has a pair of 14-core Intel Xeon Skylake CPUs at 2.6 GHz and 128 GB of RAM; OS: RHEL 7.)

Ps 1: Although the new script no longer pages (even with more than 8 cores), each child process still seems to load a large amount of data into memory (~6 GB), as I tracked with the `top` command.

Ps 2: The new script prints the results both inside and outside the `foreach` loop, to track whether the loss of data occurs during or after the loop, and every run gives me a different number of printed results inside and outside the loop.
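To check whether concurrent appends to the shared file are themselves dropping lines, I can run a test that does nothing to the data except write one line per iteration (the file name and count are arbitrary):

```r
library(foreach)
library(doParallel)
registerDoParallel(cores = 6)

tmp <- tempfile()
n <- 10000
invisible(foreach(i = seq_len(n)) %dopar% {
  # every worker appends a single line to the same file
  write.table(data.frame(i), file = tmp, append = TRUE,
              col.names = FALSE, row.names = FALSE, quote = FALSE)
})
length(readLines(tmp))   # anything other than n would implicate the shared file
```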

Ps 3: The fastest run used 20 cores (6 s for 1,000 iterations) and the slowest took 56 s on a single core (tests performed with `microbenchmark`, 10 replications). However, more cores leads to fewer results being returned on the full matrix (1.9M columns).

I really appreciate any help you can provide,

Jwojwo
  • So, your question is why you get only a part of the results you want? Is the result okay when you use only 1 core? – F. Privé Aug 05 '17 at 07:57
  • Some questions: *1* How do you think the new script is helping with memory constraints? You read in the entire data at once. *2* Why do you print to output in each `foreach` and after the `foreach`? You rbind all of your data afterwards, which is to say you store all your 'new' data in memory again. *3* Why do you have a `try` statement? This could mask failed operations/nodes. – CPak Aug 05 '17 at 09:36
  • Some suggestions: *1* You should attempt a vectorized version of your code. I don't see where you're saving memory with the current *new* script. You might actually be slowing things down by initiating a new worker for each column. *2* You should print to output only after `foreach`. I've never printed to a shared output file in a `foreach` statement but I can imagine problems cropping up by multiple workers sharing the same file. *3* Get rid of the `try` statement to see if your code is *truly* error-free. – CPak Aug 05 '17 at 09:41
  • More suggestions: *4* I second @F.Privé, you should show your code on only 1 core. *5* You could also write a test script where you simply print each column to a shared output file, that is, do *nothing* to the data except print. This should clear up if you're losing data based on printing to a shared output file. *6* Rewrite code to read data line-by-line, operate on data line-by-line, and output to file line-by-line. This is a situation where `append=T` makes sense. – CPak Aug 05 '17 at 09:48
  • @F.Privé Exactly. Yes, it is. – Jwojwo Aug 06 '17 at 16:17
  • @ChiPak 1 - To be honest, I was expecting the workers would not get the whole matrix in this new script, but it looks like they still get too much data (monitored through the top). But for some reason beyond my comprehension, the script as a whole went on to use less memory with this modification. 2 - To test if the loss of information occurs during or after loop (Ps 2). 3 - Because for some columns it is not possible to estimate the variance and the function returns an error. – Jwojwo Aug 06 '17 at 16:29
  • @ChiPak 1 - I had a vectorized function in a previous version of this script; I went with a parallel approach because it takes days to run (on a 1.9M matrix it takes almost 5-6 days). Memory is not really an issue with fewer cores, but I don't want to waste memory. 2 - I agree; usually I do that, I only printed that way to test. 3 - It is not, and without try-catch the script will fail on some columns where it is not possible to estimate the variance. – Jwojwo Aug 06 '17 at 17:04
  • @ChiPak 4 - I tried to find the vectorized version unsuccessfully. It's been awhile since the last time I used it. This script has evolved over the years to analyze larger matrices in less time. 5 - Since the inside and outside loop printing give me the same results, I don't think the shared output is an issue. 6 - I'm using append only because I need it in the inside printing, I'll get rid of it. – Jwojwo Aug 06 '17 at 17:33
