Due to memory constraints in a previous script, I modified it following the advice given for an issue similar to mine (do not give more data than needed by workers - reading global variables using foreach in R). Unfortunately, I am now struggling with missing results.
The script iterates over a matrix with 1.9M columns, processes each column, and returns a one-row data frame (the rbind combine of foreach stacks the rows). However, when the results are printed, there are fewer rows (results) than columns, and the count changes on every run. There seems to be no error in the function called inside the foreach loop, as it ran smoothly in the previous script, and no error or warning message pops up.
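To make the pattern concrete, here is a minimal, self-contained sketch of what the loop does (toy data; genericFunction is replaced by a simple mean so the snippet runs standalone):

library(foreach)
library(doParallel)
registerDoParallel(cores = 2)

# Toy stand-in for the real matrix: the first entry of each column is an ID.
toy = data.frame(matrix(c(paste0("id", 1:5), runif(50)), nrow = 11, byrow = TRUE))

res = foreach(col1 = toy, .combine = "rbind") %dopar% {
  # each worker turns one column into a one-row data frame
  data.frame(ID = col1[1], value = mean(as.numeric(col1[-1])))
}
nrow(res)   # expected: ncol(toy), i.e. 5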
New Script:
if(!require(R.utils)) { install.packages("R.utils"); require(R.utils)}
if(!require(foreach)) { install.packages("foreach"); require(foreach)}
if(!require(doParallel)) { install.packages("doParallel"); require(doParallel)}
if(!require(data.table)) { install.packages("data.table"); require(data.table)}
registerDoParallel(cores=6)
out.file = "data.result.167_6_inside.out"
out.file2 = "data.result.167_6_outside.out"
data1 = fread("data.txt", sep = "auto", header = FALSE, stringsAsFactors = FALSE, na.strings = "NA")
data2 = transpose(data1)
rm(data1)
data3 = data2[, 3:ncol(data2)]   # the 1.9M data columns
levels2 = data2[-1, 1:2]         # the two leading level columns (header row dropped)
rm(data2)
colClasses = c(ID = "character", Col1 = "character", Col2 = "character", Col3 = "character",
               Col4 = "character", Col5 = "character", Col6 = "character")
res_table = dataFrame(colClasses, nrow = 0)   # empty template; dataFrame() comes from R.utils
write.table(res_table, file = out.file, append = T, col.names = TRUE, row.names = FALSE, quote = FALSE)
write.table(res_table, file = out.file2, append = T, col.names = TRUE, row.names = FALSE, quote = FALSE)
tableRes = foreach(col1 = data3, .combine = "rbind") %dopar% {
  id1 = col1[1]
  # build the per-column input for genericFunction: the two level columns plus the values
  df2function = data.frame(levels2[, 1, drop = F], levels2[, 2, drop = F], as.numeric(col1[-1]))
  mode(df2function[, 1]) = "numeric"
  mode(df2function[, 2]) = "numeric"
  values1 <- try(genericFunction(df2function), TRUE)
  if (is.numeric(try(values1$F, TRUE))) {
    res_table[1, 1] = id1
    res_table[1, 2] = values1$F[1, 1]
    res_table[1, 3] = values1$F[1, 2]
    res_table[1, 4] = values1$F[1, 3]
    res_table[1, 5] = values1$F[2, 2]
    res_table[1, 6] = values1$F[2, 3]
    res_table[1, 7] = values1$F[3, 3]
  } else {
    # genericFunction failed for this column: keep the ID, pad the stats with NAs
    res_table[1, 1] = id1
    res_table[1, 2:7] = NA
  }
  write.table(res_table, file = out.file, append = T, col.names = FALSE, row.names = FALSE, quote = FALSE)
  return(res_table[1, ])
}
write.table(tableRes, file = out.file2, append = T, col.names = FALSE, row.names = FALSE, quote = FALSE)
In the previous script, the foreach call looked like this (icount() comes from the iterators package):
tableRes = foreach(i = 1:length(data3), iter = icount(), .combine = "rbind") %dopar% { (same code as above) }
Thus, I would like to know: what are the possible causes of this behaviour?
I'm running this script on a cluster, requesting 80 GB of memory (and 6 cores in this example). This is the largest amount of RAM I can request on a single node, to be sure the script will not fail for lack of memory. (Each node has a pair of 14-core Intel Xeon Skylake CPUs at 2.6 GHz and 128 GB of RAM; OS: RHEL 7.)
PS 1: Although the new script no longer pages (even with more than 8 cores), each child process still seems to load a large amount of data into memory (~6 GB), as I tracked with the top command.
PS 2: The new script prints the results both inside and outside the foreach loop, to track whether the loss of data occurs during the loop or after it finishes; as noted above, every run gives me a different number of printed results inside and outside the loop.
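After a run, the counts can be compared like this (a sketch; it assumes both output files parse cleanly with their headers):

n_expected = ncol(data3)
n_inside   = nrow(fread(out.file, header = TRUE))    # rows appended by the workers
n_outside  = nrow(fread(out.file2, header = TRUE))   # rows written after the loop
c(expected = n_expected, inside = n_inside, outside = n_outside)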
PS 3: The fastest run used 20 cores (6 s for 1,000 iterations) and the slowest was 56 s on a single core (tests performed with microbenchmark, 10 replications). However, the more cores I use, the fewer results are returned for the full matrix (1.9M columns).
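The timing runs were set up roughly like this (a sketch: the 1,000-column subset and the stubbed worker body are stand-ins; the real runs used the full loop body above):

library(microbenchmark)
subset1000 = data3[, 1:1000]
microbenchmark(
  foreach(col1 = subset1000, .combine = "rbind") %dopar% {
    # stub body; the real script calls genericFunction() here
    data.frame(ID = col1[1], value = mean(as.numeric(col1[-1]), na.rm = TRUE))
  },
  times = 10
)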
I really appreciate any help you can provide,