
I am trying to have multiple snowfall threads write to the same file using write.table(). In a small number of cases the rows come out broken, i.e. multiple rows appear mixed together, which I presume happens when two threads try to write to the same file at the same time.

An example is:

require(snowfall)
sfInit(parallel = TRUE, cpus = 16)
sfLapply(1:10000, function(x) {
  # every worker appends to the same file, so rows from different
  # workers can end up interleaved
  mytable <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
  write.table(mytable, "mytable.csv", sep = ",", append = TRUE, col.names = FALSE)
})
sfStop()

Is there a way to ensure that only one thread writes to the file at a time? In essence, a thread locks the file, writes to it, and then releases the lock.

jackStinger
  • Do you actually need to write the table on the workers? If the csv contains the results of the calculation, it would be much faster to return it as a list and then write the list in the main script. – Buggy Dec 01 '15 at 10:07
  • Yep, I would. The actual input is a list of file names (approx 500, size up to 50 GB total, and the total output file size is approx 3.5 GB). I read each file in and, based on some analysis, write different parts to different files. This will further need to scale out, hence this. – jackStinger Dec 01 '15 at 10:22
  • It seems like you are out of luck. You can write separate files and then join them later, as this post suggests: http://stackoverflow.com/questions/20425071/lock-file-when-writing-to-it-from-parallel-processes-in-r – Buggy Dec 01 '15 at 10:46
  • The `batch` package has some function called mergecsv which deals with merging parallel csv results. https://cran.r-project.org/web/packages/batch/index.html – Buggy Dec 01 '15 at 10:54
  • So that is exactly what I've been thinking about: use `x = system("mkdir abc.lock",ignore.stderr = T)` and remove the lock folder after writing to the file, or append `x = system("echo $$",intern = T)` to the file name (both ideas are sketched below). – jackStinger Dec 01 '15 at 10:57
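
For reference, a minimal sketch of the "write separate files, merge later" idea from the comments. It assumes each snowfall worker is a separate R process, so Sys.getpid() distinguishes them; the file names mytable_<pid>.csv and mytable.csv are illustrative, not from the original post:

require(snowfall)
sfInit(parallel = TRUE, cpus = 16)
sfLapply(1:10000, function(x) {
  mytable <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
  # each worker process appends only to its own file, so no two writers
  # ever touch the same file and rows cannot interleave
  outfile <- sprintf("mytable_%d.csv", Sys.getpid())
  write.table(mytable, outfile, sep = ",", append = TRUE, col.names = FALSE)
})
sfStop()

# back in the main script: concatenate the per-worker pieces into one CSV
file.create("mytable.csv")
parts <- list.files(pattern = "^mytable_[0-9]+\\.csv$")
file.append("mytable.csv", parts)

The mergecsv function from the batch package mentioned in the comments covers the same merge step.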
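
And a minimal sketch of the directory-as-lock idea from the last comment, using dir.create() (which fails if the directory already exists) instead of shelling out to mkdir. The helper name write_locked, the lock path, and the 10 ms retry interval are illustrative choices:

write_locked <- function(mytable, file, lockdir = paste0(file, ".lock")) {
  # dir.create() returns FALSE if the directory already exists, so only
  # one worker at a time can "acquire" the lock by creating it
  while (!dir.create(lockdir, showWarnings = FALSE)) {
    Sys.sleep(0.01)  # another worker holds the lock; wait and retry
  }
  # release the lock even if write.table() throws an error
  on.exit(unlink(lockdir, recursive = TRUE))
  write.table(mytable, file, sep = ",", append = TRUE, col.names = FALSE)
}

sfExport("write_locked")  # make the helper visible on the workers
sfLapply(1:10000, function(x) {
  mytable <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
  write_locked(mytable, "mytable.csv")
})

Whether directory creation is truly atomic depends on the filesystem; on local disks it generally is, but on network filesystems such as NFS this may not give the guarantee you need.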

0 Answers