
I have seen plenty of questions about writing to a file, but I am wondering about the most robust way to open a text file, append some data and then close it again when many connections may be writing (i.e. in a parallel computing situation) and you can't guarantee when each connection will want to write to the file.

For instance, the following toy example, which uses just the cores on my desktop, seems to work OK, but I am wondering whether this method will be prone to failure as the writes get longer and the number of processes writing to the file increases (especially across a network share, where there may be some latency).

Can anyone suggest a robust, definitive way that connections should be opened, written to and then closed when there may be other slave processes that want to write to the file at the same time?

require(doParallel)   # parallel backend for foreach
require(doRNG)        # reproducible parallel random number streams

ncores <- 7
cl <- makeCluster( ncores , outfile = "" )
registerDoParallel( cl )

res <- foreach( j = 1:100 , .verbose = TRUE , .inorder = FALSE ) %dorng% {
    d <- matrix( rnorm( 1e3 , j ) , nrow = 1 )
    # each worker opens the shared file in append mode, writes one row, then closes it
    conn <- file( "~/output.txt" , open = "a" )
    write.table( d , conn , append = TRUE , col.names = FALSE )
    close( conn )
}

I am looking for the best way to do this, if there even is a best way. Perhaps R and foreach take care of what I would call write-lock issues automagically?

Thanks.

Simon O'Hanlon
  • Not knowing R, I cannot give a definitive answer, but an efficient way in other languages is to devote one thread to IO and set up a queue of write commands for that IO thread to process. That thread can then do the writes in batches, reducing the time it spends on IO. – didierc Mar 08 '13 at 22:08
  • basically, that would be an instance of the producer-consumer pattern – didierc Mar 08 '13 at 22:14
  • @didierc Thanks for the suggestions. I should make it clear that I am looking for an `R`-centric answer, especially in the scenario where multiple nodes with multiple cores are trying to access the same file on a network share. Maybe what I have posted is perfectly adequate. TBH I should probably have found a scenario in which it broke first, but I am preempting myself! – Simon O'Hanlon Mar 09 '13 at 09:53
  • you're not doing anything wrong: you've correctly tagged your question. But not seeing many answers, I thought I could perhaps help you somehow – didierc Mar 09 '13 at 13:33
  • If you are using a POSIX filesystem and your append is less than PIPE_BUF bytes (4k on Linux) then the append operation is atomic. See [Is file append atomic in UNIX?](http://stackoverflow.com/a/1154599/3429373). That's assuming R doesn't chop up the input into multiple chunks. – BeingQuisitive Feb 05 '15 at 20:21

3 Answers


The foreach package doesn't provide a mechanism for file locking that would prevent multiple workers from writing to the same file at the same time. The result of doing that is going to depend on your operating system and file system. I'd be particularly worried about the results when using a distributed file system such as NFS.

Instead, I would change the way you open the output file to include the process ID of the worker:

conn <- file( sprintf("~/output_%d.txt" , Sys.getpid()) , open = "a" )

You could concatenate the files after the foreach loop returns if desired.

Of course, if you were using multiple machines, you might have two workers with the same process ID, so you could include the hostname in the file name as well, using Sys.info()[['nodename']], for example.
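
A rough sketch of how that might look, combining the hostname and PID in the file name and then stitching the per-worker files back together on the master. The naming scheme and the Sys.glob/file.append concatenation step are only illustrative, and they assume the per-worker files end up on a path the master can see (e.g. a shared home directory); d is the matrix from the question's loop:

# inside each foreach iteration: one output file per host/process
fname <- sprintf( "~/output_%s_%d.txt" ,
                  Sys.info()[["nodename"]] , Sys.getpid() )
conn <- file( fname , open = "a" )
write.table( d , conn , col.names = FALSE )
close( conn )

# back on the master, after the foreach loop has returned:
parts <- Sys.glob( "~/output_*.txt" )   # all the per-worker files
file.append( "~/output.txt" , parts )   # concatenate them into a single file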

Steve Weston
  • Thanks Steve. I like your reasoning and this seems like a sensible, robust solution. I'm going to leave the question unanswered for a while (1 month?), in the hope it might garner a few more suggestions, but after that I will choose one of the posted answers to close it. Thanks! – Simon O'Hanlon Mar 11 '13 at 21:04

A variation on the method proposed by @didierc is to write the matrices from a combine function:

# open the output file once, on the master
conn <- file("~/output.txt", "w")

# combine function: write each result to the connection, then return the
# connection so it is passed along as the accumulator for the next result
wtab <- function(conn, d) {
    write.table(d, conn, col.names=FALSE)
    conn
}

res <- foreach(j = 1:100, .init=conn, .combine='wtab') %dorng% {
    matrix( rnorm( 1e3 , j ) , nrow = 1 )
}

close(conn)

This technique is particularly useful with a parallel backend such as doSNOW or doMPI, which can call the combine function on the fly as results are sent back to the master.
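
As a rough sketch (the cluster setup here is arbitrary and not part of the original answer), the same loop with doSNOW registered instead of doParallel, so that wtab is called as each result arrives rather than after the whole loop has finished:

library(doSNOW)   # snow-based backend; calls the combine function on-the-fly
library(doRNG)

cl <- makeCluster( 7 )
registerDoSNOW( cl )

conn <- file( "~/output.txt" , "w" )
wtab <- function( conn , d ) {
    write.table( d , conn , col.names = FALSE )
    conn
}

res <- foreach( j = 1:100 , .init = conn , .combine = 'wtab' ) %dorng% {
    matrix( rnorm( 1e3 , j ) , nrow = 1 )
}

close( conn )
stopCluster( cl )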

Steve Weston
  • I think I'd better buy your book! – Simon O'Hanlon Mar 11 '13 at 22:35
  • Sometimes the remote server is restarted and I want to make sure I don't lose anything that's already been processed. Will this solution work in this case? Or will I only get results written after the loop is finished? – rrs Feb 21 '18 at 23:59
  • My other problem is that I need to write multiple files inside the loop. – rrs Feb 22 '18 at 04:02
  • @rrs Yes, if you use doSNOW or doMPI, the combine function is called as the results are returned (on-the-fly), so you don't lose results that were already sent to the master. – Steve Weston Feb 22 '18 at 21:24
  • @rrs Multiple output files could be handled by putting all of the open file objects in a list and looping over that list in the combine function. – Steve Weston Feb 22 '18 at 21:26
  • @SteveWeston I appreciate your help here. I tried the above but it doesn't seem to write until the loop has ended. I'm using doParallel on a Mac. Will that not combine results as processed? – rrs Feb 22 '18 at 23:00
  • @SteveWeston doSNOW seems to work but not doParallel. – rrs Feb 23 '18 at 02:29
  • @rrs doParallel doesn't call combine on-the-fly. – Steve Weston Feb 23 '18 at 14:48
  • @SteveWeston thank you. I finally realized that after reading a bunch of questions about the difference between the two. – rrs Feb 23 '18 at 15:18

You could perhaps try something like this instead:

# collect all results first ( .combine = rbind returns a single matrix to the master )
res <- foreach( j = 1:100 , .verbose = TRUE , .inorder = FALSE , .combine = rbind ) %dorng% {
    matrix( rnorm( 1e3 , j ) , nrow = 1 )
}

# then write everything from the master process only
conn <- file( "~/output.txt" , open = "a" )
apply( res , 1 , function ( x , output ) {
    write.table( t( x ) , output , col.names = FALSE )
  } , conn )

close( conn )
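
Since res is a single matrix once .combine = rbind is used, the per-row apply() could also be collapsed into a single call; just a sketch of that simplification:

conn <- file( "~/output.txt" , open = "a" )
write.table( res , conn , col.names = FALSE )   # all 100 rows in one write (row names become 1..100)
close( conn )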

Source: foreach row in a dataframe

didierc
  • I'm sure someone will show up to correct me if there's a mistake in there. – didierc Mar 09 '13 at 14:03
  • 1
    Hi @didierc. Code looks good. I should have clarified in my question that I am looking to output from within the loop. I want to ensure that I am saving results as and when they finish, so that a crash or error on a future iteration of the loop means prior results aren't lost (sometimes you have more loops than nodes, so slaves are queueing for more jobs). It is highly interesting to reflect on whether allowing all loops to complete and then write from one process is most robust, or if as results complete? +1 for solid working code. I hope to get some more opinions. Thanks for your interest! – Simon O'Hanlon Mar 09 '13 at 15:57
  • 1
    I am wondering (for instance) if there is a *good* way for slave processes to first check if they can write to a file, and if they can't to wait until they can before continuing, with some pre-specified timeout period. Or if this is even necessary in `R` and the `foreach` loop. I imagine there must be some kind of locking issues when more than one machine are trying to write to the same file? – Simon O'Hanlon Mar 09 '13 at 15:59
  • 1
    I guess I was right in my initial idea: you need some message queue mechanism - a list should do, where each task would write its result as soon as it is finished, and a different task to process that list. It appears that the `%dorng%` or the `%dopar%` both work on arrays, which would rule them out. I had a quick look at the rmr package providing mapreduce functionalities, but it seems too complex for what you are trying to achieve. – didierc Mar 09 '13 at 16:26
  • Yes, I suppose that would be the outline of how a desired system would work. – Simon O'Hanlon Mar 09 '13 at 16:27
  • I don't know about the internals of R, or the `%do...%` packages. Your intuition that there would be contention around the file output because of a write lock is most probably correct, which is the reason for the producer-consumer idea. Sorry I cannot be of further help. – didierc Mar 09 '13 at 16:32