0

Is there a performance difference for file I/O between the following two approaches?

  • Use a queue that is filled by producers and start a task writing to disk after all data has arrived
  • Have a task writing to disk in parallel to producers

The data is written to different files across multiple directories. A separate task for the I/O and `Parallel.ForEach` would be used in both cases.

I would assume that the second version performs better: in theory, the producers and the I/O run truly concurrently. But since I/O causes interrupts to the calling process, I was wondering whether there is a downside, i.e. overhead that outweighs the benefits of parallelism.

Are there situations where I should favor the first solution over the second?

Uwe Keim
  • 1
    Try it and see. It's the only way to be sure. I would guess that generally parallelizing disk IO is probably not going to be a notable benefit as the OS already tends to do a lot of caching behind the scenes. – Glorin Oakenfoot Mar 02 '16 at 20:01
  • 3
    Write the code both ways, get out a stopwatch, and you will know the answer. Anything else is guessing. – Eric Lippert Mar 02 '16 at 20:26

2 Answers

0

I would assume that the second version would perform better

If the multiple directories are still on the same physical drive, you will likely get worse performance with the second option.

There are some edge cases where writing in parallel (limited to only 2 or 3 threads) can be faster. For example, writing thousands of 1 KB files would perform better in a slightly parallel fashion, because the overhead of creating each file outweighs the I/O cost of writing to it. But if you were writing thousands of 1 MB files, then having a single thread do the writing would likely be faster.

An easy way to implement this is with TPL Dataflow: you can have a highly parallel TransformBlock connected to an ActionBlock running on 1 or 2 threads that performs the writes. If you limit the input buffer of the ActionBlock when you set up the link, the TransformBlock will block producers when the pipeline is full, without taking up a lot of memory.
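A minimal sketch of that pipeline, assuming the `System.Threading.Tasks.Dataflow` NuGet package is referenced; the file names, item count, and buffer sizes are made up for illustration:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class DataflowPipeline
{
    static async Task Main()
    {
        // Highly parallel produce stage: turns an input into a (path, bytes) pair.
        var produce = new TransformBlock<int, (string Path, byte[] Data)>(
            i => (Path.Combine("out", $"file{i}.bin"), new byte[1024]),
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = Environment.ProcessorCount,
                BoundedCapacity = 16   // bounded so SendAsync applies backpressure
            });

        // Write stage limited to a single thread and a small input buffer;
        // when the buffer is full, upstream blocks instead of queuing
        // unbounded amounts of data in memory.
        var write = new ActionBlock<(string Path, byte[] Data)>(
            async item =>
            {
                Directory.CreateDirectory(Path.GetDirectoryName(item.Path)!);
                await File.WriteAllBytesAsync(item.Path, item.Data);
            },
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = 1,
                BoundedCapacity = 16
            });

        produce.LinkTo(write, new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 0; i < 100; i++)
            await produce.SendAsync(i);   // waits if the pipeline is full

        produce.Complete();
        await write.Completion;           // wait for all writes to finish
    }
}
```

Setting `MaxDegreeOfParallelism = 2` on the write block gives the "slightly parallel" variant for the many-tiny-files case.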

Scott Chamberlain
  • So sequential writing from a separate task running concurrently to the producers would be a good way to go? This would give the benefit of writing the files immediately and does not create as much overhead. –  Mar 02 '16 at 20:15
  • 1
    If I am understanding what you are saying, yes. An easy way to implement this is with [TPL Dataflow](https://msdn.microsoft.com/en-us/library/hh228603(v=vs.110).aspx): you can have a highly parallel `TransformBlock` connected to an `ActionBlock` running on 1 or 2 threads that performs the writes. If you limit the input buffer of the `ActionBlock` when you set up the link, the `TransformBlock` will block producers when the pipeline is full, without taking up a lot of memory. – Scott Chamberlain Mar 02 '16 at 20:33
  • If you want to see a contrived example of this process in action, see [this old answer](http://stackoverflow.com/questions/35558923/heavy-processing-stage-or-loop-thread/35562294#35562294) of mine: I pass in a file path, load an image asynchronously using a single thread, crop the image synchronously using 5 concurrent threads, then save the image asynchronously using a single thread. – Scott Chamberlain Mar 02 '16 at 20:43
  • Thanks! Does this also make sense if I don't use async/await (sorry if that is a newbie question)? –  Mar 02 '16 at 21:08
  • There is nothing wrong with not using it; I just use async/await because it makes sense with I/O. Just make sure you use the synchronous methods, like [`.Post(`](https://msdn.microsoft.com/en-us/library/mt604569(v=vs.111).aspx) instead of [`.SendAsync(`](https://msdn.microsoft.com/en-us/library/mt604550(v=vs.111).aspx), rather than calling the async methods with a `.Wait()` or `.Result` tacked on. – Scott Chamberlain Mar 02 '16 at 21:10
  • Yes of course. I think it would be good if you included the TPL Dataflow in your answer in case the comments get moved to chat in future. –  Mar 02 '16 at 21:15
0

I'm not sure what you mean by your second approach. I think you're talking about using a concurrent queue of some kind, and a consumer thread that services it. The producers write to that queue. The consumer thread waits for information to be added to the queue, and writes that information to disk. That way, the consumer can be writing to disk while producers are processing and adding things to the queue. There's no need to wait for all information to arrive.

I've had a lot of success using `BlockingCollection` for things like this.

If that's what you're talking about, then it should perform much better than your first option because, as you say, the disk I/O thread and the producer threads are executing concurrently.
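A minimal sketch of that producer/consumer setup; the file names, item count, and capacity are made up for illustration:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class ProducerConsumer
{
    static void Main()
    {
        // Bounded so producers block instead of filling memory.
        using var queue =
            new BlockingCollection<(string Path, byte[] Data)>(boundedCapacity: 16);

        // Single consumer: sequential disk writes, concurrent with the producers.
        var writer = Task.Run(() =>
        {
            foreach (var (path, data) in queue.GetConsumingEnumerable())
            {
                Directory.CreateDirectory(Path.GetDirectoryName(path)!);
                File.WriteAllBytes(path, data);
            }
        });

        // Producers run in parallel and hand their results to the writer.
        Parallel.For(0, 100, i =>
        {
            queue.Add((Path.Combine("out", $"file{i}.bin"), new byte[1024]));
        });

        queue.CompleteAdding();  // no more items; the writer drains and exits
        writer.Wait();
    }
}
```

`GetConsumingEnumerable` blocks until an item is available or `CompleteAdding` has been called, so the writer starts as soon as the first item arrives rather than after all data has been produced.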

Jim Mischel
  • Yes, that is exactly what I mean; I would also use `BlockingCollection`. Scott Chamberlain's answer made me wonder whether I should use one task for writing to disk and go through the data sequentially, or use something like `Parallel.ForEach` within the task. –  Mar 02 '16 at 20:33
  • 2
    With I/O I find [TPL Dataflow](https://msdn.microsoft.com/en-us/library/hh228603(v=vs.110).aspx) nice to work with; it lets you use the async I/O methods and handles all the work of managing the `BlockingCollection` and the `Task`s filling or emptying the collection. – Scott Chamberlain Mar 02 '16 at 20:36
  • @John: I agree with Scott's answer: it's probably not a good idea to have multiple threads doing the disk I/O. – Jim Mischel Mar 02 '16 at 20:56
  • @JimMischel Ok. Still wondering a little why many answers on SO suggest something similar to this http://stackoverflow.com/questions/8505815/how-to-properly-parallelise-job-heavily-relying-on-i-o , which, if I understood correctly, creates multiple tasks for I/O. –  Mar 02 '16 at 21:05