I have a piece of C# 5.0 code that generates a ton of network and disk I/O. I need to run multiple copies of this code in parallel. Which of the following technologies is likely to give me the best performance:

  • async methods with await

  • using Task from the TPL directly

  • the TPL Dataflow NuGet package

  • Reactive Extensions

I'm not very good at this parallel stuff, but if using something lower-level, like say Thread, can give me a lot better performance, I'd consider that too.

Eliezer Kohen

3 Answers

This is like trying to shorten your transatlantic flight by asking for the quickest way to remove your seatbelt.

Ok, some real advice, since I was kind of a jerk

Let's give a helpful answer. Think of performance in terms of "classes" of activity, each one at least an order of magnitude slower than the one before:

  1. Only accessing the CPU, with very little memory usage (e.g. rendering very simple graphics to a very fast GPU, or calculating digits of Pi)
  2. Only accessing the CPU and in-memory things, nothing on disk (e.g. a well-written game)
  3. Accessing the disk
  4. Accessing the network

If you do even one activity from class #3, there's no point in making the optimizations typical of classes #1 and #2, such as tuning threading libraries: they're completely overshadowed by the disk hit. The same goes for CPU tricks: if you're constantly incurring L2/L3 cache misses, sparing a few CPU cycles by hand-writing assembly isn't worth it (which is why things like loop unrolling are usually a bad idea these days).

So, what can we derive from this? There are two ways to make your program faster: either move up from #3 to #2 (which isn't often possible, depending on what you're doing), or do less I/O. Disk and network I/O is the rate-limiting factor in most modern applications, and that's what you should be trying to optimize.

Ana Betts
  • Another way to make the program faster is to perform the IO in a *smarter* way. For example, with classical HDDs, it's often faster to *not* perform IO in parallel, because it leads to more seeking (which is slow). – svick Apr 17 '13 at 10:28
  • ^^ This is a good idea too. It's fairly easy to detect an SSD by measuring the speed of a sequential read, measuring the speed of reading random sectors on disk, then comparing the two. If they're similar, you've got an SSD – Ana Betts Apr 21 '13 at 22:40
  • @svick I think this is wrong: parallel IO is faster, because SCAN and elevator algorithms will actually be able to schedule the reads better – yoel halb Jun 16 '14 at 17:40
  • I think you are incorrect, because two explicit sets of sequential reads to disparate places will always have fewer seeks than parallel access to both places, even if you're rearranging I/O – Ana Betts Jun 16 '14 at 22:21
  • Brilliant article. I can also add that hardware has a big effect on #3. A standard spinning HD might do 100 IOPS (I/O operations per second), a mid-range SSD might do 6,000 IOPS, and a high-end PCI Express based SSD might do somewhere between 100,000 IOPS and 10 million IOPS. And readers 20 years in the future will laugh patronisingly at these numbers. – Contango Oct 29 '15 at 15:04
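
The point raised in the comments, that unrestricted parallel disk I/O can hurt on spinning disks, can be illustrated by capping how many operations run at once. The following is only a sketch under stated assumptions: `ThrottledIoSketch`, `RunThrottledAsync`, and the `work` delegate are hypothetical names, and `Task.Delay` stands in for a real read. Whether a cap of 1, 2, or more is best depends on the hardware and should be measured.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledIoSketch
{
    // Runs `work` for ids 0..count-1 while allowing at most `maxConcurrent`
    // operations in flight. Returns the highest concurrency actually observed.
    public static async Task<int> RunThrottledAsync(int count, int maxConcurrent, Func<int, Task> work)
    {
        var gate = new SemaphoreSlim(maxConcurrent);
        object sync = new object();
        int inFlight = 0;
        int peak = 0;

        Task[] tasks = Enumerable.Range(0, count).Select(async id =>
        {
            await gate.WaitAsync();              // wait for a free slot without blocking a thread
            try
            {
                lock (sync) { inFlight++; if (inFlight > peak) peak = inFlight; }
                await work(id);                  // the (hypothetical) I/O operation
            }
            finally
            {
                lock (sync) { inFlight--; }
                gate.Release();                  // free the slot for the next operation
            }
        }).ToArray();

        await Task.WhenAll(tasks);
        return peak;
    }

    public static void Main()
    {
        // Task.Delay is a placeholder for a real disk read.
        int peak = RunThrottledAsync(10, 2, id => Task.Delay(50)).GetAwaiter().GetResult();
        Console.WriteLine("Peak concurrent operations: " + peak);
    }
}
```

Setting `maxConcurrent` to 1 gives strictly sequential access, which makes it easy to compare both approaches on the same disk.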

Any performance difference between these options would be inconsequential in the face of "a ton of network and disk I/O".

A better question to ask is "which option is easiest to learn and develop with?" Or "which option would be best to maintain this code with five years from now?" And for that I would suggest async first, or Dataflow or Rx if your logic is better represented as a stream.
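
As a rough illustration of the "async first" suggestion, here is a minimal sketch of running several copies of an I/O-bound operation concurrently with `Task.WhenAll`. The names `ParallelIoSketch`, `RunOneCopyAsync`, and `RunAllAsync` are hypothetical, and the temp-file round trip merely stands in for the real network and disk work from the question.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

public static class ParallelIoSketch
{
    // Hypothetical stand-in for one "copy" of the I/O-heavy code:
    // writes a temp file asynchronously, reads it back, returns the bytes read.
    public static async Task<int> RunOneCopyAsync(int id)
    {
        string path = Path.Combine(Path.GetTempPath(),
            "io-sketch-" + id + "-" + Guid.NewGuid().ToString("N") + ".tmp");
        byte[] payload = Encoding.UTF8.GetBytes("payload for copy " + id);
        try
        {
            using (var output = new FileStream(path, FileMode.Create, FileAccess.Write,
                                               FileShare.None, 4096, useAsync: true))
            {
                await output.WriteAsync(payload, 0, payload.Length);
            }

            using (var input = new FileStream(path, FileMode.Open, FileAccess.Read,
                                              FileShare.Read, 4096, useAsync: true))
            {
                var buffer = new byte[payload.Length];
                int total = 0;
                int read;
                // ReadAsync may return fewer bytes than requested, so loop until done.
                while (total < buffer.Length &&
                       (read = await input.ReadAsync(buffer, total, buffer.Length - total)) > 0)
                {
                    total += read;
                }
                return total;
            }
        }
        finally
        {
            File.Delete(path); // clean up the temp file
        }
    }

    // Starts all copies first, then awaits them together.
    public static async Task<int[]> RunAllAsync(int copies)
    {
        Task<int>[] tasks = Enumerable.Range(0, copies)
                                      .Select(i => RunOneCopyAsync(i))
                                      .ToArray();
        return await Task.WhenAll(tasks);
    }

    public static void Main()
    {
        // C# 5.0 has no async Main, so a console app blocks here once.
        int[] results = RunAllAsync(4).GetAwaiter().GetResult();
        Console.WriteLine("Completed " + results.Length + " copies");
    }
}
```

No thread is blocked while an operation waits on the disk, which is the main reason this style tends to be easy to scale and maintain for I/O-bound work.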

Stephen Cleary

It's an older question, but for anyone reading this...

It depends. If you try to saturate a 1 Gbps link with 50-byte messages, you will be CPU-bound even with simple non-blocking sends over raw sockets. If, on the other hand, you are happy with 1 Mbps throughput or your messages are larger than 10 KB, any of these frameworks will do the job.

For low-bandwidth situations, I would recommend prioritizing by ease of use, i.e. async/await, Dataflow, Rx, TPL, in that order. Note that even a high-bandwidth application should be prototyped as if it were low-bandwidth and optimized later.

For a truly high-bandwidth application, I can recommend Dataflow over Rx, because Rx is not designed for high concurrency. Raw TPL is the bottom layer, which gives the lowest overhead if you can handle the complexity. If you can make efficient use of dedicated threads, that would be even faster. Async/await vs. Dataflow IMO doesn't make any performance difference. The overhead seems comparable, so choose the one that's a better fit.
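
For comparison, a Dataflow version of the same idea might look like the sketch below. It assumes the `System.Threading.Tasks.Dataflow` package is referenced; `DataflowSketch` and `ProcessAllAsync` are hypothetical names, and `Task.Delay` is a placeholder for the real send or read. It makes no claim about relative performance, it only shows the shape of the code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class DataflowSketch
{
    // Pushes `messageCount` work items through an ActionBlock that handles
    // up to `maxParallel` of them concurrently. Returns how many were processed.
    public static async Task<int> ProcessAllAsync(int messageCount, int maxParallel)
    {
        int processed = 0;

        var worker = new ActionBlock<int>(
            async id =>
            {
                await Task.Delay(10);                  // placeholder for real I/O
                Interlocked.Increment(ref processed);  // handlers may run concurrently
            },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxParallel });

        for (int i = 0; i < messageCount; i++)
        {
            worker.Post(i);   // the block is unbounded by default, so Post accepts every item
        }

        worker.Complete();        // signal that no more items will arrive
        await worker.Completion;  // wait for all queued items to finish
        return processed;
    }

    public static void Main()
    {
        int processed = ProcessAllAsync(20, 4).GetAwaiter().GetResult();
        Console.WriteLine("Processed " + processed + " messages");
    }
}
```

The block owns the queue and the concurrency limit, so the calling code stays close to the plain async/await version while gaining a single place to tune parallelism.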

Robert Važan