
I am looking for ideas on how a concurrency framework might be implemented for my specific architecture, in C#:

I implemented several modules/containers (implemented as classes) that each connect individually to a message bus. Each module either mainly produces or mainly consumes, but all modules also implement a request/reply pattern for communication between two given modules. I am very new to concurrent and asynchronous programming, but essentially I want to run the whole architecture concurrently rather than synchronously. I would really appreciate some pointers on which technology (TPL, ThreadPool, Async CTP, open source libraries, ...) to consider for my specific use case, given the following requirements:

  • The whole system only runs on a local machine (in-process, even the message bus)
  • At least one module performs heavy I/O (reading several million 16-byte messages per second from a physical drive) and publishes 16-byte chunks to a blocking collection for the entire run.
  • Another module consumes from the blocking collection for the entire run.
  • The entry point is the producer starting to publish messages; the system exits when the producer finishes publishing a finite set of 16-byte messages.
  • For throughput and latency reasons, the only communication that circumvents the message bus is the publishing/consuming to/from the blocking collection. (I am happy to hear suggestions for getting rid of the message bus if that is plausible.)
  • Other modules handle operations such as writing to an SQL database, publishing to a GUI server, and connecting to APIs that communicate with outside servers. Such operations run less frequently or are throttled, and could potentially run as tasks rather than occupying a whole thread for the lifetime of the system.
  • I run on a 64-bit, quad-core, 16 GB machine, but ideally I would like a solution that can also run on a dual-core machine.

Given what I need to manage, which concurrency implementation would you suggest I focus on?

EDIT: I would like to emphasize that the biggest problem I am facing is how to conveniently hook up each container/module to a thread/task pool so that each module runs asynchronously while still providing full two-way communication between the modules. I am not too concerned with optimizing a single producer/consumer pattern before I have solved hooking up all the modules to a concurrency platform that can dynamically handle the number of tasks/threads involved.
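As a concrete sketch of the core pattern described above (the names, queue bound, and message count are illustrative, not from the actual architecture): a bounded `BlockingCollection<byte[]>` sits between a long-running producer task and a long-running consumer task, with `CompleteAdding` acting as the exit signal.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ProducerConsumerSketch
{
    public static void Main()
    {
        // Bounded queue: the producer blocks instead of flooding memory.
        var queue = new BlockingCollection<byte[]>(boundedCapacity: 4096);

        // Whole-lifetime producer: LongRunning requests a dedicated thread
        // instead of occupying a thread-pool slot.
        var producer = Task.Factory.StartNew(() =>
        {
            for (int i = 0; i < 100000; i++)
                queue.Add(new byte[16]);   // stand-in for a 16-byte disk read
            queue.CompleteAdding();        // signals the exit condition
        }, TaskCreationOptions.LongRunning);

        long consumed = 0;
        var consumer = Task.Factory.StartNew(() =>
        {
            // Blocks while the queue is empty and ends cleanly
            // once CompleteAdding has been called and it is drained.
            foreach (byte[] msg in queue.GetConsumingEnumerable())
                consumed++;
        }, TaskCreationOptions.LongRunning);

        Task.WaitAll(producer, consumer);
        Console.WriteLine(consumed); // 100000
    }
}
```

The bounded capacity gives back-pressure: if the consumer falls behind, the producer blocks rather than allocating unboundedly.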

Matt
  • When you read your data from disk you won't get much faster than 20 MB/s if you read string data (see http://stackoverflow.com/questions/7153315/how-to-parse-a-text-file-in-c-sharp-and-be-io-bound). From your requirements you want to read 16-48 MB/s with .NET. This can be achieved, but it is at the limit of what you can get with one process. You are mainly GC-limited in this scenario. I would switch to C++ or managed C++ to get the desired read performance without much GC overhead, and then go back to .NET. – Alois Kraus Mar 20 '12 at 13:04
  • Your architecture is still unclear. How is your message bus implemented? '16-byte messages': there is an issue. No matter how you implement your system, this message size is too small for efficient inter-thread comms. Can you chunk them up and enqueue 4K blocks of messages? – Martin James Mar 20 '12 at 13:13
  • @Alois, I strongly disagree with you. First, I pointed out that I read byte arrays (16-byte messages) and not string data. Secondly, I currently handle a read throughput of 24 million 16-byte messages per second unparsed, and 10 million/sec parsed, on a single thread. Show me how you get faster than that with a C++ implementation and I will happily port to C++. Caveat: I achieve such rates purely in-proc; it goes down to 10 million messages/sec unparsed and 6 million including parsing when transporting over the message bus in multiple 16-byte chunks. – Matt Mar 20 '12 at 13:18
  • @Martin, yes sorry I will edit my post, I can easily chunk up the data and then publish to the collection. In fact I do chunk it up with my ZeroMQ implementation. – Matt Mar 20 '12 at 13:19
  • @Alois, adding to the above, I even get similar results with random read access. I operate on very large binary files (5 GB+) and implemented a binary search algorithm on the binary data that makes any mapping or lookup tables obsolete. Pure binary data, millisecond lookup time to find the start and end point of the segment to be read, and essentially the same throughput as if I read the file from beginning to end. I strongly doubt this can be further optimized in C++. – Matt Mar 20 '12 at 13:32
  • @Freddy: 24 million × 16 bytes/s = 384 MB/s. Are you using a RAID or an SSD? This is about the native performance the disc can give you as a byte array, but there is no way you can parse such a stream and create objects from it in .NET at such a rate. If your implementation streams the data directly to the consumer without allocating gigabytes of scheduled messages for your receiver threads to process, you can make it work. But then you are better off processing the messages synchronously to keep memory usage low. – Alois Kraus Mar 20 '12 at 14:09
  • I run on a SATA 3 physical drive which easily handles the throughput. I ran a test on my new OCZ Vertex 3 Max IOPS drive and did not get much faster. I believe the reason is that in general SSDs are not that much faster in sequential throughput. Where they shine is random access, meaning if you throw several threads at the I/O you can get a lot more combined throughput than with a physical drive. I mentioned that I still process 10 million messages/second including parsing the byte array in .NET into several primitive variables. – Matt Mar 20 '12 at 14:15

3 Answers


I found NAct (http://code.google.com/p/n-act/), an actor framework for .NET which implements pretty much what I am looking for. I said in my question that I was looking for bigger-picture framework suggestions, and it looks to me like an actor framework solves what I need. I am not saying that the NAct library is what I will implement, but it is a neat example of setting up actors that communicate asynchronously and run on their own threads. Message passing also supports the new C# 5 async/await functionality.

Disruptor was mentioned above, as well as the TPL and a couple of other ideas, and I appreciate the input. It really got me thinking, and I spent quite a bit of time understanding what each library/framework targets and what problems it tries to solve, so the input was very fruitful.

For my particular case, however, I believe the actor model is exactly what I need because my main concern is asynchronous data flow. Unfortunately I do not yet see the actor model implemented in many .NET technologies. TPL Dataflow looks very promising, but as weismat pointed out it is not yet production ready.

If NAct does not prove stable or usable, I will look at a custom implementation on top of the TPL. It is about time anyway to fully understand all that the TPL has to offer and to start thinking concurrently at the design stage, rather than trying to transfer synchronous models into an asynchronous framework.

In summary, "Actor Model" was what I was looking for.
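To make the conclusion concrete, here is a minimal, hypothetical actor sketch built only on the TPL and BlockingCollection (this is not NAct's API): each actor owns a private mailbox drained by a single long-running task, so the handler never runs concurrently with itself and needs no locks, and modules communicate only by posting messages.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Minimal actor: one mailbox, one pump task. The handler is always
// invoked from that single task, so actor state needs no locking.
class Actor<T>
{
    private readonly BlockingCollection<T> mailbox = new BlockingCollection<T>();
    private readonly Task pump;

    public Actor(Action<T> handler)
    {
        pump = Task.Factory.StartNew(() =>
        {
            foreach (T msg in mailbox.GetConsumingEnumerable())
                handler(msg);
        }, TaskCreationOptions.LongRunning);
    }

    // Fire-and-forget message send; returns immediately.
    public void Post(T message) { mailbox.Add(message); }

    // Drains remaining messages, then joins the pump task.
    public void Stop() { mailbox.CompleteAdding(); pump.Wait(); }
}

class ActorDemo
{
    public static void Main()
    {
        long sum = 0;
        var adder = new Actor<int>(n => sum += n);
        for (int i = 1; i <= 100; i++) adder.Post(i);
        adder.Stop();              // Wait() makes the final sum visible here
        Console.WriteLine(sum);    // 5050
    }
}
```

A request/reply pattern can be layered on top by including a reply actor (or a TaskCompletionSource) inside the message itself.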

Matt
  • I've been working with NAct for a couple of weeks. I love the premise, though there do appear to be some holes in the implementation. For instance, I've found no out-of-the-box way for an actor to sync onto its own proxy. In fact, the entire concept of multiple proxies doesn't make sense to me, so I created a base class that fetches a single proxy and provides a way to sync other threads onto itself. I'm struggling right now with trying to use NAct with Tasks; it seems like there are problems there. Whatever happens, the approach is definitely the best; worst case I'll fork or make an alternative. – N8allan Aug 15 '13 at 18:59
  • Can't edit my message; too many minutes. :-( The problems with Tasks were mine, BTW. I should also clarify what's better about the approach: in NAct, you just invoke methods on a proxy rather than constructing message structures. This makes good sense given that what actors inevitably end up with is a big message-processing loop with a switch statement in the middle. This avoids all that. I've coupled the Stateless state machine with NAct, which seems like a great combo so far. – N8allan Aug 15 '13 at 19:30
  • Sorry, but I ended up never using NAct, so I am not aware of where that project stands. I use TPL Dataflow for in-proc message passing and ZeroMQ to send over the wire. – Matt Aug 16 '13 at 00:24

I recommend disruptor-net for a task like this, where you have high throughput, low latency, and a well-defined dataflow.

If you're willing to sacrifice some performance for some thread management, TPL Dataflow might work for you. It does a good job of using TPL for task scheduling.
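For illustration, a small TPL Dataflow pipeline along these lines (assuming the System.Threading.Tasks.Dataflow package; the parsing step is a hypothetical stand-in): a TransformBlock parses 16-byte chunks and a linked ActionBlock consumes the results, with the TPL handling all task scheduling.

```csharp
using System;
using System.Threading.Tasks.Dataflow; // NuGet: System.Threading.Tasks.Dataflow

class DataflowSketch
{
    public static void Main()
    {
        long count = 0;

        // Parse each 16-byte chunk into two longs (placeholder parsing).
        var parse = new TransformBlock<byte[], Tuple<long, long>>(b =>
            Tuple.Create(BitConverter.ToInt64(b, 0), BitConverter.ToInt64(b, 8)));

        // Downstream consumer: DB writes, GUI updates etc. would go here.
        // ActionBlock runs with MaxDegreeOfParallelism = 1 by default,
        // so the count++ below needs no synchronization.
        var sink = new ActionBlock<Tuple<long, long>>(t => count++);

        // Completion propagates from parse to sink automatically.
        parse.LinkTo(sink, new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 0; i < 1000; i++)
            parse.Post(new byte[16]);

        parse.Complete();
        sink.Completion.Wait();
        Console.WriteLine(count); // 1000
    }
}
```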

Gabe
  • Gabe, thanks for the link; I will look into it. It sounds quite interesting, but I need to check what throughput the library supports. Just glancing over the link I saw 100k/sec message transfers mentioned, which sounds quite promising as I can chunk up my messages. – Matt Mar 20 '12 at 13:22
  • @Freddy: I get over 1M messages/sec on my laptop, and that's with two threads communicating both directions (X sends to Y and then waits for Y to send a message back). Your setup should support 10M/sec easily. – Gabe Mar 20 '12 at 13:44
  • Watching the video right now ;-) Will get back after playing with the library. – Matt Mar 20 '12 at 13:55
  • Gabe, I dug a bit into Disruptor. If I understand the pattern correctly, it optimizes message passing between producer and consumer. Using this library, would I still be responsible for managing threads/tasks if I wanted to hook up all the mentioned modules/containers? Again, I understand this library more as an optimization between producer/consumer, rather than a library such as SmartThreadPool that handles inter-thread communication between an arbitrary number of producers and consumers in a single app. I admit I am still very new to concurrency and would appreciate any pointers... – Matt Mar 21 '12 at 04:48
  • @Freddy: Disruptor is definitely not a thread pool, so you would be responsible for some thread management. – Gabe Mar 21 '12 at 05:18
  • Gabe, thanks, I just found that out as well. I guess it won't solve my problem then. I have a sufficiently fast way to publish to and consume from a concurrent collection. What I am looking for is a library that makes thread/task management a breeze, as I do not possess much concurrency experience. Any ideas or suggestions on whether there are .NET libraries out there that may help? I came across SmartThreadPool, but it seems slightly dated. – Matt Mar 21 '12 at 08:06
  • Gabe, a follow-up question if I may: can I utilize one single ring buffer in Disruptor and still match up certain producers/consumers? For example, consumers 1 and 2 should only take byte arrays from producer 1, while consumer 3 consumes byte arrays from producer 2, and consumers 4 and 5 from producer 3, only. Each producer produces byte arrays that deserialize into completely different objects. Or do I need to start up several disruptors to handle such a scenario? – Matt Mar 21 '12 at 08:10
  • You can certainly use a single ring buffer, but since there's no overlap between consumers, I don't think it would provide any benefit. Also, see my edit about TPL Dataflow. – Gabe Mar 21 '12 at 09:57
  • Gabe, thanks, I will check into that. I am really confused because MSFT comes out with a lot of new stuff and I don't yet fully understand each benefit and tradeoff between TPL, ThreadPool, Rx, CPT, TPL Dataflow, async/await patterns... Is there a blog or writeup that compares all those technologies? – Matt Mar 21 '12 at 11:16
  • I've never even heard of CPT! – Gabe Mar 21 '12 at 11:35
  • Sorry, my mistake; Async CTP: http://msdn.microsoft.com/en-us/vstudio/gg316360 – Matt Mar 21 '12 at 12:04

You may look into the Concurrency and Coordination Runtime (CCR) as well if you are looking for a framework-based concurrency solution. I think it might be a fit for your design ideas.
Otherwise I would follow the rule that threads should be used for work that runs for the whole lifetime of your application, and tasks for short-running items.
I believe it is more important that the responsibility for the concurrency is clearly defined, so that you can change the framework later.
As usual when writing fast code, there are no rules of thumb; you need a lot of testing with small stubs, measuring the actual performance.
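The thread-versus-task rule above can be sketched as follows (the module names are illustrative, mapped onto the question's architecture):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class LifetimeSketch
{
    public static void Main()
    {
        // Whole-lifetime work (e.g. the disk-reading producer): a dedicated
        // thread, so it never competes with short jobs for pool threads.
        var reader = new Thread(() => { /* read/publish loop goes here */ });
        reader.IsBackground = true;
        reader.Start();

        // Short, infrequent/throttled work (one SQL write, one API call):
        // thread-pool tasks that release their thread when done.
        Task dbWrite = Task.Factory.StartNew(() => { /* single DB insert */ });
        Task apiCall = Task.Factory.StartNew(() => { /* single API request */ });

        Task.WaitAll(dbWrite, apiCall);
        reader.Join(); // here the empty loop ends immediately
    }
}
```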

weismat
  • This is by far the most complex threading library for .NET. When he fully understands this library he will have mastered not only threading but asynchronous programming as well. Siemens uses this library for the US Postal Service to route 100 million letters per day. Not sure whether it scales to his performance requirements. – Alois Kraus Mar 21 '12 at 10:30
  • Weismat, thanks, I will look into it. Complex or not, I am looking for something that solves my problem, and if it fits the bill I will learn it. – Matt Mar 21 '12 at 11:17
  • Forgot one thing which is mentioned too rarely: use server mode for the garbage collection. This will make a very significant performance difference in your usage scenario. The CPU load will increase, but so will the performance, dramatically. – weismat Mar 21 '12 at 11:58
  • How does it compare to the new TDF (TPL Dataflow) library? Conceptually my project heavily involves dataflow (communication between asynchronously running "agents"). I read the TDF whitepaper and they introduce a lot of communication patterns that really fit what I am attempting to do. However, I do not want to spend hours and days delving into different concepts only to find out they do not fit my purpose. Would you have an idea of how TDF and the Concurrency and Coordination Runtime compare? – Matt Mar 21 '12 at 12:02
  • AFAIK TPL Dataflow is not officially production ready yet. – weismat Mar 21 '12 at 12:53
  • From the white paper about TPL Dataflow: "TDF can be thought of as a logical evolution of the CCR for non-robotics scenarios, incorporating CCR's key concepts into TPL and augmenting the enhanced model with core programming model aspects of the Asynchronous Agents Library". – weismat Mar 21 '12 at 13:15
  • Stephen Toub: ... TPL Dataflow is in part based on and inspired by concepts from CCR, along with concepts from Axum and Visual C++ 10's Asynchronous Agents library, so you'll see a lot of similarities in terms of the kinds of problems you can solve. The APIs were redesigned to fit in well with the rest of the .NET Framework and to take advantage of what the Task Parallel Library and other .NET goodies have to offer, as well as redesigned to incorporate some more scenarios and patterns we felt were important. ... from http://channel9.msdn.com/Shows/Going+Deep/Stephen-Toub-Inside-TPL-Dataflow – Alois Kraus Mar 21 '12 at 13:25
  • Weismat & Alois, I watched pretty much everything there is about TPL Dataflow, followed blogs, and read whitepapers. I came to one conclusion: while many pro developers seem to appreciate bits and pieces of the extensions, they all seem to share one common issue: they seem very lost when it comes to understanding the use cases and which scenarios are suitable for which library/technology. While Toub is generally very clear in his presentations, IMHO he could not communicate well who should really use TPL Dataflow. Maybe that is just my perception (but it seems to me like a typical Microsoft issue). – Matt Mar 22 '12 at 04:46
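For reference, the server-GC suggestion from the comments above is a one-line change in the executable's app.config (for a standalone .NET process; the surrounding elements are the standard configuration skeleton):

```xml
<configuration>
  <runtime>
    <!-- Server GC: per-core heaps and dedicated GC threads. Typically
         higher throughput for allocation-heavy, multi-core workloads,
         at the cost of higher memory usage. -->
    <gcServer enabled="true"/>
  </runtime>
</configuration>
```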