6

I have a C#/.NET application that needs to tell anywhere from 4,000 to 40,000 connected devices to perform a task all at once (or as close to simultaneously as possible).

The application works well; however, I am not satisfied with the performance. In a perfect world, as soon as I send the command I would like to see all of the devices respond simultaneously. Yet, there seems to be a delay as all the threads I have created spin up and perform the task.

I have used the .NET 4.0 ThreadPool, created my own solution using custom threads, and even tweaked the existing ThreadPool to allow more threads to execute at once.

I still want better performance, and that is why I am here. Any ideas? Comments? Suggestions? Thank you.

-Shaun

Let me add that the application notifies these 'connected devices' that they need to go listen for audio on a multicast address.

Shaun McDonnell
  • what kind of network is this? is sending via UDP multicast an option? – NG. Oct 12 '10 at 21:26
  • Yes, telling thousands of devices to go listen for audio on a multicast address. – Shaun McDonnell Oct 12 '10 at 21:27
  • Do you have 4,000 to 40,000 CPUs/Cores? If you do, then you can execute all the threads simultaneously... but if you can multicast on UDP, then why bother with threads? – Kiril Oct 12 '10 at 21:43
  • To send a MultiCast packet you need only 1 thread. – H H Oct 12 '10 at 21:44
  • What I am doing is telling 4000 devices to go listen for audio on a multicast address... which is a little different. – Shaun McDonnell Oct 12 '10 at 21:46
  • You see this: http://stackoverflow.com/questions/145312/maximum-number-of-threads-in-a-net-app – Greg McNulty Oct 12 '10 at 22:12
  • Since you likely have a single (or just a few) network cards... the network requests will go out serially. I totally fail to see the point of threads. – darron Oct 12 '10 at 22:59
  • This really sounds like the wrong solution to a problem, unless you're for some reason testing 4000 instances of a specific device with a specific synchronous API you absolutely have to use. – darron Oct 12 '10 at 23:04
  • 4k devices sounds like a big enough project to justify modifying the devices (or making custom ones) to listen to a single 'start audio' notification packet. Way, way easier. Even with perfect threading and perfect 'start' packets going out to the devices one after the other with no delay.. you're still talking about a lot of packets to send serially over the network and a substantial delay between the first and last devices' notifications (if audio synchronization is the main concern) – darron Oct 12 '10 at 23:10

7 Answers

14

A dual-core hyperthreaded processor MAY be able to execute 4 threads simultaneously, depending on what each thread is doing (no contention on IO or memory access, etc.). A hyperthreaded quad-core, perhaps 8. But 40K just can't physically happen.

If you want near-simultaneous, you're better off spinning up just as many threads as the computer has free cores and having each thread fire off its share of the notifications and then exit. You'll get rid of a bunch of context switching this way.
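A rough sketch of the one-thread-per-core idea, in case it helps. `PerCoreDispatcher`, `Partition` and `NotifyAll` are names made up for this example, and the `notify` delegate stands in for whatever blocking call actually talks to a device (the question doesn't say which API that is).

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class PerCoreDispatcher
{
    // Splits items into at most `parts` contiguous slices of near-equal size.
    public static List<ArraySegment<T>> Partition<T>(T[] items, int parts)
    {
        if (parts < 1) throw new ArgumentOutOfRangeException("parts");
        var slices = new List<ArraySegment<T>>();
        int baseSize = items.Length / parts;
        int remainder = items.Length % parts;
        int offset = 0;
        for (int i = 0; i < parts; i++)
        {
            // The first `remainder` slices take one extra item each.
            int size = baseSize + (i < remainder ? 1 : 0);
            if (size == 0) break;
            slices.Add(new ArraySegment<T>(items, offset, size));
            offset += size;
        }
        return slices;
    }

    // Starts one thread per logical core; each thread walks its own slice
    // sequentially. `notify` is a placeholder for the real device call.
    public static void NotifyAll<T>(T[] devices, Action<T> notify)
    {
        var workers = new List<Thread>();
        foreach (var slice in Partition(devices, Environment.ProcessorCount))
        {
            var s = slice; // local copy so each closure sees its own slice
            var worker = new Thread(() =>
            {
                for (int i = s.Offset; i < s.Offset + s.Count; i++)
                    notify(s.Array[i]);
            });
            workers.Add(worker);
            worker.Start();
        }
        foreach (var started in workers)
            started.Join();
    }
}
```

You'd call it as `PerCoreDispatcher.NotifyAll(devices, d => SendCommand(d))`, where `SendCommand` is your own (hypothetical here) method. The sends within each slice still go out one after another, so this only trims thread overhead; it doesn't make the network any less serial.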

Or, look elsewhere. As SB recommended in the comments, use a UDP multicast to notify listening machines that they should do something.
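For what it's worth, sending one multicast notification from a single thread could look roughly like this. It assumes the devices can be made to join a control multicast group (the question doesn't say they can), and the `LISTEN <address>:<port>` payload is a wire format invented for this sketch, not anything the devices are known to understand.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Text;

public static class MulticastNotifier
{
    // Builds the command datagram. The text format is hypothetical.
    public static byte[] BuildCommand(IPAddress audioGroup, int audioPort)
    {
        return Encoding.ASCII.GetBytes("LISTEN " + audioGroup + ":" + audioPort);
    }

    // Sends one datagram to the control group; every device that has joined
    // the group gets a copy from the network, so no per-device loop is needed.
    public static int Notify(IPEndPoint controlGroup, byte[] payload, int hops)
    {
        using (var client = new UdpClient(AddressFamily.InterNetwork))
        {
            // Number of router hops the multicast datagram may cross.
            client.Client.SetSocketOption(
                SocketOptionLevel.IP, SocketOptionName.MulticastTimeToLive, hops);
            return client.Send(payload, payload.Length, controlGroup);
        }
    }
}
```

Usage would be something like `MulticastNotifier.Notify(new IPEndPoint(IPAddress.Parse("239.1.1.1"), 5000), MulticastNotifier.BuildCommand(IPAddress.Parse("239.1.1.2"), 5004), 1)`, with both group addresses and ports being placeholders. Keep in mind UDP gives no delivery guarantee, so you may want the devices to acknowledge or the sender to repeat the datagram.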

Philip Rieck
12

You cannot execute 4,000 threads simultaneously, let alone 40k. At best, on a desktop box with hyperthreading, you might get up to 8 threads running simultaneously (this assumes quad core). Beyond that, threads are only pseudo-parallel, and that's not even digging into the issues of bus contention.

If you absolutely need simultaneity for 40k devices, you want some form of hardware synchronization.

Randolpho
  • And I'd be willing to bet any hardware synchronization system that can execute 40k nodes simultaneously is going to be uber-expensive. – Randolpho Oct 12 '10 at 21:28
  • Appreciate your response. I would like to think that this is possible; however, only because I believe I have seen some applications do it. That said, maybe it was hardware-based like you said. Thanks. – Shaun McDonnell Oct 12 '10 at 21:29
5

It sounds like you have some control over what software runs on each device. In which case, you could look to HPC usage and architect your devices (nodes) hierarchically and/or use MPI to execute your remote processes.

For the hierarchy example: designate, say, 8 nodes as primary masters, each with 8 slave nodes; each slave can in turn act as a master with 8 slaves of its own (you might need an automated subscription algorithm to set this up). You will need a hierarchy 6 levels deep to cover 40,000 nodes. Each master continually runs a small piece of code that waits for instructions to pass on to its slaves.

All you then do is pass the instruction to the 8 primary masters, and the masters propagate it asynchronously across the 'cluster' on the wire. The instruction only has to be passed on a maximum of 5 times, so it propagates very quickly.
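To check the arithmetic behind the "6 deep" and "5 times" figures, here is a small sketch. `FanOutTree` and `LevelsNeeded` are names invented for this example; it counts the 8 primary masters as level 1 and assumes every node has exactly `fanOut` children.

```csharp
using System;

public static class FanOutTree
{
    // Smallest number of levels such that
    // fanOut + fanOut^2 + ... + fanOut^levels >= nodeCount.
    public static int LevelsNeeded(long nodeCount, int fanOut)
    {
        if (fanOut < 2) throw new ArgumentOutOfRangeException("fanOut");
        int levels = 0;
        long covered = 0;   // nodes reachable so far
        long levelSize = 1; // nodes on the current level
        while (covered < nodeCount)
        {
            levelSize *= fanOut;
            covered += levelSize;
            levels++;
        }
        return levels;
    }
}
```

With a fan-out of 8, five levels cover 8 + 64 + 512 + 4,096 + 32,768 = 37,448 nodes, which falls just short of 40,000, so `FanOutTree.LevelsNeeded(40000, 8)` returns 6. Six levels means at most 5 master-to-slave hand-offs after the initial send, and 4,000 nodes would fit in 4 levels.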

Alternatively (or in conjunction), you could look at MPI, which is an off-the-shelf solution. There are some established C# implementations.

jnielsen
4

The overhead of creating thousands of threads is (very) significant; I would seek an alternative solution. This sounds like a job for asynchronous IO: your computer presumably only has one network connection, so no more than one message can be sent at a time - threads cannot improve on this!

Rafe
3

Am I correct in guessing that you're using a synchronous API call on your device, which is why it must be executed in a thread? Does the API have an asynchronous version of the call? If the device API can really support 40k+ devices, then it should. It should also have internal handling of whatever wait handles (or equivalent) are required to synchronize the return data for callback. This isn't something you can handle at the client application side; you don't have enough visibility of the underlying implementation of the device API to know how to parallelize the tasks. As you've discovered, creating 40k threads with blocking calls doesn't cut it.

Dan Bryant
2

Always fun with these old ones.

At 1 MB per thread, you need 4-40 GB of RAM at minimum, plus 4k-40k cores. And you still have a single network to send everything over.

That means the sends will be serialized somewhere along the way: at the nearest switch or router, and most likely already on your network card (if you could even get all the packets there at the same time, and it managed to send them without buffering them or dying on you). In short, all that multithreading work was for nothing, because the packets will not reach the endpoints simultaneously.
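The memory figure is easy to reproduce. This sketch takes the 1 MB-per-thread stack size above at face value (the actual reservation depends on platform and settings, and reserved stack is not necessarily committed RAM); `ThreadStackEstimate` and `ReservedGiB` are names invented for the example.

```csharp
public static class ThreadStackEstimate
{
    // Total stack reservation in GiB for `threadCount` threads,
    // assuming each one reserves `stackMiBPerThread` MiB.
    public static double ReservedGiB(int threadCount, int stackMiBPerThread)
    {
        return threadCount * (double)stackMiBPerThread / 1024.0;
    }
}
```

`ThreadStackEstimate.ReservedGiB(4000, 1)` comes to about 3.9 GiB and `ThreadStackEstimate.ReservedGiB(40000, 1)` to about 39 GiB, in line with the 4-40 GB range quoted above.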

Think of it as a 40,000-lane road with 40,000 cars on it. Sure, everyone reaches the same point on the road at the same time, but then they leave the road and go home. Everyone gets home at a different time, even though they all started driving on the 40k-lane road at the same place and time.

You just cannot beat the physical realm (yet...).

Thomas Andreè Wang
2

You should do async IO to the devices. This is very efficient and uses a different (larger) set of threads to handle some of the work. The devices will certainly receive the commands much faster. The IO thread pool will handle the replies (if any).
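A minimal sketch of what that might look like, assuming each device accepts its command as a UDP datagram (the question doesn't say which protocol the devices speak) and assuming .NET 4.5 or later for `async`/`await`; on .NET 4.0 you would use the `Begin`/`End` socket methods instead. `AsyncNotifier` and `NotifyAllAsync` are names invented for this example.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class AsyncNotifier
{
    // Queues one asynchronous send per device on a single socket and
    // returns the number of datagrams sent once all of them complete.
    // No thread is blocked per device while the sends are in flight.
    public static async Task<int> NotifyAllAsync(
        IEnumerable<IPEndPoint> devices, byte[] payload)
    {
        using (var client = new UdpClient(AddressFamily.InterNetwork))
        {
            var sends = new List<Task<int>>();
            foreach (var device in devices)
                sends.Add(client.SendAsync(payload, payload.Length, device));

            int[] bytesSent = await Task.WhenAll(sends);
            return bytesSent.Length;
        }
    }
}
```

You'd call it with `await AsyncNotifier.NotifyAllAsync(deviceEndPoints, command)`. The datagrams still leave the network card one at a time, so this removes the thread overhead rather than the serialization on the wire.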

pm100