Multithreading With Very Large Number of Threads

Question

I'm working on simulating a mesh network with a large number of nodes. The nodes pass data between different master nodes throughout the network.

Each master comes live once a second to receive the information, but the slave nodes don't know when the master is up or not, so when they have information to send, they try and do so every 5 ms for 1 second to make sure they can find the master.

Running this on a regular computer with 1600 nodes results in 1600 threads and the performance is extremely bad.

What is a good approach to handling the threading so each node acts as if it is running on its own thread?

In case it matters, I'm building the simulation in python 2.7, but I'm open to changing to something else if that makes sense.

You should consider implementing some kind of (thread-)queueing or at least a dispatcher to manage the information flow. However it's hard to tell without knowing what you want to accomplish. — tamasgal, Sep 30 '13 at 09:22
Running 1600 threads will of course result in bad performance in regular computer. You might need to consider high-performance computer, or you can try to utilize your GPU for more computing power. GPU is good for multithreading. — justhalf, Sep 30 '13 at 09:27
@justhalf Of course it will be bad. Each thread is only active for a few milliseconds each second and sleeping the rest of the time, so I think the problem is not the CPU resources but rather either the number of cores or just the sheer existence of that many threads and the context switches they require. Instead of 1600 I should even say 10,000. I'm looking for a good solution for approximating these nodes running in parallel to maximize the number of nodes I can run. I don't think moving to the GPU will really help here. — Liron, Sep 30 '13 at 09:36
Approximating their full parallel characteristics. Running on a regular CPU leaves me unable to just give each node its own thread and assume they'll just all work hand in hand. For example, I could have one (or a few) control threads, which kick of smaller threads just when one of the nodes has information to send, but most of the time there isn't a thread for each node. — Liron, Sep 30 '13 at 09:40
Please read up on [GlobalInterpreterLock](https://wiki.python.org/moin/GlobalInterpreterLock). — Lasse V. Karlsen, Sep 30 '13 at 09:49
@LasseV.Karlsen - I assume that even using Jython enabling multiple nodes on the CPU wouldn't help me if the thread count was 10K or 100K. Probably a different architecture (and/or implementation language) is going to be necessary here. — Liron, Sep 30 '13 at 11:04

Olof Forshell · Answer 1 · 2013-09-30T09:54:19.307

1

To me, 1600 threads sounds like a lot but not excessive given that it's a simulation. If this were a production application it would probably not be production-worthy.

A standard machine should have no trouble handling 1600 threads. As to the OS this article could provide you with some insights.

When it comes to your code a Python script is not a native application but an interpreted script and as such will require more CPU resources to execute.

I suggest you try implementing the simulation in C or C++ instead which will produce a native application which should execute more efficiently.

edited Sep 30 '13 at 09:54

answered Sep 30 '13 at 09:47

Olof Forshell

3,169
22
28

C# might be worthwhile looking at too. It does threads properly (unlike Python) and is less of a shock to those used to the convenience of Python; C/C++ might prove to be too spartan for convenience. – bazza Sep 30 '13 at 18:17

score 1 · Accepted Answer · edited May 23 '17 at 11:57

1

For one, are you really using regular, default Python threads available in the default Python 2.7 interpreter (CPython), and is all of your code in Python? If so, you are probably not actually using multiple CPU cores because of the global interpreter lock CPython has (see https://wiki.python.org/moin/GlobalInterpreterLock). You could maybe try running your code under Jython, just to check if performance will be better.

You should probably rethink your application architecture and switch to manually scheduling events instead of using threads, or maybe try using something like greenlets (https://stackoverflow.com/a/15596277/1488821), but that would probably mean less precise timings because of lack of parallelism.

edited May 23 '17 at 11:57

Community

1
1

answered Sep 30 '13 at 10:00

Ivan Voras

1,895
1
13
20

I"m using the python that comes with Canopy (enthought). I assume that is regular CPython. – Liron Sep 30 '13 at 10:59
1

From a bit of Googling, it looks like it is. But as others have also suggested, your problem would better be solved by using a different application architecture. – Ivan Voras Oct 01 '13 at 11:39
I switched the architecture to use gevents, with each node running on its own greenlet. For my application, this actually works really well, because I don't need actual parallel functionality, and concurrency is enough. (Basically the Master's greenlet just sets a flag for when the Master is actively receiving connections, and the Slaves query that value on the Master directly.) Thanks for the idea. – Liron Oct 01 '13 at 16:11

Ellioh · Answer 3 · 2014-01-04T19:22:45.167

Do not use threading for that. If sticking to Python, let the nodes perform their actions one by one. If the performance you get doing so is OK, you will not have to use C/C++. If the actions each node perform are simple, that may work. Anyway, there is no reason to use threads in Python at all. Python threads are usable mostly for making blocking I/O not to block your program, not for multiple CPU kernels utilization.

If you want to really use parallel processing and to write your nodes as if they were really separated and exchanging only using messages, you may use Erlang (http://www.erlang.org/). It is a functional language very well suited for executing parallel processes and making them exchange messages. Erlang processes do not map to OS threads, and you may create thousands of them. However, Erlang is a purely functional language and may seem extremely strange if you have never used such languages. And it also is not very fast, so, like Python, it is unlikely to handle 1600 actions every 5ms unless the actions are rather simple.

Finally, if you do not get desired performance using Python or Erlang, you may move to C or C++. However, still do not use 1600 threads. In fact, using threads to gain performance is reasonable only if the number of threads does not dramatically exceed number of CPU kernels. A reactor pattern (with several reactor threads) is what you may need in that case (http://en.wikipedia.org/wiki/Reactor_pattern). There is an excellent implementation of the reactor pattern in boost.asio library. It is explained here: http://www.gamedev.net/blog/950/entry-2249317-a-guide-to-getting-started-with-boostasio/

My goal here is not to increase throughput with multiple threads, but to simulate inter-node communication. There is very little processing on each node. I'll take a look at erlang and see how it looks. — Liron, Sep 30 '13 at 10:55

score 0 · Answer 4 · answered Sep 30 '13 at 19:46

Some random thoughts here:

I did rather well with several hundred threads working like this in Java; it can be done with the right language. (But I haven't tried this in Python.)

In any language, you could run the master node code in one thread; just have it loop continuously, running the code for each master in each cycle. You'll lose the benefits of multiple cores that way, though. On the other hand, you'll lose the problems of multithreading, too. (You could have, say, 4 such threads, utilizing the cores but getting the multithreading headaches back. It'll keep the thread-overhead down, too, but then there's blocking...)

One big problem I had was threads blocking each other. Enabling 100 threads to call the same method on the same object at the same time without waiting for each other requires a bit of thought and even research. I found my multithreading program at first often used only 25% of a 4-core CPU even when running flat out. This might be one reason you're running slow.

Don't have your slave nodes repeat sending data. The master nodes should come alive in response to data coming in, or have some way of storing it until they do come alive, or some combination.

It does pay to have more threads than cores. Once you have two threads, they can block each other (and will if they share any data). If you have code to run that won't block, you want to run it in its own thread so it won't be waiting for code that does block to unblock and finish. I found once I had a few threads, they started to multiply like crazy--hence my hundreds-of-threads program. Even when 100 threads block at one spot despite all my brilliance, there's plenty of other threads to keep the cores busy!

Multithreading With Very Large Number of Threads

4 Answers4