
Asynchronous and other event-based programming paradigms seem to be spreading like wildfire these days, with the popularity of node.js, Python 3.5's recent async improvements, and whatnot.

Not that I particularly mind this or that I haven't already been doing it for a long time myself, but I've been trying to wrap my head around the real reasons why. Searching around for the evils of synchronous programming consistently seems to net the preconceived notion that "you can't have a thread for each request", without really qualifying that statement.

Why not, though? A thread might not be the cheapest resource one could think of, but it hardly seems "expensive". On 64-bit machines, we have more than enough virtual address space to handle all the threads we could ever want, and, unless your call chains are fairly deep, each thread shouldn't need more physical RAM than a single page* for its stack, plus whatever little overhead the kernel and libc add. As for performance, my own casual testing shows that Linux can handle well over 100,000 thread creations and tear-downs per second on a single CPU, which hardly seems like a bottleneck.
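
(For concreteness, the kind of casual test I mean is roughly the sketch below, a reconstruction for illustration rather than the exact code behind the figure: create and join no-op threads in a loop, then divide by the elapsed wall-clock time.)

    /* Sketch of a thread creation/teardown benchmark.
       Build with: cc -O2 threadbench.c -o threadbench -lpthread */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    static void *noop(void *arg) { return arg; }

    int main(void) {
        const int n = 100000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < n; i++) {
            pthread_t t;
            if (pthread_create(&t, NULL, noop, NULL) != 0) {
                perror("pthread_create");
                return 1;
            }
            pthread_join(t, NULL);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d create+join cycles in %.2fs (%.0f/s)\n", n, secs, n / secs);
        return 0;
    }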

That being said, it's not like I think event-based programming is all just a ruse, seeing as how it seems to have been the primary driver allowing HTTP servers like lighttpd/nginx/whatever to overtake Apache in highly concurrent performance**. However, I've searched for some kind of actual inquiry into why massively multithreaded programs are slower, without being able to find any.

So then, why is this?


*My testing seems to show that each thread actually requires two pages. Perhaps there's some dirtying of the thread-local storage going on or something, but either way it doesn't change the picture much.
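
(Again a reconstruction for illustration, not the exact test: on Linux, one can spawn sleeping threads and compare the resident-page count from /proc/self/statm before and after.)

    /* Sketch of a per-thread RSS measurement (Linux-only): spawn sleeping
       threads and compare resident pages reported by /proc/self/statm. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NTHREADS 1000

    static long resident_pages(void) {
        long res = 0;
        FILE *f = fopen("/proc/self/statm", "r");
        if (f) {
            if (fscanf(f, "%*ld %ld", &res) != 1) /* second field is resident pages */
                res = 0;
            fclose(f);
        }
        return res;
    }

    static void *sleeper(void *arg) { pause(); return arg; }

    int main(void) {
        long before = resident_pages();
        for (int i = 0; i < NTHREADS; i++) {
            pthread_t t;
            pthread_create(&t, NULL, sleeper, NULL);
            pthread_detach(t);
        }
        sleep(1); /* give the threads time to start and touch their stacks */
        long after = resident_pages();
        printf("~%.2f resident pages per thread\n",
               (double)(after - before) / NTHREADS);
        return 0;
    }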

**Though it should also be said that Apache, at that time, was using process-based concurrency rather than thread-based, which obviously makes a lot of difference.

Dolda2000
  • Possible duplicate of http://stackoverflow.com/questions/9964899/why-are-os-threads-considered-expensive – JimN Oct 24 '15 at 00:39
  • 1) Memory limitations: on Windows, for example, 1 MB of stack is needed for each thread; do you really want to thrash on the pagefile just to run each thread... and 2) the overhead of context switching might put a limit on your overall application performance, but only you can determine that. – Chris O Oct 24 '15 at 00:39
  • @ChrisO: Admittedly I don't know about Windows, but all POSIX systems I know of only allocate virtual address space for the stack, and allocate physical pages lazily, as I described in the question. As for performance, the main reason to use event-based programming is to handle requests that perform blocking operations anyway, so CPU bottlenecking doesn't seem to be the problem being solved. – Dolda2000 Oct 24 '15 at 00:41
  • So why would each thread be limited to just 1 page of memory? Are they doing nothing? – Chris O Oct 24 '15 at 00:43
  • @ChrisO: They aren't "limited" to 1 page of stack. I was just saying that most things probably don't use more than that anyway, unless their call-chains are relatively deep. – Dolda2000 Oct 24 '15 at 00:44
  • 1MB of stack is not NEEDED for each thread. 1MB is typically the default maximum stack limit specified by the linker and stored in the executable header, for later use by the process loader. If the process threads only need, say, 64k of stack as an estimated maximum, then setting the stack limit to 128k in the linker allows ten times more threads (see the stack-size sketch after these comments). Then, of course, as already said, stack is paged, so the scope for stack-thrashing is reduced if the used stack pages do not change much. – Martin James Oct 24 '15 at 01:17
  • @MartinJames Not true, on Windows at least, for each thread created, the VAS of your process gets dinged by 1MB. – Chris O Oct 24 '15 at 02:22
  • @ChrisO: True enough, but on 64-bit CPUs, the virtual address-space is virtually (pun unintended) unlimited anyway. – Dolda2000 Oct 24 '15 at 02:23
  • Certainly, but you still might care about thrashing on the pagefile (or not), depending on your app requirements. – Chris O Oct 24 '15 at 02:25
  • @ChrisO: But virtual address-space does not affect the pagefile. – Dolda2000 Oct 24 '15 at 02:26
  • But if you allocate and/or touch memory, it will be in RAM, which is a major limiting factor. If enough things are doing that, then everybody gets to take turns getting paged out and in. 64-bit processes can allocate to their heart's content, but everybody must still fight for the limited resource of physical RAM. – Chris O Oct 24 '15 at 02:30
  • @ChrisO: Yes, but the whole point is that you don't normally touch the stack that you don't use. – Dolda2000 Oct 24 '15 at 02:32
  • @ChrisO - not in my experience. On 32-bit Windows systems where the VAS is 2G, the number of threads I can raise is ~inversely proportional to the linker stack limit (unless recent Windows versions have changed in this respect - this was tried on XP). On 64-bit systems, the non-paged pool size seems to apply some constraints (not sure why). – Martin James Oct 24 '15 at 05:52
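
To make the stack-limit point from the comments concrete, here is a minimal sketch of capping the per-thread stack with pthread_attr_setstacksize; the 128 KiB figure is an arbitrary example, not taken from the discussion:

    /* Sketch: capping the per-thread stack at creation time. The 128 KiB
       limit is an arbitrary example; PTHREAD_STACK_MIN from <limits.h>
       is the lower bound. */
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg) { return arg; }

    int main(void) {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 128 * 1024); /* instead of the default */
        pthread_t t;
        if (pthread_create(&t, &attr, worker, NULL) != 0)
            perror("pthread_create");
        else
            pthread_join(t, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }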

3 Answers


If you have a thread for each request, then you can't do a little bit of work for each of 100 requests without switching contexts 100 times. While many things computers have to do have gotten faster over time, context switching is still expensive because it blows out the caches and modern systems are more dependent on these caches than ever.
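
One rough way to put a number on that cost, a sketch for illustration rather than code from the answer, is to ping-pong a byte between two threads over a pair of pipes, forcing a scheduler switch on every hop:

    /* Sketch: two threads ping-pong one byte over a pair of pipes, so
       every round trip forces (at least) two context switches. Run pinned
       to one CPU (e.g. taskset -c 0 ./a.out) for a cleaner number. */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define ROUNDS 100000

    static int ping[2], pong[2];

    static void *echo(void *arg) {
        char c;
        for (int i = 0; i < ROUNDS; i++) {
            read(ping[0], &c, 1);  /* wait for the main thread... */
            write(pong[1], &c, 1); /* ...and bounce the byte back */
        }
        return arg;
    }

    int main(void) {
        pipe(ping);
        pipe(pong);
        pthread_t t;
        pthread_create(&t, NULL, echo, NULL);
        struct timespec t0, t1;
        char c = 'x';
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ROUNDS; i++) {
            write(ping[1], &c, 1);
            read(pong[0], &c, 1);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        pthread_join(t, NULL);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.0f ns per round trip\n", ns / ROUNDS);
        return 0;
    }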

David Schwartz
  • Modern context switches don't blow out the caches, though. The OS may use parts of them for the scheduler and whatnot, of course, but any event-based system would also need parts of the caches for its book-keeping. Do you have a source that shows context-switches to be the problem? – Dolda2000 Oct 24 '15 at 00:43
  • I'd also add that it would seem that anything that does any meaningful amount of work would add a lot more context switches than those related to threading anyway, if only to open files, send network responses, and whatnot. – Dolda2000 Oct 24 '15 at 00:48
  • Honestly, I can't make enough sense of what you're saying to respond to whatever misconceptions you have. Why do you say context switches don't blow out caches? And why do you think sending network responses involves a context switch? I have no idea what "whatnot" is. Opening files might rarely involve a context switch, avoiding this cost is one of the reasons operating systems cache filesystem metadata. – David Schwartz Oct 24 '15 at 00:56
  • The only instance of context-switches blowing the caches that I know of are for old CPUs that use virtually-tagged caches (that therefore need emptying when the virtual address-space changes), and those haven't been popular in a long while. As for sending network responses, how do you do those other than calling into the kernel? – Dolda2000 Oct 24 '15 at 00:58
  • @Dolda2000 Oh, I see the problem. By "context switch", I mean what happens when you switch threads. You don't switch threads when you send a network response. In fact, a thread that calls into the network code a lot may well find the networking code hot in the cache on each call. The caches are blown out because they hold the stuff the other thread was doing, not the stuff this thread was doing. To get good performance, you want to keep one thread running as long as possible doing stuff that's as similar to what it was doing before as possible. – David Schwartz Oct 24 '15 at 00:59
  • Even so, a context-switch between threads isn't any different from a context-switch into the kernel as far as the CPU is concerned, is it? If anything, it should be quite benign, as it's still working in the same address-space. – Dolda2000 Oct 24 '15 at 01:00
  • It's completely different. When you switch between threads, the caches contain all the stuff the old thread was doing and do you no good. When a thread that does network I/O a lot calls into the kernel, the network I/O code will likely still be hot in the caches. These two cases are complete opposites. – David Schwartz Oct 24 '15 at 01:01
  • If the different threads are running the same code, they will certainly be sharing the I-cache and probably a lot of the data structures. If they are running different code, then the same cache pressure would seem to result from an event-based system as well. – Dolda2000 Oct 24 '15 at 01:02
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/93222/discussion-between-david-schwartz-and-dolda2000). – David Schwartz Oct 24 '15 at 01:03
  • I doubt that this difference of opinion can be settled without numbers. The FUD level on multiple threads is, in general, already far too high for a meaningful discourse :( – Martin James Oct 24 '15 at 01:30
  • @MartinJames: I agree. I, too, am mostly just speculating from theoretical points. I don't exclude a priori that this could be an issue, but I would indeed like to see some actual numbers confirming that. – Dolda2000 Oct 24 '15 at 01:36
  • @Dolda2000 for apps where the threads do not cycle through massive L1++ arrays and so cause thrashing on shared buses, I already have numbers that indicate that, for CPU-intensive tasks, my 4/8 core i7 peaks out in performance with ~32 threads, NOT the 8 that is typically suggested as the optimum. I gotta have numbers to convince me of anything re. multithread, or otherwise, overall performance. OTOH, if large arrays are sequentially accessed, the perf. of just 16 threads is very bad. Gotta have numbers from real testing! – Martin James Oct 24 '15 at 01:46

Once the number of ready or running threads (as opposed to threads pending on events) and/or processes exceeds the number of cores, those threads and/or processes are competing for the same cores, the same cache, and the same memory bus.

Unless there is a massive number of simultaneous events to wait on, I don't see the purpose of massively multithreaded code, except on supercomputers with a large number of processors and cores, where such code is usually massively multi-processing, with multiple memory buses.
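
As an illustration of keeping runnable threads at or below the core count, here is a minimal sketch assuming a POSIX system; the worker body is a hypothetical stub:

    /* Sketch: sizing a worker pool to the number of online cores, so
       runnable threads never exceed cores. The worker body is a stub
       standing in for a request-processing loop. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAX_WORKERS 64

    static void *worker(void *arg) {
        /* ...pull requests from a shared queue and process them... */
        return arg;
    }

    int main(void) {
        long ncores = sysconf(_SC_NPROCESSORS_ONLN);
        if (ncores < 1)
            ncores = 1;
        long n = ncores < MAX_WORKERS ? ncores : MAX_WORKERS;
        pthread_t pool[MAX_WORKERS];
        for (long i = 0; i < n; i++)
            pthread_create(&pool[i], NULL, worker, NULL);
        printf("started %ld workers for %ld online cores\n", n, ncores);
        for (long i = 0; i < n; i++)
            pthread_join(pool[i], NULL);
        return 0;
    }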

rcgldr
  • While this is true, it doesn't seem to me to be pertinent to the question. The problem being solved by event-based programming isn't CPU bottlenecking, but handling I/O blocking within the processing of a given request. Also, even event-based programs usually distribute their work over a thread-pool at least as large as the number of CPUs in the system, so it seems the contention would be the same either way. – Dolda2000 Oct 24 '15 at 00:55
  • No, you mean 'once the number of READY or RUNNING threads and/or processes are competing for the same cores, and the number of RUNNING threads and/or processes are competing for the same L2 and higher caches and memory buses'. – Martin James Oct 24 '15 at 01:25

That is a loaded question. I've heard different responses over time, having had that conversation many times with different developers. Mainly, my gut feeling is that most developers hate it because multi-threaded code is harder to write, and it is easy to shoot yourself in the foot unnecessarily. That said, each situation is different. Some programs lend themselves to multi-threading rather nicely, like a web server. Each thread can take a request and process it essentially without needing much in the way of outside resources. It has a set of procedures to apply to a request to decide how to process it; it decides what to do with it and passes it off. So it is fairly independent and can operate in its own world fairly safely. So it is a nice thread.

Other situations might not lend themselves so nicely, especially when you need shared resources. Things can get hairy fast. Even if you do what seems like perfect synchronization, you might still get race conditions. Then the nightmares begin. This is seen quite often in huge monolithic applications where the team opted to use threads and opened the gates of hell upon themselves.
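
The shared-counter race is the classic illustration of this; the sketch below (for illustration, not code from the answer) prints the expected total only because the mutex serializes the increments:

    /* Classic shared-counter race: two threads each increment the counter
       a million times. With the mutex held around the increment the result
       is exactly 2000000; comment the lock/unlock pair out and the
       non-atomic read-modify-write usually loses updates. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *bump(void *arg) {
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return arg;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, bump, NULL);
        pthread_create(&b, NULL, bump, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }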

In the end, I think we will probably not see more threading in day-to-day development; instead we will move to a more event-driven world. We are going down that route in web development with the emergence of micro-services. So there will probably be more threading used, but not in a way that is visible to the developer using the framework. It will just be a part of the framework. At least, that is my opinion.

Bassel Samman
  • 'it is harder to write multi-threaded code' - well, even that is debatable. Async code requires a user-space state machine, whereas the thread-per-connection model uses a kernel-space state machine. The result: thread-per-connection user code is straightforward and in-line, whereas async code is multiple callbacks. – Martin James Oct 24 '15 at 01:37
  • I agree that it is a loaded question, and I most definitely agree that certain programs lend themselves more straightforwardly to one model or the other. Nevertheless, the performance question should be possible to answer objectively in itself. – Dolda2000 Oct 24 '15 at 01:39
  • Hahaha, that is just funny. @martin, I think you misunderstood what I meant by multi-threaded code. I meant code where you specifically have to claim a mutex and release it around critical sections; I'm thinking of C++ or Java, where people tend to release the mutex too early, or too late, or have to worry about race conditions. There is a lot more to consider when the logic could be fairly simple if the code were not multi-threaded. By harder, I meant adding the fear of deadlocks and crashes. – Bassel Samman Oct 24 '15 at 01:47
  • I think a better question might be: how can we design systems that better lend themselves to having small, independent components that can run asynchronously of each other? To that, the response might be having some sort of event-based system or micro-services environment for large-scale apps. So, in a way, we are moving in that direction, but I agree, we are not there yet. – Bassel Samman Oct 24 '15 at 01:54