
My goal is to handle WebSocket connections inside threads. If I create a new Thread per connection, the number of WebSocket connections the server can handle is unknown. If I use a thread pool, the number of WebSocket connections the server can handle is the thread pool size.

I am not sure about the correlation between available processors and threads. Does 1 processor execute 1 thread at a time?

My expected answer: creating more threads than there are available processors is not advisable, and you should redesign how you handle the WebSocket connections.

In a new Thread:

```java
final Socket socket = serverSocket.accept();
new Thread(new WebSocket(socket, listener)).start();
```

In a Thread pool:

```java
final ExecutorService es =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

final Socket socket = serverSocket.accept();
es.execute(new WebSocket(socket, listener));
```

To avoid confusion: the WebSocket class is a custom class that implements Runnable. As far as I know, Java SE does not have a WebSocket server, only a WebSocket client.
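For reference, here is a self-contained sketch of the concern above (the `PoolCapDemo` class and the dummy sleeping tasks are illustrative stand-ins for the custom `WebSocket` runnable): with a fixed pool of N threads, at most N tasks run concurrently and the rest sit in the queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolCapDemo {
    public static void main(String[] args) throws Exception {
        final int poolSize = 2;
        final ExecutorService es = Executors.newFixedThreadPool(poolSize);
        final AtomicInteger running = new AtomicInteger();
        final AtomicInteger maxObserved = new AtomicInteger();
        final CountDownLatch done = new CountDownLatch(6);

        // Submit more tasks than the pool has threads.
        for (int i = 0; i < 6; i++) {
            es.execute(() -> {
                int now = running.incrementAndGet();
                maxObserved.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(100); // stand-in for a blocked WebSocket read
                } catch (InterruptedException ignored) {}
                running.decrementAndGet();
                done.countDown();
            });
        }
        done.await();
        es.shutdown();
        // Despite 6 submitted tasks, no more than poolSize ever ran at once.
        System.out.println("max concurrent tasks: " + maxObserved.get());
    }
}
```

All six tasks eventually complete, but the fixed pool caps how many "connections" are serviced simultaneously, which is exactly the trade-off the question describes.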

Remy Lebeau
Jason Rich Darmawan
  • https://stackoverflow.com/a/34689857/2478398 Give this answer and the one below it a read. But in effect ‘it depends’, and on a lot of things such as how CPU intensive the tasks you’re giving each thread are. – BeUndead Aug 16 '21 at 00:49
  • What exactly do you mean by "handle", and what environment are you running your sockets in? Is there a specific reason you're not using an existing platform (such as Spring) to handle these matters for you automatically? – chrylis -cautiouslyoptimistic- Aug 16 '21 at 00:49
  • @BeUndead assuming the concurrency concept and clever hardware design, `Executors.newFixedThreadPool(2000);` will not cause a problem because 1. the OS will schedule the threads one at a time (`Thread A -> Thread B -> Thread A`) and 2. CPUs nowadays are capable of executing 2 threads at the same time, so assuming I have 4 cores, the best thread pool size is 8 threads; I just have to redesign how to handle the WebSocket connections (e.g. 2000 keep-alive connections) in 8 threads. Am I right? – Jason Rich Darmawan Aug 16 '21 at 01:01
  • Ideally, make a threadpool (of whatever size your regular traffic is) and allow it to grow (and preferably shrink) to handle spikes, and let java+OS worry about the rest... But even more ideally, don’t do this yourself, use a library like @chrylis recommended above. These are ludicrously complicated topics, which take people years to figure out (I certainly haven’t) and even more years to implement well. – BeUndead Aug 16 '21 at 01:04
  • @chrylis-cautiouslyoptimistic- "handle" in this case, how do the server receive incoming packet data and sending packet data, currently I put each WebSocket connection in 1 unique thread, but this mean I have to create thread pool of 100 threads to handle 100 alive connection unlike HTTP request/response. Environment? I still use it locally on my laptop, I have not push it, probably using graalvm. I have not finished with thr SSL context yet. – Jason Rich Darmawan Aug 16 '21 at 01:13
  • @chrylis-cautiouslyoptimistic- For the specific reason: I don't know how Spring dependency injection works. It's like a black box; use `@SpringApplication` and a couple of this and that, and suddenly you have a WebSocket endpoint. But when I find a bug, reading the documentation is hell to go through, and when you want a specific feature, I am not sure how to implement it because the Spring Framework doesn't tell you which class a specific annotation uses, e.g. `@EnableWebSocketMessageBroker`. I spent weeks debugging and reading the Spring Framework without progress. – Jason Rich Darmawan Aug 16 '21 at 01:13
  • @BeUndead thanks, I will look into the thread pool design. I think implementing your idea is the more viable task to do, since redesigning my RFC 6455 handling would take a week or so. Regarding the Spring Framework, do you have a good tutorial on how a Spring annotation works, e.g. `@EnableWebSocketMessageBroker` and which classes the annotation creates, so I can better understand how the server works step by step? – Jason Rich Darmawan Aug 16 '21 at 01:21

1 Answer


Make threads. A thousand if you want.

At the CPU core level, here's what's happening:

  • The CPU core is chugging along, doing work for a given websocket.
  • Pretty soon the core runs into a road block: Half of an incoming bunch of data has arrived, the rest is still making its way down the network cable, and thus the CPU can't continue until it arrives. Alternatively, the code that the CPU core is running is sending data out, but the network card's buffer is full, so now the CPU core has to wait for that network card to find its way to sending another packet down the cable before there's room.
  • Of course, if there's work to do (say, you have 10 cores in the box, and 15 web users are simultaneously connected, that leaves at least 5 users of your web site waiting around right now) - then the CPU should not just start twiddling its thumbs. It should go do something.
  • In practice, then, there's a whole boatload of memory that WAS relevant that no longer is (all that memory that contained all the state and other 'working items' that were necessary to do the work for the websocket we were working on, but which is currently 'blocked' by the network), and a whole bunch of memory that wasn't relevant that now becomes relevant (all the state and working memory of a websocket connection that was earlier put in 'have yourself a bit of a timeout and wait around for the network packet to arrive' mode - for which the network packet has since arrived, so if a CPU core is free to do work, it can now go do work).
  • This is called a 'context switch', and it is ridiculously expensive, 500+ cycles worth. It is also completely unavoidable. You have to make the context switch. You can't avoid it. That means a cost is paid, and about 500 cycles worth just go down the toilet. It's what it is.

The thing is, there are two ways to pay that cost: you can switch to another thread, which is a full context switch. Or you can have a single thread running so-called 'async' code that manages all this stuff itself and hops to another job to do - but then there's still a context switch, because the working memory changes just the same.

Specifically, CPUs can't interact with main memory directly anymore these days and haven't for the past decade; they can only interact with a CPU cache page. Machine code is not really 'run directly' anymore; instead there's a level below that, where the CPU notices it's about to run an instruction that touches some memory and maps that memory access (after all, main memory is far too slow to wait for) to the right spot in the cache. It'll also notice if the memory you're trying to access with your machine code isn't in a cache page associated with that core at all, in which case it fires a page-miss interrupt that causes the memory subsystem of your CPU/memory bus to 'evict a page' (write it all back out to main memory) and then load in the right page, and only then does the CPU continue.

This all happens 'under the hood', you don't have to write code to switch pages, the CPU manages it automatically. But it's a heavy cost. Not quite as heavy as a thread switch but almost as heavy.

CONCLUSION: Threads are good, have many of them. It ensures CPUs won't twiddle their thumbs when there is work to do. Note that there are MANY blog posts that extol the virtues of async, claiming that threads 'do not scale'. They are wrong. Threads scale fine, and async code also pays the cost of context switching, all the time.
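As a rough, self-contained illustration of the "have many threads" point (the thread count and sleep duration here are arbitrary choices, and `Thread.sleep` stands in for blocking on a socket): a thousand platform threads that mostly block cost very little CPU, and because they all block in parallel, the total wall time stays close to one sleep interval rather than a thousand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ManyThreadsDemo {
    public static void main(String[] args) throws Exception {
        final int n = 1000;
        final AtomicInteger finished = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            Thread t = new Thread(() -> {
                try {
                    Thread.sleep(200); // stand-in for blocking on a socket read
                } catch (InterruptedException ignored) {}
                finished.incrementAndGet();
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) t.join();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // All 1000 "connections" were serviced; because they blocked in
        // parallel, the wall time is far closer to 200 ms than to 200 s.
        System.out.println("finished=" + finished.get()
                + " elapsedMs<2000=" + (elapsedMs < 2000));
    }
}
```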

In case you weren't aware, 'async code' is code that tries never to sleep (never do something that would wait). So, instead of writing `getMeTheNextBlockOfBytesFromTheNetworkCard`, you'd write `onceBytesAreAvailableRunThis(code goes here)`. Writing async code in Java is possible but incredibly difficult compared to using threads.
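A minimal sketch of the two styles, using `CompletableFuture` as the callback mechanism and `Thread.sleep` as a stand-in for waiting on the network (the `fetchBlocking`/`fetchAsync` names are illustrative, not from any library):

```java
import java.util.concurrent.CompletableFuture;

public class AsyncStyleDemo {
    // Blocking style: the calling thread sleeps until the value is ready.
    static String fetchBlocking() throws InterruptedException {
        Thread.sleep(50); // stand-in for waiting on the network card
        return "payload";
    }

    // Async style: return immediately; the callback runs once the value is ready.
    static CompletableFuture<String> fetchAsync() {
        return CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) {}
            return "payload";
        });
    }

    public static void main(String[] args) throws Exception {
        System.out.println("blocking got: " + fetchBlocking());
        fetchAsync()
            .thenApply(s -> "async got: " + s)   // "onceBytesAreAvailableRunThis"
            .thenAccept(System.out::println)
            .join(); // only here so the demo doesn't exit before the callback runs
    }
}
```

Note how the async version inverts control: instead of the code reading top to bottom, every subsequent step has to be chained as a callback, which is where the difficulty comes from.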

Even in the extremely rare cases where async code would be a significant win, Project Loom is close to completion, which will grant Java the ability to have thread-like things that you can manually manage (so-called fibers). That is the route the OpenJDK has chosen for this. In that sense, even if you think async is the answer: no, it's not. Wait for Project Loom to complete instead. If you want to read more, read "What color is your function?" and "callback hell". Neither post is Java-specific, but both cover some of the more serious problems inherent in async.
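For what the Loom route looks like, here is a sketch that assumes a Loom-capable JDK (virtual threads were eventually finalized in Java 21; this will not compile on older JDKs). A virtual thread blocks like a normal thread, but the JVM multiplexes many of them onto a few carrier OS threads, so thread-per-connection stays cheap:

```java
public class LoomSketch {
    public static void main(String[] args) throws Exception {
        // Looks and blocks like a platform thread, but is scheduled by the
        // JVM rather than 1:1 by the OS, so you can have millions of them.
        Thread vt = Thread.startVirtualThread(() -> {
            try { Thread.sleep(10); } catch (InterruptedException ignored) {}
            System.out.println("handled connection on "
                    + (Thread.currentThread().isVirtual()
                       ? "a virtual thread" : "a platform thread"));
        });
        vt.join();
    }
}
```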

rzwitserloot
  • Does this mean `CompletableFuture.runAsync( () -> { WebSocket websocket = new WebSocket(socket, listener); es.execute(websocket); });`? I use this after `final Socket socket = serverSocket.accept();` in case the server is under DDoS with a very long HTTP message body during the handshake, so other clients trying to connect will not be affected. Will using CompletableFuture make the Java application suffer from the context-switching cost? And could you please explain the magnitude of the `500+ cycles worth`? – Jason Rich Darmawan Aug 16 '21 at 02:11
  • async makes it much _easier_ to DDOS a server, not harder. That runAsync trick does mostly nothing. It certainly doesn't prevent DDoS attacks. You can't avoid the context switching. Period. Nothing can. – rzwitserloot Aug 16 '21 at 02:18
  • 500+ cycles, as in: The CPU could be doing 500 simple calculations. __But it is what it is__, you cannot avoid this cost no matter how hard you try. runAsync won't do it either. The point is: If you read a blogpost that says async avoids it, they are misunderstanding how CPUs work. – rzwitserloot Aug 16 '21 at 02:18
  • The reason async makes it easier: If you block _anywhere_ in an async block, that means the thread is twiddling its thumbs. If I figure out you forgot (e.g. you do a sync DB call somewhere), I just need to get into the 'block' mode on 10 simultaneous connections and your server's CPU is 100% frozen out, not responding to anything else. It is much, MUCH harder to loop a thread using a malicious request, and it's much easier to realize that's happening (as CPU load would jump to 100%). – rzwitserloot Aug 16 '21 at 02:20
  • Also, I think you're lacking a sense of perspective. Your CPU can run literally billions of ops a second. That 500 feels expensive, but in context, we're talking about a nanosecond. Just write the simplest code you can, make lots of threads, call blocking code, don't worry about accidentally blocking, and worry about inefficient algorithms instead. You can handle hundreds of simultaneous connections on the cheapest AWS virtual server that way, no problem. – rzwitserloot Aug 16 '21 at 02:23
  • That makes so much sense! I only used that `CompletableFuture` there because, assuming there are only 4 threads from `ForkJoinPool.commonPool`, I can imagine the opening handshake doing a Redis select to determine if it's a valid request, and that would block the async code. – Jason Rich Darmawan Aug 16 '21 at 02:30
  • Thanks, I will avoid CompletableFuture for now; I was never a fan of callbacks anyway. I am not a CS grad, so I am trying my best to understand the server-client model. Thanks for the clarification. – Jason Rich Darmawan Aug 16 '21 at 02:33
  • The cost of a context switch is not only the context switch itself, but also running into cold caches. And depending on the microarchitecture and the OS, a context switch can also lead to TLB evictions to prevent security problems such as those caused by Meltdown and others. – pveentjer Aug 16 '21 at 08:33