
I have implemented a client application that sends requests to a server. The way it works can be described very simply: I specify a number of threads, and each of these threads repeatedly sends a request to the server and waits for the answer.
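In code, each thread does roughly the following (a simplified sketch with placeholder host, port and payload, not my actual implementation):

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;

// Simplified sketch of the client: one blocking socket per thread.
public class BlockingClient {

    static void worker(String host, int port) {
        try (Socket socket = new Socket(host, port);
             DataOutputStream out = new DataOutputStream(socket.getOutputStream());
             DataInputStream in = new DataInputStream(socket.getInputStream())) {
            while (!Thread.currentThread().isInterrupted()) {
                out.writeInt(42);             // send a request (placeholder payload)
                out.flush();
                int reply = in.readInt();     // blocks ~1 ms waiting for the answer
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        int threads = args.length > 0 ? Integer.parseInt(args[0]) : 8; // "I specify a number of threads"
        for (int i = 0; i < threads; i++) {
            new Thread(() -> worker("server.example.com", 9000)).start();
        }
    }
}
```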

[Throughput plot: total client throughput vs. number of threads]

I have plotted the total throughput of the client for various numbers of threads. The exact number of virtual clients is not important; I am interested in the maximal, saturated performance at the very right of the graph.

I am surprised, because I did not expect the performance to scale with the number of threads. Indeed, most of the processor time is spent in blocking I/O in Java (blocking sockets), since the client-server communication has a 1 ms latency and the client is running on an 8-core machine.

I have looked for solutions online; this answer on Quora seems to imply that the time spent waiting on blocking I/O can be used by the scheduler for other tasks. Is it true, specifically for Java blocking sockets? In that case, why don't I get linear scaling with the number of threads?

In case it is important: I am running this application in the cloud. Also, this is part of a larger application, but I have identified this component as the bottleneck of the whole setup.

  • Using a thread-per-connection model does not scale well: it creates large stacks overall, and the context switching on the CPU can start to cause large delays. You can use a single thread with non-blocking IO for all the waits, which should be able to handle tens of thousands of connections (a rough sketch of that approach follows these comments). I'm hesitant to recommend a specific article, but if you need it I will. The reason it evens out is the amount of switching and how little work can be done during each switch. – Andrew T Finnell Dec 03 '18 at 22:31
  • I realize now that blocking i/o was not the optimal choice in my situation. I did not expect the context switch to cause such an overhead, and my plan was to increase the number of threads. However, this is part of a university course where understanding the performance is the most important, so I will not change the implementation for now. This suggestion will be helpful for next time. Thanks. – Raphael D. Dec 03 '18 at 23:15
  • I suppose each thread uses its own connection, and on the server end each connection is served by a dedicated thread. So the more client threads there are, the more server threads there are, and hence the throughput increases. – Alexei Kaigorodov Dec 04 '18 at 11:07
  • @AlexeiKaigorodov Quite a large number of server implementations do not use a thread-per-connection model as it quickly falls apart when connections get to the 10s of thousands. Perhaps in this case it is a thread-per-connection on both, but as stated by others, it can be much more performant to use non-blocking i/o and do all of it in a single thread. – Andrew T Finnell Dec 04 '18 at 18:27
  • @AndrewTFinnell it does not matter whether the server uses a thread-per-connection model or handles requests asynchronously with non-blocking I/O. In both cases, requests are processed in parallel, and that is what the client sees. – Alexei Kaigorodov Dec 05 '18 at 04:22
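For reference, a minimal sketch of the single-threaded non-blocking approach described in the comments above (host, port and request format are illustrative, not taken from the question; error handling and partial reads/writes are omitted):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// One thread multiplexing many non-blocking connections via a Selector.
public class NioClient {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        InetSocketAddress addr = new InetSocketAddress("localhost", 9000); // placeholder
        int connections = 100;                                             // placeholder

        for (int i = 0; i < connections; i++) {
            SocketChannel ch = SocketChannel.open();
            ch.configureBlocking(false);
            boolean connected = ch.connect(addr);
            ch.register(selector, connected ? SelectionKey.OP_WRITE : SelectionKey.OP_CONNECT);
        }

        ByteBuffer buf = ByteBuffer.allocate(4);
        while (true) {
            selector.select();                       // one blocking point for all sockets
            for (SelectionKey key : selector.selectedKeys()) {
                SocketChannel ch = (SocketChannel) key.channel();
                if (key.isConnectable() && ch.finishConnect()) {
                    key.interestOps(SelectionKey.OP_WRITE);
                } else if (key.isWritable()) {
                    buf.clear();
                    buf.putInt(42).flip();
                    ch.write(buf);                   // send a request
                    key.interestOps(SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    buf.clear();
                    ch.read(buf);                    // read the reply
                    key.interestOps(SelectionKey.OP_WRITE); // then issue the next request
                }
            }
            selector.selectedKeys().clear();
        }
    }
}
```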

1 Answer


I have looked for solutions online; this answer on Quora seems to imply that the time spent waiting on blocking I/O can be used by the scheduler for other tasks. Is it true, specifically for Java blocking sockets?

Regular Java threads map to OS-level threads one-to-one; they're equivalent. So yes, it's true of Java, and in fact of every other language, unless it's using green threads or non-blocking IO: while one thread is blocked on a socket, the OS simply schedules other threads onto the CPU.
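You can see this directly with two plain threads (a throwaway demo, not code from your application): one thread blocks on a socket read while the other keeps making progress.

```java
import java.net.ServerSocket;
import java.net.Socket;

// Demo: a thread blocked in a socket read() does not stop other threads.
public class BlockingIsPerThread {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0);   // accepts connections but never writes

        Thread blocked = new Thread(() -> {
            try (Socket s = new Socket("localhost", server.getLocalPort())) {
                s.getInputStream().read();           // parks in the OS, using no CPU
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        blocked.start();

        for (int i = 0; i < 5; i++) {                // meanwhile, this thread keeps running
            System.out.println("main is still making progress: " + i);
            Thread.sleep(100);
        }
        System.exit(0);                              // don't wait for the blocked thread
    }
}
```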

In that case, why don't I get linear scaling with the number of threads ?

Think about what you're doing from the perspective of the CPU. The CPU performs a costly context switch and allows some thread to run. That thread uses the CPU for a very short duration to prepare a network call, and then it blocks for a long time (milliseconds are quite a lot for a CPU working at 3GHz).

So each thread is doing only a tiny bit of work before another context switch is required. That means that a lot of the CPU's time is wasted on context switches instead of doing useful work.

Contrast that with a thread that's doing a CPU-bound task. The context switch takes the same time. But when a CPU-bound task is allowed to run, it manages to utilize the CPU for a long time, making the context-switch cheaper by comparison. This increases the overall CPU utilization.

So on the one hand, you see higher rates with every new thread because you're essentially performing more concurrent I/O operations. On the other hand, every new thread adds a cost, so the marginal benefit of each additional thread gets a bit smaller every time. If you keep adding threads, you'll eventually reach the point where the rate falls with each new thread.
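As a rough back-of-the-envelope model (the 1 ms latency and 8 cores are from your question; the ~0.01 ms per switch is just an assumed ballpark figure, see the comments below), you can compute where the curve should flatten out:

```java
// Crude model only: per-thread throughput is capped by the 1 ms round trip,
// total throughput is capped by how many switches the 8 cores can absorb.
// It ignores the eventual decline from cache pollution and scheduler overhead.
public class ThroughputModel {
    public static void main(String[] args) {
        double latencyMs = 1.0;    // request latency, from the question
        double switchMs  = 0.01;   // assumed cost of one context switch (ballpark)
        int cores = 8;

        for (int threads = 1; threads <= 4096; threads *= 2) {
            double latencyBound = threads * (1000.0 / latencyMs);     // req/s if switches were free
            double switchBound  = cores * (1000.0 / (2 * switchMs));  // assume ~2 switches per request
            System.out.printf("%4d threads -> ~%.0f req/s%n",
                    threads, Math.min(latencyBound, switchBound));
        }
    }
}
```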

  • This is very helpful. As a sanity check for the context-switching hypothesis, do you have any idea of the order of magnitude of a context-switch delay? For it to make a difference here, I assume it would have to be at least 0.1 ms. Also, I would be really grateful if you could point me to any monitoring tool that could show that effect. – Raphael D. Dec 03 '18 at 23:22
  • The answers here https://stackoverflow.com/questions/21887797/what-is-the-overhead-of-a-context-switch indicate roughly 0.01 ms, give or take an order of magnitude. That seems reasonable. – Raphael D. Dec 03 '18 at 23:31
  • That number heavily depends on the hardware and on the threads' working sets in cache: https://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html – Malt Dec 04 '18 at 08:46