TcpListener based application that does not scale up well

Question

I have an echo server application based on TcpListener. It accepts clients, reads the data, and returns the same data. I developed it with the async/await approach, using the XXXAsync methods provided by the framework.

I have set up performance counters to measure how many messages and bytes go in and out, and how many sockets are connected.

I have created a test application that starts 1400 asynchronous TcpClient instances, each sending a 1 KB message every 100-500 ms. Clients wait a random 10-1000 ms before starting, so they do not all try to connect at the same time. It works well: in PerfMon I can see the 1400 clients connected and sending messages at a good rate. I run the client app from another computer. The server's CPU and memory usage are very low; it is an Intel Core i7 with 8 GB of RAM. The client seems busier; it is an i5 with 4 GB of RAM, but it still does not even reach 25%.

The problem appears when I start a second client application. Connections start to fail in the clients. I do not see a huge increase in messages per second (roughly 20% more), but the number of connected clients stays around 1900-2100 rather than the expected 2800. Performance decreases a little, and the graph shows bigger variations between the maximum and minimum messages per second than before.

Still, CPU usage does not even reach 40% and memory usage remains low. I have tried increasing the number of thread pool threads in both client and server:

ThreadPool.SetMaxThreads(5000, 5000);
ThreadPool.SetMinThreads(2000, 2000);

In the server, the connections are accepted in a loop:

while(true)
{
    var client = await _server.AcceptTcpClientAsync();
    HandleClientAsync(client);
}

The HandleClientAsync function returns a Task, but as you see the loop does not wait for the handling, just continues to accept another client. That handling function is something like this:

public async Task HandleClientAsync(TcpClient client)
{    
    while(client.Connected && !_cancellation.IsCancellationRequested)
    {
        var msg = await ReadMessageAsync(client);
        await WriteMessageAsync(client, msg);
    }
}

Those two functions only read and write the stream asynchronously.
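
For comparison, a minimal line-based echo server that follows the same accept-and-handle pattern could be sketched as below. This is not the project's code: `LineEchoServer`, its `Port` property, and the use of `StreamReader`/`StreamWriter` are assumptions made for illustration. It detects disconnection through `ReadLineAsync` returning null instead of relying on `Connected`.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical minimal echo server; names are illustrative only.
public sealed class LineEchoServer
{
    private readonly TcpListener _server = new TcpListener(IPAddress.Loopback, 0);
    private readonly CancellationTokenSource _cancellation = new CancellationTokenSource();

    // Port chosen by the OS; valid after Start().
    public int Port { get { return ((IPEndPoint)_server.LocalEndpoint).Port; } }

    public void Start()
    {
        _server.Start();
        var ignored = AcceptLoopAsync(); // runs until Stop() is called
    }

    public void Stop()
    {
        _cancellation.Cancel();
        _server.Stop(); // causes the pending accept to fail, ending the loop
    }

    private async Task AcceptLoopAsync()
    {
        try
        {
            while (!_cancellation.IsCancellationRequested)
            {
                var client = await _server.AcceptTcpClientAsync().ConfigureAwait(false);
                // Not awaited: the loop goes straight back to accepting.
                var ignored = HandleClientAsync(client);
            }
        }
        catch (SocketException) { /* listener stopped */ }
        catch (InvalidOperationException) { /* listener stopped or disposed */ }
    }

    private static async Task HandleClientAsync(TcpClient client)
    {
        try
        {
            using (client)
            using (var stream = client.GetStream())
            using (var reader = new StreamReader(stream))
            using (var writer = new StreamWriter(stream) { AutoFlush = true })
            {
                string line;
                // ReadLineAsync returns null once the peer closes the connection.
                while ((line = await reader.ReadLineAsync().ConfigureAwait(false)) != null)
                {
                    await writer.WriteLineAsync(line).ConfigureAwait(false);
                }
            }
        }
        catch (IOException) { /* connection reset by the peer */ }
        catch (ObjectDisposedException) { /* socket closed while in use */ }
    }
}
```

A caller would create the server, call `Start()`, connect clients to `Port`, and call `Stop()` when done. Exceptions are swallowed inside the handler because the task it returns is never awaited.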

I have seen I can start the TCPListener indicating a backlog amount, but what is the default value?
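
As a small sketch of the backlog overload mentioned above: `TcpListener.Start(int backlog)` takes the maximum length of the pending-connections queue. The helper name `StartWithBacklog`, the loopback address, port 0 and the value 512 are all illustrative assumptions. As far as I can tell from the .NET Framework reference source, the parameterless `Start()` passes `(int)SocketOptionName.MaxConnections` and leaves the effective limit to the operating system, but that is worth verifying for the runtime in use.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

public static class BacklogExample
{
    // Starts a listener on the loopback interface with an explicit backlog.
    // Port 0 asks the OS for any free port (an assumption for this sketch).
    public static TcpListener StartWithBacklog(int backlog)
    {
        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start(backlog); // the OS may cap the requested value
        return listener;
    }
}
```

For example, `BacklogExample.StartWithBacklog(512)` returns a started listener whose `LocalEndpoint` reports the port the OS assigned.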

What could be the reason the app does not scale up until it reaches maximum CPU usage?

What approach and tools would help find out what the actual problem is?

UPDATE

I have tried the Task.Yield and Task.Run approaches, and they didn't help.

It also happens with server and client running locally on the same computer. Increasing the number of clients or messages per second actually reduces the service throughput: 600 clients each sending a message every 100 ms generate more throughput than 1000 clients each sending a message every 100 ms.

I see two exceptions on the client when connecting more than ~2000 clients. With around 1500 clients I see the exceptions at the beginning, but the clients eventually connect. With more than 1500 I see a lot of connections and disconnections:

"An existing connection was forcibly closed by the remote host" (System.Net.Sockets.SocketException)

"Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host." (System.IO.IOException)

UPDATE 2

I have set up a very simple project with server and client using async/await and it scales as expected.

The project where I have the scalability problem is this WebSocket server, and even though it uses the same approach, apparently something is causing contention. There is a console application hosting the component, and a console application to generate load (although it requires at least Windows 8).

Please note that I am not asking for the answer to fix the problem directly, but for the techniques or approaches to find out what is causing that contention.

"Connections start to fail in the clients" What error do you get at what location? — usr, Feb 25 '14 at 11:44
I do not remember exactly, but something like "Unable to connect, connection refused" and "Unable to read from transport connection". — vtortola, Feb 25 '14 at 12:01
Well, please do find out! At the moment we just know "there was an error, somewhere". — usr, Feb 25 '14 at 12:07
Besides the exact error info, tell us if (and how) you explicitly use thread pool, with `Task.Run`, `Task.Factory.StartNew` etc. — noseratio, Feb 25 '14 at 12:18
@Noseratio, I do not call `Task.Run` (or `Task.Factory.StartNew`), I am realizing that I was wrongly assuming that `async` methods would run the returned `Task` in other thread pool thread, but it is not as explained here: http://blog.stephencleary.com/2012/02/async-and-await.html . I will change `HandleClientAsync(client)` for `Task.Run(HandleClientAsync(client))` and try again. — vtortola, Feb 25 '14 at 12:39
@vtortola, my point was to *not* use `Task.Run`. This is what I meant: http://stackoverflow.com/a/21018042/1768303. Maybe you should show what your `HandleClientAsync` looks like. — noseratio, Feb 25 '14 at 12:49
@Noseratio I have updated the post. I will follow that example and check how it goes. — vtortola, Feb 25 '14 at 13:36
@vtortola: Are you ***absolutely*** sure you ***need*** to use TCP/IP? Because there are several problems with `HandleClientAsync`: it uses `Connected`, it reads without a simultaneous periodic write, and it writes without a simultaneous continuous read. TCP/IP is not unlike writing assembly language. In Klingon. Is there any possible way you could use WebAPI and/or SignalR instead? — Stephen Cleary, Feb 25 '14 at 13:47
@StephenCleary yes, absolutely sure. These are home projects I am doing to understand async/await and WebSockets, so it is just for fun. The read and write works with lines (\r\n), so it blocks till it gets a complete line or write a complete line, not sure if that is what you mean. — vtortola, Feb 25 '14 at 15:06
@vtortola: I strongly encourage you to choose another project to learn `async`/`await`. Learning TCP/IP is a monumental task in and of itself. And no, the blocking is not what I mean; with a read/write loop, you leave yourself open to the half-open problem. — Stephen Cleary, Feb 25 '14 at 15:11
@StephenCleary it has been fun so far, I may change my mind soon though haha. I have linked the code in the post. — vtortola, Feb 26 '14 at 12:40
@vtortola, is your test code a console app? Or anything else? — noseratio, Feb 26 '14 at 14:02
@vtortola I can't actually test it, since I'm not on Windows 8, but looking around, it might very well be that the async code isn't the culprit, and neither is TCP. Especially if the barebone TCP server works fine. In your code, I can see that you're doing operations that could potentially cause bad performance (`WebSocketFrameHeader` caught my eye in particular). However, this only makes it more important for you to profile the application. Going through byte arrays byte-by-byte, and without eliminating bounds checking, and xoring against another array... this could cause major cache misses. — Luaan, Feb 26 '14 at 14:10
@Luaan I will try to get a profiler and check. The XOR operation is required, but what bothers me is that the CPU is not fully used. If the CPU were fully used, I could assume the code is slow and start to optimize; but the low CPU and memory usage suggest to me that it is a contention problem. — vtortola, Feb 26 '14 at 14:19
@vtortola I'm not sure, but it might be possible that CPU idling due to cache misses might not be reported as CPU usage. I really don't know, though. Also, using the Concurrency Visualizer in Visual Studio is awesome for finding contentions due to multi-threading (including the GC). If you see that you're blocking 75% of the time, you're closer to the solution yet again :) — Luaan, Feb 26 '14 at 14:22
I think it could be related with the synchronization in "WriteInternalAsync" in the "WebSocketClient" class, but the only concurrency there is when a "ping" is executed, and even with the ping disabled, the problem remains. — vtortola, Feb 26 '14 at 14:42
I have done some improvement. I have removed a lot of `async` code in a way that I only `await` when I am not sure when something is going to happen, but when I am sure it is happening I progress synchronously. For example, I await the header of a WebSocket frame, but once the header is there, I proceed synchronously to read the rest and send an answer. I improved almost a 200%, and now I can handle around 4000 concurrent clients. Still far from using the full hardware though. I keep looking into it ... — vtortola, Feb 27 '14 at 00:32

score 6 · Accepted Answer · edited Jun 20 '20 at 09:12

I have managed to scale up to 6,000 concurrent connections without problems, processing around 24,000 messages per second, connecting from machine to machine (not a localhost test) and using only around 80 physical threads.

There are some lessons I learnt:

Increasing the thread pool size made things worse

Do not do it unless you know what you are doing.
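
Before changing the limits, it may help to look at what the pool is currently configured with. The sketch below only reads the values; `ThreadPoolInfo` and `Describe` are hypothetical names, and the numbers reported depend on the machine and runtime.

```csharp
using System;
using System.Threading;

public static class ThreadPoolInfo
{
    // Returns a one-line summary of the current thread pool limits and availability.
    public static string Describe()
    {
        int minWorker, minIo, maxWorker, maxIo, freeWorker, freeIo;
        ThreadPool.GetMinThreads(out minWorker, out minIo);
        ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
        ThreadPool.GetAvailableThreads(out freeWorker, out freeIo);

        return string.Format(
            "worker: min={0} max={1} available={2}; io: min={3} max={4} available={5}",
            minWorker, maxWorker, freeWorker, minIo, maxIo, freeIo);
    }
}
```

Logging `ThreadPoolInfo.Describe()` periodically under load gives a rough view of whether the pool is anywhere near its limits before touching `SetMinThreads` or `SetMaxThreads`.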

Call Task.Run or yield with Task.Yield

This ensures that you release the calling thread from having to run the rest of the method.
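
A small sketch of the effect, assuming a console application with no SynchronizationContext: an `async` method runs synchronously on the caller's thread up to its first incomplete `await`, and `Task.Yield` forces that point to come immediately. `YieldExample` and its method names are hypothetical.

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class YieldExample
{
    // Returns { startedOnCallerThread, resumedOnThreadPoolThread }.
    public static async Task<bool[]> HandleAsync(int callerThreadId)
    {
        // Everything before the first incomplete await runs on the calling thread.
        bool startedOnCaller = Thread.CurrentThread.ManagedThreadId == callerThreadId;

        // Task.Yield never reports itself as completed, so the caller gets its
        // Task back here and the remainder is scheduled as a continuation.
        await Task.Yield();

        // Without a SynchronizationContext the continuation is expected to run
        // on a thread pool thread.
        bool resumedOnPool = Thread.CurrentThread.IsThreadPoolThread;
        return new[] { startedOnCaller, resumedOnPool };
    }

    // Alternative: hand the whole handler to the thread pool at the call site.
    // Note the lambda: Task.Run needs a delegate, not an already started Task.
    public static Task<bool[]> HandleViaTaskRun(int callerThreadId)
    {
        return Task.Run(() => HandleAsync(callerThreadId));
    }
}
```

In an accept loop, either form lets the loop return to accepting connections without first running the synchronous head of the handler.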

ConfigureAwait(false)

If you are confident your executable is not running in a single-threaded synchronization context, this allows any thread to pick up the continuation rather than waiting specifically for the thread that started the operation to become free.
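
As a minimal sketch, this is what it looks like on a single read. `ConfigureAwaitExample` and `ReadSomeAsync` are hypothetical names; any `Stream` works, so a `MemoryStream` is enough to try it out.

```csharp
using System.IO;
using System.Threading.Tasks;

public static class ConfigureAwaitExample
{
    // Reads up to buffer.Length bytes and returns the number of bytes read.
    public static async Task<int> ReadSomeAsync(Stream stream, byte[] buffer)
    {
        // ConfigureAwait(false): do not capture the current SynchronizationContext,
        // so the continuation may resume on whichever thread completes the read.
        return await stream.ReadAsync(buffer, 0, buffer.Length).ConfigureAwait(false);
    }
}
```

The setting applies per `await`, so it has to be repeated on each awaited call inside the handling code.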

Byte[]

The memory profiler showed that the app was spending too much memory and time creating Byte[] instances, so I designed several strategies to reuse the available ones, or to work "in place" rather than create new ones and copy. The GC performance counters (specifically "% Time in GC", which was around 55%) raised the alarm that something was not right. I was also using BitArray instances to check bits in bytes, which caused some memory overhead as well, so I replaced them with bitwise operations and it improved. Later on I discovered that WCF uses a Byte[] pool to cope with this problem.
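
The sketch below illustrates the three ideas on a WebSocket payload: testing a header bit with a bitwise AND instead of a BitArray, unmasking in place, and renting a buffer instead of allocating one. It is not the code from the project. `FrameBuffers` and its methods are hypothetical, and `ArrayPool<byte>` from `System.Buffers` is a newer API than this post, used here as a stand-in for a hand-written pool.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class FrameBuffers
{
    // FIN is the most significant bit of the first WebSocket header byte.
    public static bool IsFinal(byte firstHeaderByte)
    {
        return (firstHeaderByte & 0x80) != 0;
    }

    // XORs the payload with the 4-byte masking key in place, with no extra array.
    public static void UnmaskInPlace(byte[] buffer, int offset, int count, byte[] maskKey)
    {
        for (int i = 0; i < count; i++)
        {
            buffer[offset + i] ^= maskKey[i % 4];
        }
    }

    // Copies a masked text payload into a rented buffer, unmasks it there
    // and decodes it as UTF-8. The rented buffer goes back to the pool afterwards.
    public static string DecodeMaskedText(byte[] masked, byte[] maskKey)
    {
        byte[] rented = ArrayPool<byte>.Shared.Rent(masked.Length);
        try
        {
            Buffer.BlockCopy(masked, 0, rented, 0, masked.Length);
            UnmaskInPlace(rented, 0, masked.Length, maskKey);
            // The rented array may be longer than requested, so pass the real length.
            return Encoding.UTF8.GetString(rented, 0, masked.Length);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```

Because a rented array can be larger than the size requested, every operation on it has to carry the actual payload length rather than use `rented.Length`.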

Asynchronous does not mean `fast`

Asynchronous code scales nicely, but it has a cost. Just because an asynchronous operation is available does not mean you should use it. Use asynchronous programming when you expect to spend some time waiting before getting the actual response. If you are sure the data is there or the response will be quick, proceed synchronously.
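
One way to express this, sketched under the assumption that the handler reads from a `NetworkStream`: check `DataAvailable` and read synchronously when bytes are already buffered, falling back to `ReadAsync` only when the read would have to wait. `HybridRead` and `ReadSomeAsync` are hypothetical names, not part of the project.

```csharp
using System.Net.Sockets;
using System.Threading.Tasks;

public static class HybridRead
{
    // Returns the number of bytes read into buffer (0 means the peer closed the connection).
    public static async Task<int> ReadSomeAsync(NetworkStream stream, byte[] buffer)
    {
        if (stream.DataAvailable)
        {
            // Data is already in the socket buffer: a synchronous read returns
            // immediately and skips the asynchronous machinery.
            return stream.Read(buffer, 0, buffer.Length);
        }

        // Nothing to read yet: wait asynchronously so no thread is blocked.
        return await stream.ReadAsync(buffer, 0, buffer.Length).ConfigureAwait(false);
    }
}
```

Whether this pays off depends on the workload and the runtime, so it is worth measuring before and after.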

Supporting sync and async is tedious

You have to implement the methods twice; there is no bulletproof way of reusing async code from sync code.

+1 for the research, I'd be interested if you tried [this optimization](http://stackoverflow.com/a/22237307/1768303). — noseratio, Mar 06 '14 at 22:31

Luaan · Answer 2 · 2014-02-26T10:19:41.753

0

~~Well, for one, you're running everything on one thread, so changing the ThreadPool isn't going to make any difference.~~

EDIT: As Noseratio pointed out, this is not actually true. While IOCP and the asynchronous socket itself don't actually require additional threads for I/O requests, the default implementation in .NET does. The completion event is processed on a ThreadPool thread, and it is your responsibility to either supply your own TaskScheduler, or queue the event and process it manually on a consumer thread. I'm going to leave the rest of the answer, because it's still relevant (and the thread switching isn't a performance issue here, as described later in the answer). Also note that the default TaskScheduler in a UI application usually does use a synchronization context, so in, e.g., WinForms, the completion event would be processed on the UI thread. In any case, throwing more threads than CPU cores at the problem isn't going to help.

However, this isn't necessarily a bad thing. I/O bound operations don't benefit from being run on a separate thread, in fact, it's very inefficient to do so. That's exactly what async and IOCP is for, so keep using it.

If you're starting to get significant CPU usage, that's where you want to make things parallel, as opposed to simply asynchronous. Still, receiving the messages on one thread using await should be just fine. Handling multi-threading is always tricky, and there are lots of approaches for different situations. In practice, you usually don't want more threads than you have processor cores available - if they're competing for I/O, use async. If they're competing for CPU, that's only going to get worse with more threads than the CPU can process in parallel.

~~Note that since you're running on one thread, one of your processor cores might very well be running at 100%, while the rest do nothing. You can verify this in task manager easily.~~

Also, note that the number of TCP connections you can have open at one time is very much limited. Each connection has to have its own ports on both the client and the server. The default values for client Windows are somewhere along the lines of 1000-4000 ports for that. That's not a lot for a server (nor for your load-testing clients).

If you open and close connections as well, this gets even worse, because TCP ports are guaranteed to be open for some time (up to four minutes after being disconnected). This is because opening a new TCP connection on the same port might mean that data for the old connection might arrive on the new connection, which would be very, very bad.

Please add more information. What do ReadMessageAsync and WriteMessageAsync do? Is it possible that the performance impact is caused by GC? Have you tried profiling the CPU and memory? Are you sure you're not actually exhausting the network bandwidth with all those TCP messages? Have you checked if you're experiencing TCP port exhaustion, or high packet loss scenarios?
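
One rough way to check for port exhaustion from code is to count the machine's TCP connections by state; a large number in `TimeWait` would point in that direction. This is only a sketch: `TcpStateReport` and `CountByState` are hypothetical names, and the counts cover every process on the machine, not just the test application.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net.NetworkInformation;

public static class TcpStateReport
{
    // Groups the machine's active TCP connections by state (Established, TimeWait, ...).
    public static Dictionary<TcpState, int> CountByState()
    {
        return IPGlobalProperties.GetIPGlobalProperties()
            .GetActiveTcpConnections()
            .GroupBy(connection => connection.State)
            .ToDictionary(group => group.Key, group => group.Count());
    }
}
```

Calling `CountByState()` before, during and after a load test and comparing the `TcpState.TimeWait` entry shows how many ports remain tied up by recently closed connections.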

UPDATE: I've written a test server and client, and they can exhaust the available TCP ports in under a second, including all initializations, when using asynchronous sockets. I'm running this on localhost, so each client connection actually takes two ports (one for server, one for client), so it's somewhat faster than when the client is on a different machine. In any case, it's obvious that the issue in my case is TCP port exhaustion.

edited Feb 26 '14 at 10:19

answered Feb 25 '14 at 14:40

Luaan


CPU and memory usage is very low; in Task Manager the server rarely reaches 30%, usually around 15% with the 1400 clients, and memory is less than 100 MB; I do not recall the exact figures in PerfMon. The two machines, server and client, are on a 54 Mb wireless network, consuming around 1 Mb per second according to Task Manager. Not ideal conditions, but I don't think they are the cause. – vtortola Feb 25 '14 at 15:17
@vtortola Yeah, but 30% CPU on a four-core machine could still mean that the one core you're actually using is at 100%. Can you check the cpu usage per-core to make sure this isn't the case? – Luaan Feb 25 '14 at 15:22
I don't recall seeing any of the 8 logical CPUs at full load, but I will check again. – vtortola Feb 25 '14 at 15:38
*Well, for one, you're running everything on one thread...* This is **incorrect**. Each time after `await` the OP's code will run on a different IO-completion port thread from `ThreadPool`. He doesn't need `Task.Run` because he doesn't do any CPU-bound work. – noseratio Feb 25 '14 at 20:51
Load is spread across processors and I have seen the process go over the 29% of 8 processors, so it is not concentrating in one. – vtortola Feb 25 '14 at 22:13
@Noseratio The fact that many operations can run out of sequence is unrelated to whether they can work in parallel on the CPU. There is no thread. You should read up on how IOCP works :) http://blog.stephencleary.com/2013/11/there-is-no-thread.html It's awesome, and lot of the work is way below the OS level - it gets to the CPU only through a hardware interrupt. The MSDN documentation is not entirely easy to read - http://msdn.microsoft.com/en-us/library/windows/desktop/aa365198(v=vs.85).aspx: "In other words, a single thread can be associated with, at most, one I/O completion port." – Luaan Feb 26 '14 at 07:53
@vtortola Well, do try profiling with sampling, that's a good start. You might find out that your issues are caused by a forgotten sleep somewhere, or some action you didn't recognize as "long". Memory profiling is also useful in other ways. In any case, you have to find out where and why your code is getting stuck. So fire up the profilers, you have to get that skill up and running eventually anyway :D – Luaan Feb 26 '14 at 08:01
@vtortola Also, once more - show us your `ReadMessageAsync` and `WriteMessageAsync` code. And the code for the client too. – Luaan Feb 26 '14 at 08:20
@Noseratio The only point at which there is an IOCP thread is when the socket I/O completes. Its only job is to queue the task for completion. It does as little work as possible. The point is, there aren't 2000 threads like the OP seems to think (eg. setting the ThreadPool size), and the CPU work done by the APC thread is minimal. Yes, the application itself has 10 threads before you even enter the `Main` method, but that doesn't matter at all. The OP *is* running everything on one thread. The IOCP thread is a hidden overhead, and it only matters if you're starving the ThreadPool in your code. – Luaan Feb 26 '14 at 08:40

@Luaan, incorrect again. The OP is always running **at least one thread**, where the core `while (true)` loop runs and accepts connections with `AcceptTcpClientAsync`. Now imagine one of the pending `ReadMessageAsync` has completed on a random IOCP thread. The OP code after `await ReadMessageAsync` continues executing on that thread. Whatever is inside `WriteMessageAsync` will be executing there. We've got **two threads** so far. Suddenly, another pending `ReadMessageAsync` completes: **three threads**, concurrently. And so on... – noseratio Feb 26 '14 at 08:49
An IOCP thread is perfectly fine to run the code which is processing the socket response message (and there's always some code doing just that). Offloading such code to another pool thread doesn't make sense. – noseratio Feb 26 '14 at 08:53
@Luaan: in response to your *"There is no thread. You should read up on how IOCP works"*. I deleted my original reply as it looked like I'm telling you to read up, while I was quoting your original wording, which is still there. – noseratio Feb 26 '14 at 09:46

@Noseratio Oh boy. Fooled by specific implementation of abstractions again. Default task scheduler and no synchronization context, yikes. You're right, the OP is actually executing the code on many, many threads. And I thought *I* was the one debating an "expert beginner". Sorry :) I'll update my answer to reflect this. – Luaan Feb 26 '14 at 10:19
@Luaan, I removed my down-vote. I'm not sure I entirely understand the rest of your answer, hopefully you'll get more feedback from others. – noseratio Feb 26 '14 at 10:30
