
I know this topic has been asked about before, and I have read almost all the related threads and comments, but I still haven't found the answer to my problem.

I'm working on a high-performance network library that must provide both a TCP server and a TCP client, has to be able to accept 30,000+ connections, and needs the highest possible throughput.

I know very well that I have to use async methods, and I have already implemented and tested every kind of solution I could find.

In my benchmarks only minimal code was used to avoid any overhead, I used profiling to minimize the CPU load, and there was no more room for simple optimization; on the receiving socket the buffered data was always read, counted and discarded, so the socket buffer never filled up completely.

The test case is very simple: one TCP socket listens on localhost, another TCP socket connects to the listening socket (from the same program, on the same machine, of course), and then an infinite loop starts sending 256 kB packets from the client socket to the server socket.

A timer with a 1000 ms interval prints the byte counters of both sockets to the console to make the bandwidth visible, then resets them for the next measurement.

I've found that the sweet spot for maximum throughput is a 256 kB packet size combined with a 64 kB socket buffer size.
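
To make the setup concrete, here is a minimal sketch of the kind of loopback benchmark described above (the port number and class names are placeholders I picked for illustration, and it uses the plain awaitable socket calls only to show the shape of the measurement, not the fastest variant from the results below):

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

class LoopbackBenchmark
{
    const int PacketSize = 256 * 1024;   // 256 kB packets (the sweet spot mentioned above)
    const int SocketBuffer = 64 * 1024;  // 64 kB socket buffers
    static long _sent, _received;

    static async Task Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, 9000);
        listener.Start();

        // Server side: read, count and discard everything that arrives.
        _ = Task.Run(async () =>
        {
            using var server = await listener.AcceptSocketAsync();
            server.ReceiveBufferSize = SocketBuffer;
            var buffer = new byte[PacketSize];
            while (true)
            {
                int n = await server.ReceiveAsync(new ArraySegment<byte>(buffer), SocketFlags.None);
                if (n == 0) break;
                Interlocked.Add(ref _received, n);
            }
        });

        // Timer with a 1000 ms interval: print the counters, then reset them.
        using var timer = new Timer(_ =>
        {
            Console.WriteLine($"TX {Interlocked.Exchange(ref _sent, 0) / 1048576} MB/s, " +
                              $"RX {Interlocked.Exchange(ref _received, 0) / 1048576} MB/s");
        }, null, 1000, 1000);

        // Client side: connect to the listener and push 256 kB packets in an endless loop.
        using var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        client.SendBufferSize = SocketBuffer;
        await client.ConnectAsync(IPAddress.Loopback, 9000);
        var packet = new byte[PacketSize];
        while (true)
        {
            int n = await client.SendAsync(new ArraySegment<byte>(packet), SocketFlags.None);
            Interlocked.Add(ref _sent, n);
        }
    }
}
```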

With the async/await type methods I could reach:

~370 MB/s (~3.2 Gbps) on Windows, ~680 MB/s (~5.8 Gbps) on Linux with Mono

With the BeginReceive/EndReceive/BeginSend/EndSend type methods I could reach:

~580 MB/s (~5.0 Gbps) on Windows, ~9 GB/s (~77.3 Gbps) on Linux with Mono

With the SocketAsyncEventArgs/ReceiveAsync/SendAsync type methods I could reach:

~1.4 GB/s (~12 Gbps) on Windows, ~1.1 GB/s (~9.4 Gbps) on Linux with Mono

The problems are the following:

  1. The async/await methods were the slowest, so I will not work with them.
  2. The BeginReceive/EndReceive methods start a new async thread, as do the BeginAccept/EndAccept methods, and under Linux/Mono every new socket instance was extremely slow (when there were no more threads in the ThreadPool, Mono started up new ones, but creating 25 connection instances took about 5 minutes, and creating 50 connections was impossible; the program just stopped doing anything after ~30 connections).
  3. Changing the ThreadPool size did not help at all, and I would not want to change it anyway (it was just a debugging move).
  4. The best solution so far is SocketAsyncEventArgs, which gives the highest throughput on Windows, but under Linux/Mono it is slower than on Windows, whereas before it was the opposite.

I've benchmarked both my Windows and my Linux machine with iperf:

The Windows machine produced ~1 GB/s (~8.58 Gbps), the Linux machine produced ~8.5 GB/s (~73.0 Gbps).

The weird thing is that on Windows iperf produced a weaker result than my application, while on Linux iperf's result is much higher than mine.

First of all, I would like to know whether these results are normal, or whether I can get better results with a different solution.

If I decide to use the BeginReceive/EndReceive methods (they produced the comparatively highest result on Linux/Mono), how can I fix the threading problem so that creating connection instances is fast, and eliminate the stalled state after creating multiple instances?

I will continue making further benchmarks and will share the results if there is anything new.

================================= UPDATE ==================================

I promised code snippets, but after many hours of experimenting the overall code is kind of a mess, so I'll just share my experience in case it helps someone.

I had to realize that under Windows 7 the loopback device is slow; I could not get a higher result than 1 GB/s with iperf or NTttcp. Only Windows 8 and newer versions have fast loopback, so I don't care about the Windows results anymore until I can test on a newer version. SIO_LOOPBACK_FAST_PATH should be enabled via Socket.IOControl, but it throws an exception on Windows 7.
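
For completeness, this is roughly how the fast path can be requested on Windows 8+ (the control code value is the documented SIO_LOOPBACK_FAST_PATH constant; it must be set before connecting/listening, and on Windows 7 and non-Windows platforms the call simply fails):

```csharp
using System;
using System.Net.Sockets;

static class LoopbackFastPath
{
    public static void TryEnable(Socket socket)
    {
        // SIO_LOOPBACK_FAST_PATH = 0x98000010
        const int SIO_LOOPBACK_FAST_PATH = unchecked((int)0x98000010);
        byte[] enable = BitConverter.GetBytes(1);   // 1 = turn the fast path on
        try
        {
            socket.IOControl(SIO_LOOPBACK_FAST_PATH, enable, null);
        }
        catch (SocketException)
        {
            // Windows 7 and older (and non-Windows platforms) do not support it.
        }
    }
}
```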

It turned out that the most powerful solution is the Completed-event-based SocketAsyncEventArgs implementation, both on Windows and on Linux/Mono. Creating a few thousand client instances never messed up the ThreadPool, and the program did not stop suddenly as I mentioned above. This implementation is very easy on the threading.
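
A minimal sketch of the Completed-event pattern I mean (class and field names are just illustrative, this is not my actual implementation); the important detail is that ReceiveAsync returns false when the operation completed synchronously, in which case the Completed event does not fire and the result has to be handled in place:

```csharp
using System.Net.Sockets;

class SaeaReceiver
{
    readonly Socket _socket;
    readonly SocketAsyncEventArgs _recvArgs = new SocketAsyncEventArgs();
    long _bytesReceived;   // counted and discarded, as in the benchmark

    public SaeaReceiver(Socket socket, int bufferSize)
    {
        _socket = socket;
        _recvArgs.SetBuffer(new byte[bufferSize], 0, bufferSize);
        _recvArgs.Completed += OnReceiveCompleted;   // attached once, never detached
        StartReceive();
    }

    void StartReceive()
    {
        // Loop while receives complete synchronously (ReceiveAsync returns false),
        // so repeated synchronous completions do not recurse.
        while (!_socket.ReceiveAsync(_recvArgs))
        {
            if (!HandleReceive(_recvArgs)) return;
        }
    }

    void OnReceiveCompleted(object sender, SocketAsyncEventArgs e)
    {
        if (HandleReceive(e))
            StartReceive();   // post the next receive
    }

    bool HandleReceive(SocketAsyncEventArgs e)
    {
        if (e.SocketError != SocketError.Success || e.BytesTransferred == 0)
        {
            _socket.Close();   // error or the peer closed the connection
            return false;
        }
        _bytesReceived += e.BytesTransferred;
        return true;
    }
}
```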

Creating 10 connections to the listening socket and feeding them data from 10 separate ThreadPool threads on the client side could produce ~2 GB/s of traffic on Windows and ~6 GB/s on Linux/Mono.

Increasing the client connection count did not improve the overall throughput, but the total traffic became distributed among the connections; this might be because the CPU load was at 100% on all cores/threads, whether with 5, 10 or 200 clients.

I think the overall performance is not bad: 100 clients could produce around ~500 Mbit/s of traffic each. (Of course this was measured on local connections; a real-life scenario over a network would be different.)

The only other observation I would share: experimenting with both the socket send/receive buffer sizes and the program read/write buffer sizes/loop cycles affected the performance greatly, and very differently on Windows and on Linux/Mono.

On Windows the best performance was reached with 128 kB socket-receive, 32 kB socket-send, 16 kB program-read and 64 kB program-write buffers.

On Linux the previous settings produced very weak performance; 512 kB socket-receive and socket-send, 256 kB program-read and 128 kB program-write buffer sizes worked best.
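
As an illustration only, the settings above can be applied roughly like this (the numbers are just the ones that happened to work best on my machines, and I am assuming RuntimeInformation is available to tell the platforms apart):

```csharp
using System.Net.Sockets;
using System.Runtime.InteropServices;

static class BufferTuning
{
    public static void Apply(Socket socket, out int programReadBuffer, out int programWriteBuffer)
    {
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
        {
            socket.ReceiveBufferSize = 128 * 1024;   // socket-receive
            socket.SendBufferSize = 32 * 1024;       // socket-send
            programReadBuffer = 16 * 1024;
            programWriteBuffer = 64 * 1024;
        }
        else   // Linux (Mono / .NET Core)
        {
            socket.ReceiveBufferSize = 512 * 1024;
            socket.SendBufferSize = 512 * 1024;
            programReadBuffer = 256 * 1024;
            programWriteBuffer = 128 * 1024;
        }
    }
}
```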

Now my only problem is that if I try to create 10,000 connecting sockets, after around 7005 it just stops creating the instances without throwing any exceptions, and the program keeps running as if there were no problem at all. I don't know how it can exit a specific for loop without a break, but it does.

Any help would be appreciated regarding anything I was talking about!

beatcoder
  • Unless you plan on using localhost with your final product, your test results are really meaningless. If this program will be running over the internet, you need to run the test over the internet to get the same kinds of overheads and latencies from all the pieces of hardware between the server and the client. – Scott Chamberlain Sep 05 '18 at 03:59
  • Also, without seeing your test code we can't say if a different solution would do better, because you never showed us, with code, what you are currently doing. Text descriptions of code are not detailed enough. – Scott Chamberlain Sep 05 '18 at 04:01
  • This is a well thought-out question; however, it is unanswerable to a large extent, unless someone happens to walk by in the next day who has benchmarked a 30,000-client socket solution with all of your approaches. Maybe this would be better for Code Review, with your test code. – TheGeneral Sep 05 '18 at 04:19
  • Scott Chamberlain - Thank you for your answer. I'll try to create simplified test code and share it. My question is mainly theoretical: I'd like to know which implementation fits which operating system better, whether there are known drawbacks (i.e. Mono under Linux cannot provide the Windows performance with SocketAsyncEventArgs, maybe because it has to simulate Windows events), whether I have to do something fundamentally different to reach the performance of iperf under Linux, or whether I have to manage the threading in a special way. – beatcoder Sep 05 '18 at 04:38
  • TheGeneral - Thank you for your comment. As I mentioned in my previous comment, my question is mainly theoretical. I'd like to know which method fits which operating system best, whether there are any known drawbacks because of the cross-platform support, and whether I have to manage the threading in a different way to avoid the horrible delays and the stalling problem. – beatcoder Sep 05 '18 at 04:42
  • Because you mentioned Mono, you might want to look into writing your program targeting the .NET Core framework; that framework [can be run under Linux natively.](https://learn.microsoft.com/en-us/dotnet/core/linux-prerequisites?tabs=netcore2x) – Scott Chamberlain Sep 06 '18 at 03:09
  • Scott - Thank you for the advice; I will definitely check out that framework, it seems very promising. – beatcoder Sep 10 '18 at 11:30
  • You've probably moved on with your life quite a way since this post, but I wondered how you used the .Completed event on SocketAsyncEventArgs? Not seeing your code, here are a couple of gotchas. I've just been looking through the MSDN example and I see a bug there where it adds event handlers but never clears them explicitly. Also, if you hang a ton of things off a single SocketAsyncEventArgs.Completed, you would also pay a penalty through how events get dispatched. – paulecoyote Jul 18 '20 at 21:13
  • @paulecoyote You should attach only one callback to the .Completed event per SocketAsyncEventArgs object, and should never detach it. You can reuse the SAEA object, and on dispose the handler should detach itself. Normally you create a class with OnAccept, OnReceive and OnSent methods and attach these 3 methods to every SAEA object, usually 2 objects per connection (RX/TX). You attach the same methods to every connection's objects and determine inside the method which connection called it. On closing the socket you either dispose of the SAEA object or save it for later reuse. – beatcoder Jul 22 '20 at 03:50
  • @beatcoder Seems like you got what you were looking for. If I may ask, just out of curiosity, what was your network library for? – Simple Fellow Jan 05 '21 at 15:31
  • @SimpleFellow - Sorry, I've only just noticed your comment. This library is still in progress, and probably always will be, but it is already functional and has its place in some services, mainly in web apps (serving HTTP/S and WebSocket) and a few other online services that require high-frequency, low-latency data exchange. – beatcoder Jun 20 '21 at 23:08
  • @beatcoder :) you replied. Thanks. I asked because at that time I also had high-performance requirements, though not quite that demanding. Kestrel with gRPC streaming was a better choice for me. – Simple Fellow Jun 22 '21 at 00:34
  • @SimpleFellow - Kestrel is not a bad choice; I've been digging through its code, but to be honest I didn't really like it. I think it is overcomplicated in a way, but it works fine, so it is not bad at all. I don't know how reliable it is; I've never tried it for any project. – beatcoder Jun 22 '21 at 00:48
  • Have you tried `ValueTask` with async/await? – John Nov 25 '21 at 03:14
  • @John - No, I haven't tried it since I wrote this question/answer. I was fairly satisfied with the results I already get with the .NET Core async calls under Linux, and on Windows it is also very fast, so I didn't need to implement new techniques; maybe next time, when I restart the project. Do you have any useful information on why ValueTask would be an improvement? – beatcoder Nov 27 '21 at 08:41
  • @beatcoder One reason is that `ValueTask` is fully async with no thread involvement (unlike `Task`), and it is a struct, so there is no GC overhead (with a little stack overhead instead); plus its method overrides use `Memory`, so it should give better performance compared to the `Task`-based overrides. – John Jan 18 '22 at 08:59
  • @John - Thank you for the explanation; it really seems promising. I had an eye on that class about a year ago, but didn't bother using it or looking up what it is good for. I've moved away from this socket project since it is working quite well, so if I rewrite it again I may use ValueTask. I will definitely benchmark it. – beatcoder Jan 22 '22 at 05:39

2 Answers


Because this question gets a lot of views, I decided to post an "answer", even though technically it isn't one; it is rather my final conclusion for now, so I will mark it as the answer.

About the approaches:

The async/await functions tend to produce awaitable async Tasks assigned to the TaskScheduler of the dotnet runtime, so having thousands of simultaneous connections, and therefore thousands of read/write operations, will start up thousands of Tasks. As far as I know, this creates thousands of state machines stored in RAM and countless context switches in the threads they are assigned to, resulting in very high CPU overhead. With a few connections/async calls it is balanced better, but as the awaitable Task count grows it gets slow exponentially.

The BeginReceive/EndReceive/BeginSend/EndSend socket methods are technically async methods with no awaitable Tasks, but with callbacks at the end of the call, which actually optimizes the multithreading more. Still, in my opinion, the dotnet design of these socket methods is rather limited, but for simple solutions (or a limited number of connections) it is the way to go.
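
For comparison, a minimal sketch of such a callback-based receive loop (illustrative only, with no error handling):

```csharp
using System;
using System.Net.Sockets;

class ApmReceiver
{
    readonly Socket _socket;
    readonly byte[] _buffer;

    public ApmReceiver(Socket socket, int bufferSize)
    {
        _socket = socket;
        _buffer = new byte[bufferSize];
        // Begin the first read; OnReceive fires on a ThreadPool/IO thread when data arrives.
        _socket.BeginReceive(_buffer, 0, _buffer.Length, SocketFlags.None, OnReceive, null);
    }

    void OnReceive(IAsyncResult ar)
    {
        int received = _socket.EndReceive(ar);   // completes the pending read
        if (received == 0)
        {
            _socket.Close();                     // the peer closed the connection
            return;
        }
        // ... process _buffer[0..received] here, then post the next read ...
        _socket.BeginReceive(_buffer, 0, _buffer.Length, SocketFlags.None, OnReceive, null);
    }
}
```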

The SocketAsyncEventArgs/ReceiveAsync/SendAsync type of socket implementation is the best on Windows for a reason. It utilizes Windows IOCP in the background to achieve the fastest async socket calls and uses overlapped I/O and a special socket mode. This solution is the "simplest" and fastest under Windows. But under Mono/Linux it will never be that fast, because Mono emulates Windows IOCP on top of Linux epoll, which by itself is actually much faster than IOCP, but the emulation needed for dotnet compatibility introduces some overhead.

About buffer sizes:

There are countless ways to handle data on sockets. Reading is straightforward: data arrives, you know its length, and you just copy the bytes from the socket buffer into your application and process them. Sending data is a bit different.

  • You can pass your complete data to the socket, and it will cut it into chunks and copy the chunks into the socket buffer until there is nothing more to send; the socket's send method returns when all the data has been sent (or when an error happens).
  • You can take your data, cut it into chunks yourself, and call the socket's send method with one chunk at a time until there is nothing left (a sketch of this follows below).
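
A sketch of the second approach, chunked sending (the chunk size would come from the buffer considerations below):

```csharp
using System;
using System.Net.Sockets;

static class ChunkedSender
{
    // Sends 'data' in fixed-size chunks; Socket.Send blocks until the chunk has been
    // copied into the socket's send buffer (or throws a SocketException on error).
    public static void SendChunked(Socket socket, byte[] data, int chunkSize)
    {
        int offset = 0;
        while (offset < data.Length)
        {
            int toSend = Math.Min(chunkSize, data.Length - offset);
            int sent = socket.Send(data, offset, toSend, SocketFlags.None);
            offset += sent;   // Send may accept fewer bytes than requested
        }
    }
}
```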

In either case you should consider what socket buffer size to choose. If you are sending a large amount of data, then the bigger the buffer is, the fewer chunks have to be sent, so fewer calls have to be made in your (or the socket's internal) loop, which means less memory copying and less overhead. But allocating large socket buffers and program data buffers results in large memory usage, especially if you have thousands of connections, and allocating (and freeing) large amounts of memory repeatedly is always expensive.

On the sending side a 1-2-4-8 kB socket buffer size is ideal for most cases, but if you are preparing to send large files (over a few MB) regularly, then a 16-32-64 kB buffer size is the way to go. Above 64 kB there is usually no point in going higher.

But this is only an advantage if the receiving side has relatively large receive buffers too.

Usually over internet connections (not the local network) there is no point in going above 32 kB; even 16 kB is ideal.

Going under 4-8 kB can result in an exponentially increased call count in the reading/writing loop, causing a large CPU load and slow data processing in the application.

Go under 4 kB only if you know your messages will usually be smaller than 4 kB, or only very rarely exceed it.

My conclusion:

Based on my experiments, the built-in socket classes/methods/solutions in dotnet are OK but not efficient at all. My simple Linux C test programs using non-blocking sockets could outperform the fastest, "high-performance" solution of dotnet sockets (SocketAsyncEventArgs).

This does not mean it is impossible to have fast socket programming in dotnet, but under Windows I had to make my own implementation of Windows IOCP by communicating directly with the Windows kernel via InteropServices/Marshaling, calling Winsock2 methods directly, using a lot of unsafe code to pass the context structs of my connections as pointers between my classes/calls, creating my own ThreadPool, creating I/O event handler threads, and creating my own TaskScheduler to limit the number of simultaneous async calls and avoid pointlessly many context switches.
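
To give an idea of what "communicating directly with the kernel" means here, these are the kind of P/Invoke declarations involved (only a tiny fragment as an illustration; the real implementation also needs the Winsock2 calls, OVERLAPPED handling and the dispatcher threads):

```csharp
using System;
using System.Runtime.InteropServices;

static class IocpNative
{
    // Creates an I/O completion port, or associates a socket/file handle with an existing one.
    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern IntPtr CreateIoCompletionPort(
        IntPtr fileHandle, IntPtr existingCompletionPort,
        UIntPtr completionKey, uint numberOfConcurrentThreads);

    // Blocks a dispatcher thread until a queued I/O completion packet arrives.
    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool GetQueuedCompletionStatus(
        IntPtr completionPort, out uint bytesTransferred,
        out UIntPtr completionKey, out IntPtr overlapped, uint milliseconds);
}
```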

This was a lot of work, with a lot of research, experimentation and testing. If you want to do it on your own, do it only if you really think it is worth it. Mixing unsafe/unmanaged code with managed code is a pain, but in the end it was worth it, because with this solution I could reach about 36,000 HTTP requests/sec with my own HTTP server on a 1 Gbit LAN, on Windows 7, with an i7 4790.

This is a level of performance that I could never reach with the built-in dotnet sockets.

When running my dotnet server on an i9 7900X on Windows 10, connected to a 4c/8t Intel Atom NAS on Linux via a 10 Gbit LAN, I can use the complete bandwidth (therefore copying data at 1 GB/s) no matter whether I have only 1 or 10,000 simultaneous connections.

My socket library also detects whether the code is running on Linux, and then instead of Windows IOCP (obviously) it uses Linux kernel calls via InteropServices/Marshalling to create and use sockets, and handles the socket events directly with Linux epoll, which managed to max out the performance of the test machines.

Design tip:

As it turned out, it is difficult to design a networking library from scratch, especially one that is meant to be universal for all purposes. You have to design it to have many settings, or tailor it to the task you need. This means finding the proper socket buffer sizes, the I/O processing thread count, the worker thread count, and the allowed async task count; these all have to be tuned for the machine the application runs on, for the connection count, and for the type of data you want to transfer through the network. This is why the built-in sockets do not perform that well: they must be universal, and they do not let you set these parameters.

In my case, assigning more than 2 dedicated threads to I/O event processing actually makes the overall performance worse, because only 2 RSS queues are in use, and more threads cause more context switching than is ideal.

Choosing the wrong buffer sizes will result in performance loss.

Always benchmark different implementations of the task you need to simulate, to find out which solution or setting is the best.

Different settings may produce different performance results on different machines and/or operating systems!

Mono vs Dotnet Core:

Since I've written my socket library in an FW/Core-compatible way, I could test it under Linux both with Mono and with a native Core compilation. Most interestingly, I could not observe any remarkable performance differences; both were fast, but of course leaving Mono and compiling with Core should be the way to go.

Bonus performance tip:

If your network card is capable of RSS (Receive Side Scaling), then enable it in Windows in the network device settings, under the advanced properties, and set the RSS queue count from 1 to as high as you can, or as high as is best for your performance.

If it is supported by your network card, it is usually set to 1, which makes the kernel assign network events to be processed by only one CPU core. If you can increase this queue count to a higher number, the network events will be distributed among more CPU cores, resulting in much better performance.

In Linux it is also possible to set this up, but in different ways; it is better to search for information on your Linux distro/LAN driver.

I hope my experience will help some of You!

beatcoder
  • Nothing wrong with doing what you did with this answer. I have had self-answered questions that went years until someone came by who had an actually good solution. – Scott Chamberlain Jul 09 '19 at 06:50
  • @ScottChamberlain | Thank you for the hint. I've been working on this library for more than a year now, and still am. I've posted many questions and rarely got any hint or answer, but I always get notifications about this question; it is very popular even though it was unanswered. I decided to share my over-a-year experience with the readers, hoping it will help them and they won't have to go down the same road I went in the last year. I marked it as an answer because it actually answers the questions in the original post. I'm really hoping this is OK and helpful. – beatcoder Jul 09 '19 at 07:02
  • As you have done such a big task, have you thought about sharing it in some way? A GitHub lib or something? – Noman_1 Dec 19 '20 at 23:51
  • @Noman_1 - Unfortunately I should not share it, since it is already the base of a few operational services and code sharing is not permitted due to security reasons and other policies. Even though I didn't plan on sharing my code, I still wanted to help others running into these issues, which is why I made this recap of my progress. I would still help anyone asking questions, because I've tried many ways of developing this library and may save time for others with my hints, but I wouldn't share the exact code we are using right now. – beatcoder Dec 22 '20 at 02:56
  • I think it would be of great interest to the dotnet community if you eventually made your code public at some point. The points you mention help, but would still require anyone to put in the same effort. – Nouman Qaiser Jul 13 '23 at 18:49

I had the same problem. You should take a look at NetCoreServer.

Every thread in the .NET CLR thread pool can handle one task at a time. So, to handle more async connects/reads etc., you have to change the thread pool size by using:

ThreadPool.SetMinThreads(Int32, Int32)
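
For example (the numbers are arbitrary and should be tuned, not copied):

```csharp
using System.Threading;

// Raise the minimum so the pool does not throttle thread creation
// while many callbacks are queued (1000/1000 is only an example).
ThreadPool.SetMinThreads(1000, 1000);
```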

Using EAP (the event-based asynchronous pattern) is the way to go on Windows. I would use it on Linux too, because of the problems you mentioned, and take the performance hit.

The best would be I/O completion ports on Windows, but they are not portable.

PS: when it comes to serializing objects, you are highly encouraged to use protobuf-net. It serializes objects to binary up to 10x faster than the .NET binary serializer and saves a little space too!
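
A minimal protobuf-net sketch (the type and member names are just examples):

```csharp
using System.IO;
using ProtoBuf;

[ProtoContract]
class Packet
{
    [ProtoMember(1)] public int Id { get; set; }
    [ProtoMember(2)] public byte[] Payload { get; set; }
}

static class PacketCodec
{
    public static byte[] Encode(Packet packet)
    {
        using var stream = new MemoryStream();
        Serializer.Serialize(stream, packet);            // protobuf-net binary serialization
        return stream.ToArray();
    }

    public static Packet Decode(byte[] data)
    {
        using var stream = new MemoryStream(data);
        return Serializer.Deserialize<Packet>(stream);
    }
}
```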

Martin.Martinsson
  • Thank you for the advice, but my project is already quite advanced; my web server is now capable of serving ~140k HTTP requests/sec on Windows and close to 200k/s on Linux on an average machine, so a 10 Gbps connection is already a bottleneck. I've also made my own JSON parser that can serialize/deserialize objects to JSON and binary and back; it is very handy, can handle streams and is about 1.6x faster than msgpack or protobuf. It took months of optimization and a lot of unmanaged code using the fastest buffers and SSE2+AVX2 operations. It was a hell of a job, but it was worth it. Still improving it, though. – beatcoder Sep 11 '20 at 12:27
  • I've come back to this answer just to add that changing the thread pool size actually doesn't help much; it was one of the first things I tried. When you have tens of thousands of concurrent connections, the thread pool behaves very differently on different operating systems. Even with the number set large, the program just stops for seconds to create new threads, and it is very bad to have hundreds or even thousands of them. My approach lets me set only a few threads for the different tasks, dedicated to IOCP, processing and the application layer; this way it works very efficiently, with much less context switching. – beatcoder Nov 20 '20 at 22:23
  • beatcoder - is there a way to contact you directly? –  May 17 '21 at 07:07
  • @Newcomer - Yes, why not, but I don't know the rules about sharing contacts here. Do you have something in mind? Email? Is there a dedicated way to exchange private messages here? – beatcoder Jun 20 '21 at 23:05