I know this topic has been asked several times, and I have read almost all the threads and comments, but I still haven't found the answer to my problem.
I'm working on a high-performance network library that must provide a TCP server and client, must be able to accept 30000+ connections, and needs throughput as high as possible.
I know very well that I have to use async methods, and I have already implemented and tested every kind of solution I could find.
In my benchmarking I used only minimal code to avoid any overhead, and I profiled it to minimize the CPU load; there is no more room for simple optimization. On the receiving socket the buffered data was always read, counted and discarded so that the socket buffer never filled up completely.
The case is very simple: one TCP socket listens on localhost, another TCP socket connects to the listening socket (from the same program, on the same machine of course), then an infinite loop sends 256kB packets from the client socket to the server socket.
A timer with a 1000ms interval prints a byte counter for both sockets to the console to make the bandwidth visible, then resets the counters for the next measurement.
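The counter-and-timer part of that setup can be sketched roughly as follows. This is a minimal illustration rather than my actual benchmark code: the class name `ThroughputMeter` and its members are hypothetical, and the unit formatting assumes decimal units (1 MB = 10^6 bytes), which is an assumption of the sketch.

```csharp
using System;
using System.Globalization;
using System.Threading;

// Hypothetical helper: accumulates byte counts from any thread and
// prints and resets the total on a fixed interval.
public sealed class ThroughputMeter : IDisposable
{
    private long _bytes;
    private readonly Timer _timer;

    public ThroughputMeter(string label, int intervalMs = 1000)
    {
        _timer = new Timer(_ =>
        {
            // Scale the interval total to a per-second rate.
            double perSecond = TakeAndReset() * 1000.0 / intervalMs;
            Console.WriteLine(label + ": " + Format(perSecond));
        }, null, intervalMs, intervalMs);
    }

    // Called by the send/receive code after each completed operation.
    public void Add(int count)
    {
        Interlocked.Add(ref _bytes, count);
    }

    // Atomically reads the counter and sets it back to zero.
    public long TakeAndReset()
    {
        return Interlocked.Exchange(ref _bytes, 0);
    }

    // Assumption: decimal units (1 MB = 10^6 bytes, 1 Gbit = 10^9 bits).
    public static string Format(double bytesPerSecond)
    {
        return string.Format(CultureInfo.InvariantCulture,
            "{0:F1} MB/s ({1:F2} Gbit/s)",
            bytesPerSecond / 1e6, bytesPerSecond * 8 / 1e9);
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}
```

One meter per socket would be created up front, and `Add` would be called with the byte count of every completed send or receive.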
I've found that the sweet spot for maximum throughput is a 256kB packet size with a 64kB socket buffer size.
With the async/await type methods I could reach ~370MB/s (~3.2Gbps) on Windows, ~680MB/s (~5.8Gbps) on Linux with mono.
With the BeginReceive/EndReceive/BeginSend/EndSend type methods I could reach ~580MB/s (~5.0Gbps) on Windows, ~9GB/s (~77.3Gbps) on Linux with mono.
With the SocketAsyncEventArgs/ReceiveAsync/SendAsync type methods I could reach ~1.4GB/s (~12Gbps) on Windows, ~1.1GB/s (~9.4Gbps) on Linux with mono.
The problems are the following:
- The async/await methods were the slowest, so I will not work with them.
- The BeginReceive/EndReceive methods started new async threads, and together with the BeginAccept/EndAccept methods every new socket instance was extremely slow under Linux/mono. When there were no more threads in the ThreadPool, mono started new threads, but creating 25 connection instances took about 5 minutes, and creating 50 connections was impossible (the program just stopped doing anything after ~30 connections).
- Changing the ThreadPool size did not help at all, and I would not change it anyway (it was just a debug move).
- The best solution so far is SocketAsyncEventArgs, which gives the highest throughput on Windows, but on Linux/mono it is slower than on Windows, whereas with the other methods it was the opposite.
I've benchmarked both my Windows and Linux machines with iperf: the Windows machine produced ~1GB/s (~8.58Gbps), the Linux machine produced ~8.5GB/s (~73.0Gbps).
The weird thing is that on Windows iperf produced a weaker result than my application, but on Linux its result is much higher.
First of all, I would like to know whether these results are normal, or whether I can get better results with a different solution.
If I decide to use the BeginReceive/EndReceive methods (they produced the highest result on Linux/mono), how can I fix the threading problem so that creating connection instances is fast and the stall after creating multiple instances goes away?
I will continue benchmarking and will share any new results.
================================= UPDATE ==================================
I promised code snippets, but after many hours of experimenting the overall code is kind of a mess, so I will just share my experience in case it can help someone.
I had to realize that under Windows 7 the loopback device is slow: I could not get a higher result than 1GB/s with iperf or NTttcp. Only Windows 8 and newer versions have a fast loopback, so I don't care about the Windows results anymore until I can test on a newer version. SIO_LOOPBACK_FAST_PATH should be enabled via Socket.IOControl, but that throws an exception on Windows 7.
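For reference, a guarded attempt to enable the fast path might look like the sketch below. The helper name `LoopbackFastPath.TryEnable` is hypothetical; the control code value `0x98000010` is the one defined for SIO_LOOPBACK_FAST_PATH in the Windows headers. As far as I understand, the option has to be set on both sockets before they connect or listen, and the sketch simply swallows the failure on systems that do not support it.

```csharp
using System;
using System.Net.Sockets;

// Hypothetical helper: tries to switch a socket to the Windows
// loopback fast path and reports whether the call was accepted.
public static class LoopbackFastPath
{
    // SIO_LOOPBACK_FAST_PATH control code from the Windows SDK headers.
    private const int SIO_LOOPBACK_FAST_PATH = unchecked((int)0x98000010);

    public static bool TryEnable(Socket socket)
    {
        try
        {
            // Input is a 32-bit boolean flag set to 1; no output buffer.
            socket.IOControl(SIO_LOOPBACK_FAST_PATH, BitConverter.GetBytes(1), null);
            return true;
        }
        catch (SocketException)
        {
            // Older Windows versions (such as Windows 7) reject the code.
            return false;
        }
        catch (NotSupportedException)
        {
            // Non-Windows runtimes may not support this control code at all.
            return false;
        }
    }
}
```

Calling `LoopbackFastPath.TryEnable(socket)` right after constructing each socket keeps the rest of the benchmark unchanged on platforms where the call fails.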
It turned out that the most powerful solution is the Completed-event-based SocketAsyncEventArgs implementation, both on Windows and on Linux/Mono. Creating a few thousand client instances never messed up the ThreadPool, and the program did not stop suddenly as I mentioned above. This implementation is very gentle on threading.
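The receiving side of such an implementation can be sketched like this. It is a simplified illustration, not my exact code: the class name `DiscardingReceiver` and the `onBytes` callback are hypothetical. The important detail is that `ReceiveAsync` returns `false` when the operation completed synchronously, in which case the `Completed` event is not raised and the result has to be handled inline.

```csharp
using System;
using System.Net.Sockets;

// Hypothetical helper: reads from a connected socket with a single
// reusable SocketAsyncEventArgs, reports byte counts and discards data.
public sealed class DiscardingReceiver
{
    private readonly Socket _socket;
    private readonly Action<int> _onBytes;
    private readonly SocketAsyncEventArgs _args = new SocketAsyncEventArgs();

    public DiscardingReceiver(Socket connectedSocket, int readBufferSize, Action<int> onBytes)
    {
        _socket = connectedSocket;
        _onBytes = onBytes;
        _args.SetBuffer(new byte[readBufferSize], 0, readBufferSize);
        _args.Completed += OnCompleted;
    }

    public void Start()
    {
        Pump();
    }

    // Raised only when a receive completed asynchronously.
    private void OnCompleted(object sender, SocketAsyncEventArgs e)
    {
        if (Handle(e)) Pump();
    }

    // Loops (instead of recursing) while receives complete synchronously.
    private void Pump()
    {
        while (!_socket.ReceiveAsync(_args))
        {
            if (!Handle(_args)) return;
        }
    }

    // Returns false when the connection is finished or failed.
    private bool Handle(SocketAsyncEventArgs e)
    {
        if (e.SocketError != SocketError.Success || e.BytesTransferred == 0)
        {
            _socket.Close();
            return false;
        }
        _onBytes(e.BytesTransferred); // count the data, then drop it
        return true;
    }
}
```

No thread is blocked while a receive is pending, which is probably why thousands of these instances do not exhaust the ThreadPool.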
Creating 10 connections to the listening socket and feeding data to the clients from 10 separate ThreadPool threads could produce ~2GB/s of data traffic on Windows, and ~6GB/s on Linux/Mono.
Increasing the client connection count did not improve the overall throughput; the total traffic just became distributed among the connections. This might be because the CPU load was 100% on all cores/threads, whether with 5, 10 or 200 clients.
I think the overall performance is not bad: 100 clients could produce around ~500Mbit/s of traffic each. (Of course this was measured on local connections; a real-life scenario on a network would be different.)
The only observation I would share: experimenting with both the Socket in/out buffer sizes and with the program read/write buffer sizes/loop cycles highly affected the performance and very differently on Windows and on Linux/Mono.
On Windows the best performance was reached with 128kB socket-receive, 32kB socket-send, 16kB program-read and 64kB program-write buffers.
On Linux the previous settings produced very weak performance, but 512kB for both socket-receive and socket-send, 256kB program-read and 128kB program-write buffer sizes worked best.
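Applied in code, those settings could look like the following sketch. The types `BufferSizes` and `BufferTuning` are hypothetical names used only for illustration, the numbers are the ones listed above, and the operating system may adjust the socket buffer sizes it is asked for.

```csharp
using System;
using System.Net.Sockets;

// Hypothetical container for the four buffer sizes discussed above.
public sealed class BufferSizes
{
    public int SocketReceive;
    public int SocketSend;
    public int ProgramRead;
    public int ProgramWrite;
}

public static class BufferTuning
{
    // Values that worked best in my tests on Windows.
    public static readonly BufferSizes Windows = new BufferSizes
    {
        SocketReceive = 128 * 1024, SocketSend = 32 * 1024,
        ProgramRead = 16 * 1024, ProgramWrite = 64 * 1024
    };

    // Values that worked best in my tests on Linux/Mono.
    public static readonly BufferSizes Linux = new BufferSizes
    {
        SocketReceive = 512 * 1024, SocketSend = 512 * 1024,
        ProgramRead = 256 * 1024, ProgramWrite = 128 * 1024
    };

    // Assumption of this sketch: every non-Windows platform gets the Linux values.
    public static BufferSizes ForCurrentPlatform()
    {
        bool isWindows = Environment.OSVersion.Platform == PlatformID.Win32NT;
        return isWindows ? Windows : Linux;
    }

    // Creates a TCP socket with the socket-level buffers applied; the
    // program-level sizes are for the caller's own byte[] buffers.
    public static Socket CreateTunedSocket(BufferSizes sizes)
    {
        var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        socket.ReceiveBufferSize = sizes.SocketReceive;
        socket.SendBufferSize = sizes.SocketSend;
        return socket;
    }
}
```

The program-read and program-write sizes are then used when allocating the arrays passed to `SetBuffer` on the receiving and sending SocketAsyncEventArgs.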
Now my only problem is that if I try to create 10000 connecting sockets, it just stops creating instances after around 7005. It does not throw any exception, and the program keeps running as if there were no problem. I don't know how it can leave a for loop without a break, but it does.
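One way to narrow this down might be to run the connect loop synchronously with explicit error reporting, on the assumption (which I have not confirmed) that an exception is raised somewhere it never gets observed. The helper `ConnectProbe.OpenMany` below is hypothetical and only a diagnostic sketch, not my real connection code.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;

// Hypothetical diagnostic helper: opens client sockets one by one and
// reports the index and exception of the first connection that fails.
public static class ConnectProbe
{
    public static List<Socket> OpenMany(EndPoint endPoint, int count,
                                        Action<int, Exception> onFailure)
    {
        var sockets = new List<Socket>(count);
        for (int i = 0; i < count; i++)
        {
            var socket = new Socket(endPoint.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
            try
            {
                socket.Connect(endPoint); // blocking connect, so errors surface here
                sockets.Add(socket);
            }
            catch (Exception ex)
            {
                socket.Close();
                onFailure(i, ex); // e.g. log ex.GetType() and the socket error code
                break;
            }
        }
        return sockets;
    }
}
```

If the loop stops at the same index here, the reported exception should at least show whether an operating system limit (for example on open handles or local ports) is being hit.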
Any help would be appreciated regarding anything I was talking about!