
I am trying to create a server which I expect to have high performance demands. This question deals with the server core: which programming techniques best support high performance?

  1. Do you split sockets into different threads and call blocking recv() on each?
  2. Do you have one thread which sits in a select() loop and then notifies another thread to process the individual ports?
  3. Do you have one thread which processes the select() and the response?
  4. Do you do 2 or 3 but with clusters of ports instead of all of them?
  5. Does using blocking vs. non-blocking sockets matter if you use select() as described above?
  6. Which setsockopt options improve performance: TCP_NODELAY, others?

I realize that some of these depend on the use case. For example, item 6 with TCP_NODELAY off would hurt latency if there are a lot of small packets. Item 3 sounds like it might be faster if the response is trivial. Any other considerations that I haven't thought of that affect performance would be appreciated as well.

chacham15
    "realize that some of these depend on the use case" - yes it does.. what's the *exact* use case? have many simultanous clients? do you do CPU intensive work, etc etc etc... – Karoly Horvath Nov 01 '12 at 01:34

3 Answers


I would start with a single-threaded approach: Use non-blocking I/O, and a fast polling mechanism like edge-triggered epoll on Linux. (Other platforms have similar technologies.) Centering everything around your polling loop simplifies the program design massively, so I would definitely throw signalfds, timerfds and eventfds in there, too. Then everything is handled by one central loop.

If and when you need to go multi-threaded, this may be as simple as running the main loop several times concurrently. If you set events to "one-shot", they'll be disabled from the poll until rearmed, so the thread that processes an event can safely assume it is the only thread doing so (and re-arm the event at the end). You only need to synchronise the communication between different parts of your program, or shared data access; a lot of the synchronisation is already taken care of by the poller.
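A minimal sketch of what such a loop might look like on Linux, assuming `listen_fd` is an already bound, listening, non-blocking socket (error handling and per-connection write buffering are trimmed; this is illustrative, not a complete server):

```c
/* Minimal sketch of a single-threaded, edge-triggered epoll loop (Linux).
   Assumes listen_fd is already bound, listening and non-blocking.
   Error handling and per-connection write buffering are omitted. */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

static void set_nonblocking(int fd)
{
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
}

void run_event_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[64];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; ++i) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                /* Edge-triggered: keep accepting until the backlog is drained. */
                int client;
                while ((client = accept(listen_fd, NULL, NULL)) >= 0) {
                    set_nonblocking(client);
                    struct epoll_event cev = { .events = EPOLLIN | EPOLLET,
                                               .data.fd = client };
                    epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
                }
            } else {
                /* Edge-triggered: keep reading until EAGAIN; echo back what we got. */
                char buf[4096];
                ssize_t r;
                while ((r = read(fd, buf, sizeof buf)) > 0)
                    write(fd, buf, (size_t)r);  /* real code must handle short writes */
                if (r == 0 || (r < 0 && errno != EAGAIN && errno != EWOULDBLOCK))
                    close(fd);                  /* peer closed, or a hard error */
            }
        }
    }
}
```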

Kerrek SB
  • is this epoll similar to Windows' CreateIOCompletionPort? –  Nov 01 '12 at 05:03
  • 1
    I don't see this as a good place to start; most people are not sufficiently proficient with state machine design to consider this easy, and even if that is conceptually easy for you, the buffer management (partial messages received, write buffer full, etc.) is a huge pain. Traditionally this approach was the fastest (with slow context switching for multi-process servers), but switching between threads in the same process is faster than most syscalls (see [my question](http://stackoverflow.com/questions/5958941/) on measuring it) so now that approach is competitive. – R.. GitHub STOP HELPING ICE Nov 01 '12 at 06:05
  • @R..: Perhaps. It may not be the *easiest* approach, but it's very powerful and you can get a lot of mileage out of it without the need for multithreading - *and* there's a natural direction in which to scale it up if you do decide to multithread. Yes, it's complex, but C++ is the ideal language for tackling that complexity by breaking it down into smaller sub-problems. A write queue for the (rare?) event that your write fails, and when your socket polls as ready-to-write, you drain the queue first... all that can be hidden very nicely in a suitable class. – Kerrek SB Nov 01 '12 at 12:48
  • It sounds easy until you consider failure cases. What do you do when allocation of more buffer space fails? In the multi-threaded approach, the full state of a connection is on the calling thread's stack, and the thread just blocks if the write can't complete. Accepting new connections may fail (or stall if there's no thread ready to `accept` them) when memory is unavailable, but you don't have to handle the complexity of what to do with existing connections that can't proceed. – R.. GitHub STOP HELPING ICE Nov 01 '12 at 18:02

The easiest thing to code, in my opinion, is one thread per connection using blocking I/O. It is also easy to write portably using your favorite threading model.

The problem with multiplexing non-blocking I/O is maintaining state for each connection. For example, I want to write 1024 bytes, but write only consumed 900... so now I have to remember the remaining 124 bytes and write them at some later time. And that is just state at the raw "send a buffer" level; consider the state of your entire protocol and it can become complex quickly. Nothing impossible, of course, but it is far simpler to just use blocking calls, assuming the connections do not need to interact with each other (much).
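A rough sketch of the thread-per-connection idea with POSIX threads, assuming `listen_fd` is an already bound and listening blocking socket (names and error handling are illustrative only):

```c
/* Rough sketch of "one thread per connection" with blocking I/O (POSIX).
   Assumes listen_fd is already bound and listening; error handling is minimal. */
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Even a blocking write() may send fewer bytes than asked for, so loop until
   everything is out. The "remaining bytes" state is just a local variable. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

static void *handle_connection(void *arg)
{
    int fd = (int)(intptr_t)arg;
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {  /* blocks until data arrives */
        if (write_all(fd, buf, (size_t)n) < 0)
            break;
    }
    close(fd);
    return NULL;
}

void serve(int listen_fd)
{
    for (;;) {
        int client = accept(listen_fd, NULL, NULL);  /* blocks for the next connection */
        if (client < 0)
            continue;
        pthread_t tid;
        pthread_create(&tid, NULL, handle_connection, (void *)(intptr_t)client);
        pthread_detach(tid);  /* one detached thread per socket, fire and forget */
    }
}
```

The point of the sketch is that each connection's progress - how much is left to write, where it is in the protocol - lives in local variables on that thread's stack, so there is no per-connection state machine to maintain by hand.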

I have used this approach for a modest number (~dozens) of connections and moved data at over a gigabyte per second sustained on a pair of 10GbE links. The Linux kernel's scheduler is pretty good at handling thread counts in this range.

For a Web server type thing serving thousands or tens of thousands of clients... Well, I have not tried personally. I have read that multiplexing techniques (epoll etc.) are faster in that scenario. So as others have said, it depends on your application.

But if your application is like mine (modest number of connections, limited interaction among them), the "one thread per connection" approach wins hands down, IMO.

Nemo
  • This is definitely the simplest design, and it also has the *potential* to be the highest performance (fewest number of user/kernel transitions), especially if you're not constantly adding and removing connections. – R.. GitHub STOP HELPING ICE Nov 01 '12 at 05:58
  • While I appreciate the answer for its advice about the faint of heart, the question was about performance, not codability. – chacham15 Nov 03 '12 at 16:47
  • @chacham15: Yeah, I should have made it more clear. In my application, I need to move 1+ gigabyte/second of data sustained on commodity (i.e. cheap) hardware. One thread per connection easily lets me saturate a 10GbE link with anywhere up to 100 connections or so. Beyond that I do not have personal experience... But as R. points out, there are reasons to expect this design to perform best in many cases. (My own is one of them.) Also, the point is not to be "faint of heart", the point is to _write maintainable code_. – Nemo Nov 03 '12 at 16:55
  • @Nemo yes, I suppose I should have specified more of the characteristics of my application. I did not because I wanted to get a general sense of what the options were and how they varied. TBH, today the code implements exactly what you describe (for exactly the reason of maintainability), but I wanted to continue forward in a way that I wouldn't have to shoot myself in the foot should it become necessary to go the other route. – chacham15 Nov 03 '12 at 17:02

It depends.

This type of question is very hard to answer; answering it will be one of the tasks of the project itself. You will need to measure the performance of your server under the workload it is going to face and then see which options work best for your use case.

For example, setting TCP_NODELAY will reduce the latency of requests, but the default behaviour (Nagle's algorithm) is there for a reason; you can decrease throughput by setting TCP_NODELAY if you make many small writes.
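For reference, it is a one-line socket option; a minimal sketch, assuming `sock` is a connected TCP socket descriptor:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle's algorithm: small writes are sent immediately
   (lower latency, but potentially many more small packets on the wire). */
static int enable_nodelay(int sock)
{
    int one = 1;
    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
}
```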

The following website has some information that you should look through: http://www.kegel.com/c10k.html. Some of it is a bit old now (by a few years), but it contains a list of the technologies that you should consider using: epoll, asynchronous I/O.

You should set about designing your system in a modular fashion so that your workers aren't tied to a specific implementation (select/poll/epoll). Things like setsockopt can be changed easily later and you shouldn't worry about them at all.
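One hypothetical way to keep that seam, sketched with made-up names (this is not any particular library's API): hide the readiness mechanism behind a small interface so worker code never mentions select/poll/epoll directly.

```c
/* Hypothetical poller interface; the names are illustrative only.
   Each backend (select, poll, epoll, ...) supplies one poller_ops table,
   and worker code only ever calls through the function pointers. */
typedef struct poller poller;   /* opaque, defined by the chosen backend */

typedef struct {
    poller *(*create)(void);
    int     (*add_fd)(poller *p, int fd, void *user_data);
    int     (*remove_fd)(poller *p, int fd);
    /* Blocks for up to timeout_ms, fills 'ready' with the user_data pointers
       of ready descriptors, and returns how many are ready (or -1 on error). */
    int     (*wait)(poller *p, void **ready, int max_ready, int timeout_ms);
    void    (*destroy)(poller *p);
} poller_ops;

/* Provided by hypothetical backend files such as select_poller.c, epoll_poller.c. */
extern const poller_ops select_poller_ops;
extern const poller_ops epoll_poller_ops;
```

Swapping implementations is then a matter of picking a different `poller_ops` table at startup; setsockopt details can likewise be hidden inside each backend.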

Make it work first - then make it "fast", whatever you mean by "fast". If you want something that scales, be aware of the big-O complexity of your algorithms (O(n), O(n^2), etc.).

dave