
Let's say we're building a threaded server intended to run on a system with four cores. The two thread management schemes I can think of are one thread per client connection and a queuing system.

As the first scheme's name implies, we'll spawn one thread per client that connects to our server. Assuming one thread is always dedicated to our program's main thread of execution, we'll be able to run up to three client threads in parallel; for any more simultaneous clients than that, we'll have to rely on the operating system's preemptive multitasking (or the VM's, in the case of green threads) to switch among them.

For our second approach, we'll make two thread-safe queues. One is for incoming messages and one is for outgoing messages. In other words, requests and replies. That means we'll probably have one thread accepting incoming connections and placing their requests into the incoming queue. One or two threads will handle the processing of the incoming requests, resolving the appropriate replies, and placing those replies on the outgoing queue. Finally, we'll have one thread just taking replies off of that queue and sending them back out to the clients.

What are the pros and cons of these approaches? Notice that I didn't mention what kind of server this is. I'm assuming that which one has the better performance profile depends on whether the server handles short connections, like web servers and POP3 servers, or longer-lived connections, like WebSocket servers, game servers, and messaging-app servers.

Are there other thread management strategies besides these two?

  • It really depends on how many clients you intend to have and what goes on in your request processing. Thread-per-connection is easy to implement but doesn't scale AT ALL – James May 31 '16 at 00:17
  • Assuming you have access to (or are willing to create) a decent thread-pool class, there really is no reason to ever use thread-per-client. The thread pool will give you the same behavior, at the same complexity level, but scale much better. – Jeremy Friesner May 31 '16 at 00:23

2 Answers


I believe I've done both organizations at one time or another.


Method 1

Just so we're on the same page, the first method has the main thread do a listen. Then, in a loop, it does an accept. It passes the returned socket to pthread_create, and the client thread's loop does recv/send, processing all the commands the remote client wants. When done, it cleans up and terminates.

For an example of this, see my recent answer: multi-threaded file transfer with socket

This has the virtue that the main thread and client threads are straightforward and independent. No thread waits on anything another thread is doing, and no thread waits on anything it doesn't have to. Thus, the client threads [plural] can all run at maximum line speed. Also, if a client thread is blocked on a recv or send and another thread can go, it will. It is self-balancing.

All thread loops are simple: wait for input, process, send output, repeat. Even the main thread is simple: sock = accept, pthread_create(sock), repeat.
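
Here is a minimal sketch of that structure (error handling trimmed for brevity; the echo reply and buffer size are just placeholders):

// thread-per-connection sketch -- error handling trimmed for brevity
#include <pthread.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

static void *client_thread(void *arg)
{
    int sock = (int) (intptr_t) arg;
    char buf[4096];
    ssize_t len;

    // recv/send loop: process whatever protocol client and server agree on
    while ((len = recv(sock, buf, sizeof(buf), 0)) > 0) {
        // ... decode request, compute reply ...
        send(sock, buf, len, 0);        // echo stands in for a real reply
    }

    close(sock);                        // client done -- clean up and terminate
    return NULL;
}

// main thread: listen, then accept/pthread_create in a loop
void serve(int listen_sock)
{
    listen(listen_sock, 64);
    for (;;) {
        int sock = accept(listen_sock, NULL, NULL);
        pthread_t tid;
        pthread_create(&tid, NULL, client_thread, (void *) (intptr_t) sock);
        pthread_detach(tid);            // no join needed; thread reaps itself
    }
}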

Another thing. The interaction between the client thread and its remote client can be anything they agree on. Any protocol or any type of data transfer.


Method 2

This is somewhat akin to an N worker model, where N is fixed.

Because the accept is [usually] blocking, we'll need a main thread similar to method 1's. Except that, instead of firing up a new thread, it mallocs a control struct [or uses some other mgmt scheme] and puts the socket in that. It then puts this struct on a list of client connections and loops back to the accept.
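
As a sketch, the control struct and accept loop might look like this (field names are illustrative):

// per-connection control block -- fields are illustrative
#include <pthread.h>
#include <stdlib.h>
#include <sys/socket.h>

struct client_ctl {
    int sock;                           // the accepted socket
    struct client_ctl *next;            // linkage for the connection list
    // ... per-connection state: partial recv buffer, client id, etc.
};

static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;
static struct client_ctl *conn_list;    // list the recv control thread polls over

void accept_thread(int listen_sock)
{
    for (;;) {
        int sock = accept(listen_sock, NULL, NULL);

        struct client_ctl *ctl = malloc(sizeof(*ctl));
        ctl->sock = sock;

        pthread_mutex_lock(&conn_lock);
        ctl->next = conn_list;          // push onto the connection list
        conn_list = ctl;
        pthread_mutex_unlock(&conn_lock);
    }
}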

In addition to the N worker threads, you are correct: we need at least two control threads, one to do select/poll, recv, and enqueue the request, and one to wait for a result, select/poll, and send it.

Two threads are needed so that neither has to wait on two different things at once: the various sockets [as a group] and the request/result queues from the various worker threads. With a single control thread, all actions would have to be non-blocking and the thread would spin like crazy.

Here is an [extremely] simplified version of what the threads look like:

// control thread for recv:
while (1) {
    // (1) do blocking poll on all client connection sockets for read
    poll(...)

    // (2) for all pending sockets do a recv for a request block and enqueue
    //     it on the request queue
    for (all in read_mask) {
        request_buf = dequeue(control_free_list)
        recv(request_buf);
        enqueue(request_list,request_buf);
    }
}

// control thread for send:
while (1) {
    // (1) do blocking wait on result queue

    // (2) peek at all result queue elements and create aggregate write mask
    //     for poll from the socket numbers

    // (3) do blocking poll on all client connection sockets for write
    poll(...)

    // (4) for all pending sockets that can be written to
    for (all in write_mask) {
        // find and dequeue first result buffer from result queue that
        // matches the given client
        result_buf = dequeue(result_list,client_id);
        send(result_buf);
        enqueue(control_free_list,result_buf);
    }
}

// worker thread:
while (1) {
    // (1) do blocking wait on request queue
    request_buf = dequeue(request_list);

    // (2) process request ...

    // (3) enqueue the result on the result queue
    enqueue(result_list,request_buf);
}
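
The enqueue/dequeue primitives above assume a blocking, thread-safe queue. A minimal sketch of the basic form, using a mutex and condition variable (the block fields are illustrative; the client_id-matching dequeue used by the send thread would add a search predicate):

#include <pthread.h>

struct buf {                            // request/result block -- fields illustrative
    struct buf *next;
    int client_id;
    // ... payload ...
};

struct queue {
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
    struct buf *head;
    struct buf *tail;
};

void enqueue(struct queue *q, struct buf *b)
{
    b->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail)
        q->tail->next = b;
    else
        q->head = b;
    q->tail = b;
    pthread_cond_signal(&q->nonempty);  // wake one blocked dequeuer
    pthread_mutex_unlock(&q->lock);
}

struct buf *dequeue(struct queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)             // blocking wait for work
        pthread_cond_wait(&q->nonempty, &q->lock);
    struct buf *b = q->head;
    q->head = b->next;
    if (q->head == NULL)
        q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return b;
}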

Now, a few things to notice. Only one request queue is used for all worker threads. The recv control thread does not try to pick an idle [or underutilized] worker thread and enqueue to a thread-specific queue [that is another option to consider].

The single request queue is probably the most efficient. But, maybe, not all worker threads are created equal. Some may end up on CPU cores [or cluster nodes] that have special acceleration H/W, so some requests may have to be sent to specific threads.

And, if that is done, can a thread do "work stealing"? That is, a thread completes all its work and notices that another thread has a request in its queue [that is compatible] but hasn't been started. The thread dequeues the request and starts working on it.
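
With per-thread queues, the stealing step might be sketched like this (try_dequeue is an assumed non-blocking variant of the dequeue above that returns NULL when the queue is empty):

// worker with work stealing -- try_dequeue() is an assumed helper
struct buf *try_dequeue(struct queue *q);

struct buf *next_request(struct queue *mine, struct queue *others[], int n)
{
    struct buf *b = try_dequeue(mine);
    if (b)
        return b;

    // own queue is empty: scan the other workers' queues for a request
    // [compatibility check omitted] that hasn't been started yet
    for (int i = 0; i < n; i++) {
        b = try_dequeue(others[i]);
        if (b)
            return b;
    }

    return dequeue(mine);               // nothing to steal -- block on our own queue
}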

Here's a big drawback to this method: the request/result blocks are of [mostly] fixed size. I've done an implementation where the control block could have a field for a "side/extra" payload pointer that could be of arbitrary size.

But, if doing a large file transfer, either upload or download, trying to pass it piecemeal through request blocks is not a good idea.

In the download case, the worker thread could usurp the socket temporarily and send the file data before enqueuing the result to the control thread.

But, for the upload case, if the worker tried to do the upload in a tight loop, it would conflict with the recv control thread. The worker would have to [somehow] alert the control thread not to include that socket in its poll mask.

This is beginning to get complex.

And, there is overhead to all this request/result block enqueue/dequeue.

Also, the two control threads are a "hot spot". The entire throughput of the system depends on them.

And, there are interactions between the sockets. In the simple case, the recv thread can start a recv on one socket, but other clients wishing to send requests are delayed until that recv completes. It is a bottleneck.

This means that all recv syscalls have to be non-blocking [asynchronous]. The control thread has to manage these async requests (i.e. initiate one and wait for an async completion notification, and only then enqueue the request on the request queue).
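
A sketch of that per-connection reassembly (a fixed-size request block is assumed, matching the fixed-size blocks above):

#include <sys/types.h>
#include <sys/socket.h>

#define REQ_SIZE 512                    // illustrative fixed request-block size

struct conn_state {
    int sock;                           // socket, opened with O_NONBLOCK set
    char partial[REQ_SIZE];             // bytes of the request received so far
    size_t have;
};

// called by the recv control thread when poll() marks this socket readable;
// returns 1 when a complete request has been assembled and can be enqueued
int nonblocking_recv_step(struct conn_state *c)
{
    ssize_t n = recv(c->sock, c->partial + c->have, REQ_SIZE - c->have, 0);
    if (n <= 0)
        return 0;                       // EAGAIN or peer closed [a real version
                                        // distinguishes the two]
    c->have += n;
    if (c->have < REQ_SIZE)
        return 0;                       // partial -- keep socket in the poll mask

    c->have = 0;                        // full request: caller enqueues it now
    return 1;
}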

This is beginning to get complicated.

The main benefit of all this is handling a large number of simultaneous clients (e.g. 50,000) while keeping the number of threads at a sane value (e.g. 100).

Another advantage of this method is that it is possible to assign priorities and use multiple priority queues.
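
For instance, with one queue per priority level, a worker drains the highest non-empty queue first. A sketch (reusing try_dequeue from above; starvation control is left out):

#include <sched.h>

#define NPRIO 3                         // 0 = highest priority

struct buf *dequeue_by_priority(struct queue *prio_queues[NPRIO])
{
    for (;;) {
        for (int p = 0; p < NPRIO; p++) {
            struct buf *b = try_dequeue(prio_queues[p]);
            if (b)
                return b;
        }
        // all empty -- a real version would block on a shared condition
        // variable rather than yield and rescan like this
        sched_yield();
    }
}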


Comparison and hybrids

Meanwhile, method 1 does everything that method 2 does, but in a simpler, more robust [and, I suspect, higher throughput] way.

After a method 1 client thread is created, it might split the work up and create several sub-threads. It could then act like the control threads of method 2. In fact, it might draw on these threads from a fixed N pool just like method 2.

This would compensate for a weakness of method 1: a client thread that is going to do heavy computation. With a large number of threads all doing computation, the system would get swamped. The queuing approach helps alleviate this. The client thread is still created/active, but it's sleeping on the result queue.

So, we've just muddied up the waters a bit more.

Either method could be the "front facing" method and have elements of the other underneath.

A given client thread [method 1] or worker thread [method 2] could farm out its work by opening [yet] another connection to a "back office" compute cluster. The cluster could be managed with either method.

So, method 1 is simpler and easier to implement and can easily accommodate most job mixes. Method 2 might be better for heavy compute servers, to throttle the requests to limited resources. But care must be taken with method 2 to avoid bottlenecks.

Craig Estey

I don't think your "second approach" is well thought out, so I'll just see if I can tell you how I find it most useful to think about these things.

Rule 1) Your throughput is maximized if all your cores are busy doing useful work. Try to keep your cores busy doing useful work.

These are things that can keep you from keeping your cores busy doing useful work:

  • you are keeping them busy creating threads. If tasks are short-lived, then use a thread pool so you aren't spending all your time starting up and killing threads.

  • you are keeping them busy switching contexts. Modern OSes are pretty good at multithreading, but if you've gotta switch jobs 10,000 times per second, that overhead is going to add up. If that's a problem for you, you'll have to consider an event-driven architecture or some other sort of more efficient explicit scheduling.

  • your jobs block or wait for a long time, and you don't have the resources to run enough threads to keep your cores busy. This can be a problem when you're serving protocols with persistent connections that hang around doing nothing most of the time, like WebSocket chat. You don't want to keep a whole thread hanging around doing nothing by tying it to a single client. You'll need to architect around that.

  • All your jobs need some other resource besides CPU, and you're bottlenecked on that -- that's a discussion for another day.

All that said... for most request/response kinds of protocols, passing each request or connection off to a thread pool that assigns it a thread for the duration of the request is easy to implement and performant in most cases.
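
A minimal sketch of that hand-off, with accepted sockets flowing through a small blocking queue to a fixed pool of workers (handle_connection is an assumed name for the protocol logic; queue-overflow handling is omitted):

// thread-pool hand-off sketch
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#define NWORKERS 8
#define QCAP 1024

static int fd_q[QCAP];
static int q_head, q_tail, q_count;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;

extern void handle_connection(int sock);    // request/response logic lives here

static void enqueue_fd(int fd)
{
    pthread_mutex_lock(&q_lock);
    fd_q[q_tail] = fd;
    q_tail = (q_tail + 1) % QCAP;
    q_count++;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_lock);
}

static int dequeue_fd(void)
{
    pthread_mutex_lock(&q_lock);
    while (q_count == 0)
        pthread_cond_wait(&q_cond, &q_lock);
    int fd = fd_q[q_head];
    q_head = (q_head + 1) % QCAP;
    q_count--;
    pthread_mutex_unlock(&q_lock);
    return fd;
}

static void *pool_worker(void *arg)
{
    (void) arg;
    for (;;) {
        int sock = dequeue_fd();        // thread owns the connection for the
        handle_connection(sock);        // duration of the request
        close(sock);
    }
    return NULL;
}

void start_pool_and_serve(int listen_sock)
{
    for (int i = 0; i < NWORKERS; i++) {
        pthread_t tid;
        pthread_create(&tid, NULL, pool_worker, NULL);
        pthread_detach(tid);
    }
    for (;;)
        enqueue_fd(accept(listen_sock, NULL, NULL));
}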

Rule 2) Given maximized throughput (all your cores are usefully busy), getting jobs done on a first-come, first-served basis minimizes latency and maximizes responsiveness.

This is true, but in most servers it is not considered at all. You can run into trouble here when your server is busy and jobs have to stop, even for short moments, to perform a lot of blocking operations.

The problem is that there is nothing to tell the OS thread scheduler which thread's job came in first. Every time your thread blocks and then becomes ready, it is scheduled on equal terms with all the other threads. If the server is busy, that means that the time it takes to process your request is roughly proportional to the number of times it blocks. That is generally no good.

If you have to block a lot in the process of processing a job, and you want to minimize the overall latency of each request, you'll have to do your own scheduling that keeps track of which jobs started first. In an event-driven architecture, for example, you can give priority to handling events for jobs that started earlier. In a pipelined architecture, you can give priority to later stages of the pipeline.
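
For example, in an event-driven design, one way to do that is to stamp each job with its arrival time and always service the ready job that arrived first (a sketch; the job fields are illustrative):

#include <stddef.h>
#include <time.h>

struct job {
    struct timespec arrived;            // stamped when the request first came in
    int ready;                          // nonzero if its next event can run now
    struct job *next;
};

static int earlier(const struct timespec *a, const struct timespec *b)
{
    if (a->tv_sec != b->tv_sec)
        return a->tv_sec < b->tv_sec;
    return a->tv_nsec < b->tv_nsec;
}

// linear scan for clarity; a real server would keep a min-heap on arrival time
struct job *pick_next(struct job *all)
{
    struct job *best = NULL;
    for (struct job *j = all; j != NULL; j = j->next)
        if (j->ready && (best == NULL || earlier(&j->arrived, &best->arrived)))
            best = j;
    return best;
}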

Remember these two rules, design your server to keep your cores busy with useful work, and do first things first. Then you can have a fast and responsive server.

Matt Timmermans
  • Switching jobs 10,000 times per second still provides hundreds of thousands of clock cycles per time slice. You'd need to switch millions of times per second for it to be a major concern. – Warren Dew May 31 '16 at 06:42