I have been building a high-throughput server application for multimedia messaging, language of implementation is C++. Each server can be used in stand-alone mode or many servers can be joined together to create a DHT-based overlay network; the servers act like super-peers like in case of Skype.
The work is in progress. Currently the server can handle around 200,000 messages per second (256 byte messages) and has a max throughput of around 256 MB/s on my machine (Intel i3 Mobile 2 GHz, Fedora Core 18 (64-bit), 4 GB RAM) for messages of length 4096 bytes. The server has got two threads, one thread for handling all IOs (epoll-based, edge triggered) another one for processing the incoming messages. There is another thread for overlay management, but it doesn't matter in the current discussion.
The two threads in discussion share data using two circular buffers. Thread number 1 enqueues fresh messages for the thread number 2 using one circular buffer, while thread number 2 returns back the processed messages through the other circular Buffer. The server is completely lock-free. I am not using any synchronization primitive what-so-ever, not even atomic operations.
The circular buffers never overflow, because the messages are pooled (pre-allocated on start). In fact all vital/frequently-used data-structures are pooled to reduce memory fragmentation and to increase cache efficiency, hence we know the maximum number of messages we are ever going to create when the server starts, hence we can pre-calculate the maximum size of the buffers and then initialize the circular buffers accordingly.
Now my question: Thread #1 enqueues the serialized messages one message at a time (actually the pointers to message objects), while thread #2 takes out messages from the queue in chunks (chunks of 32/64/128), and returns back the processed messages in chunks through the second circular buffer. In case there are no new messages thread #2 keeps busy waiting, hence keeping one of the CPU cores busy all the time. How can I improve upon the design further? What are the alternatives to the busy wait strategy? I want to do this elegantly and efficiently. I have considered using semaphores, but I fear that is not the best solution for a simple reason that I have to call "sem_post" every time I enqueue a message in the thread #1 which might considerably decrease the throughput, the second thread has to call "sem_post" equal number of times to keep the semaphore from overflowing. Also I fear that a semaphore implementation might be using a mutex internally.
The second good option might be use of signal if I can discover an algorithm for raising signal only if the second thread has either "emptied the queue and is in process of calling sigwait" or is "already waiting on sigwait", in short the signal must be raised minimum number of times, although it won't hurt if signals are raised a few more times than needed. Yes, I did use Google Search, but none of the solutions I found on Internet were satisfactory. Here are a few considerations:
A. The server must waste minimum CPU cycles while making system calls, and system calls must be used a minimum number of times.
B. There must be very low overhead and the algorithm must be efficient.
C. No locking what-so-ever.
I want all options to be put on table.
Here is the link to the site where I have shared info about my server, to better understand the functionality and the purpose: www.wanhive.com