
I have multiple threads that need to consume data from a TCP stream. I wish to use a circular buffer/queue in shared memory to read from the TCP socket. The TCP receive will write directly to the circular queue. The consumers will read from the queue.

This design should enable zero-copy and zero-lock. However, there are two different issues here.

  1. Is it possible/efficient to read just 1 logical message from the TCP socket? If not, and I read more than 1 message, I will have to copy the residual bytes from this buffer to the next (this to this->next).

  2. Is it really possible to implement a lock-less queue? I know there are atomic operations, but these can be costly too, because the affected cache lines have to be invalidated on other cores. This will affect all operations on all of my 24 cores.

I'm a little rusty on low-level TCP, and not exactly clear on how to tell when a message is complete. Do I look for \0, or is it implementation-specific?

ty

jaybny
  • What is a message in your space? You have lofty goals, but is the performance of the simple, straightforward solution not sufficient? – Nim Jul 02 '12 at 14:23
  • Lock-free does not mean "not costly", though lockfree in practice is usually orders of magnitude faster than the context switches involved in locking. What you want can be achieved solely with atomic increments, which is ultra fast, provided that you make your buffer a suitable size, e.g. 65536. All in all though, I agree with Nim. – Damon Jul 02 '12 at 14:34
  • @Damon, interesting, I've not seen a ring implementation that relies solely on atomic increments; maybe you have a paper (link) or something for this? The only implementations I've seen (and have done) rely on compare/exchange operations. – Nim Jul 02 '12 at 14:49
  • @Nim I just want to process the TCP data immediately without copying the buffer, but I can't have the TCP messages queue up, so I need a separate thread to receive and a separate thread to process. This is a high-frequency trading application, and the buffer copy operation is too expensive. – jaybny Jul 02 '12 at 14:50
  • @jaybny copying byte sequences is a very fast operation. Don't optimize unless performance-profiler shows that the copying is the bottleneck (which is very unlikely). – Igor R. Jul 02 '12 at 15:58
  • Memory bandwidth would be in the range of 10s of GB/s. What's your network speed? Giga **bit** ? – Bo Persson Jul 02 '12 at 16:06
  • @IgorR. in my line of work any extra CPU cycles need to be eliminated. It's not a question of performance bottlenecks. Even if the TCP copy only contributes a tiny fraction of overall latency, it still needs to be eliminated. We don't measure actual performance but relative performance. If system A takes 500,000 microseconds and system B takes 500,001 microseconds, then A is "faster" than B, end of story. It's a race; winner takes all. – jaybny Jul 02 '12 at 16:06
  • @BoPersson 10-gigabit - how does that come into play? – jaybny Jul 02 '12 at 16:08
  • Copying byte sequences is indeed quite fast. It's also avoidable, and quite easily so given a good protocol that doesn't involve costly iterations of all data, looking for end-of-message characters. – Martin James Jul 02 '12 at 21:37
  • If you pass object pointers, instead of actual bulk data, the size of the buffers is somewhat irrelevant - you're only passing 32/64 bits per buffer anyway. Queueing up 100MB takes the same time as queueing up 10 bytes. That, and it's actually easier to manage than some nasty, shared bulk-data queue with dodgy lock-free pointers and riddled with cache thrashing issues. – Martin James Jul 02 '12 at 21:41
    Note that "bulk data" is conceptually wrong to think about in the first place. Although a lot of data may arrive on the wire at all times, there is never something like a "bulk", there must not be. If there is, your system won't work. Data that comes in is placed on the queue and worked off by the consumers _faster than new data is coming in_. Your consumers in summary **must** be faster, with a good security margin, than any data that can possibly come in. Otherwise, your work queue, no matter how you implement it and no matter how much RAM you have, will inevitably overflow. – Damon Jul 03 '12 at 08:51
  • Just FYI, you may as well try reading about [netmap](https://www.usenix.org/conference/usenixsecurity12/technical-sessions/presentation/rizzo), [pf_ring](http://www.ntop.org/products/pf_ring/), or [SolarFlare OpenOnload](http://www.openonload.org/) – nodakai Jan 20 '14 at 06:13

2 Answers


Unfortunately, TCP cannot transfer messages, only byte streams. If you want to transfer messages, you will have to apply a protocol on top. The best protocols for high performance are those that use a sanity-checkable header specifying the message length - this allows you to read the correct amount of data directly into a suitable buffer object without iterating over the data byte-by-byte looking for an end-of-message character. The buffer POINTER can then be queued off to another thread and a new buffer object created/depooled for the next message. This avoids any copying of bulk data and, for large messages, is sufficiently efficient that using a non-blocking queue for the message object pointers is somewhat pointless.
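
For illustration, a minimal sketch of that header-then-payload read, assuming BSD-style sockets and a hypothetical 4-byte big-endian length prefix; readAll() is just a helper that loops over recv() until the requested count has arrived:

#include <sys/socket.h>
#include <arpa/inet.h>
#include <cstdint>
#include <cstddef>

// Loop until exactly 'len' bytes have arrived - TCP may deliver fewer per recv() call.
static bool readAll(int sock, char *dst, size_t len) {
    while (len > 0) {
        ssize_t n = recv(sock, dst, len, 0);
        if (n <= 0) return false;              // peer closed the connection, or error
        dst += n;
        len -= static_cast<size_t>(n);
    }
    return true;
}

// Read one length-prefixed message straight into the caller's buffer - no bulk copy.
// Returns the payload length, or -1 on error / implausible header.
ssize_t readMessage(int sock, char *payload, size_t maxLen) {
    uint32_t wireLen;
    if (!readAll(sock, reinterpret_cast<char*>(&wireLen), sizeof wireLen)) return -1;
    uint32_t len = ntohl(wireLen);             // sanity-checkable header: just a byte count
    if (len == 0 || len > maxLen) return -1;   // reject nonsense lengths
    if (!readAll(sock, payload, len)) return -1;
    return static_cast<ssize_t>(len);
}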

The next optimization available is to pool the object *buffers to avoid continual new/dispose, recycling the *buffers in the consumer thread for re-use in the network receiving thread. This is fairly easy to do with a ConcurrentQueue, preferably blocking to allow flow-control instead of data corruption or segfaults/AV if the pool empties temporarily.
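
A minimal sketch of such a blocking pool - a plain mutex/condition-variable pair stands in here for whatever concurrent queue you actually use, and the names are placeholders. The point is only that an empty pool makes the network thread wait instead of corrupting data:

#include <condition_variable>
#include <mutex>
#include <queue>

// Blocking pool of pre-allocated buffer pointers. pop() parks the network thread
// when the pool is empty, giving flow control instead of overruns.
template <typename TBuffer>
class BufferPool {
    std::queue<TBuffer*> free_;
    std::mutex mtx_;
    std::condition_variable cv_;
public:
    void push(TBuffer *buf) {                    // consumer returns a finished buffer
        { std::lock_guard<std::mutex> lk(mtx_); free_.push(buf); }
        cv_.notify_one();
    }
    TBuffer* pop() {                             // receiver depools the next buffer
        std::unique_lock<std::mutex> lk(mtx_);
        cv_.wait(lk, [this]{ return !free_.empty(); });
        TBuffer *buf = free_.front();
        free_.pop();
        return buf;
    }
};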

Next, add a [cacheline size] 'dead-zone' at the start of each *buffer data member, so preventing any thread from false-sharing data with any other.

The result should be a high-bandwidth flow of complete messages into the consumer thread with very little latency, CPU waste or cache-thrashing. All your 24 cores can run flat-out on different data.

Copying bulk data in multithreaded apps is an admission of poor design and defeat.

Follow up..

Sounds like you're stuck with iterating over the data because of the different protocols :(

False-sharing-free PDU buffer object, example:

class TbufferPool;          // buffer pool type - defined elsewhere
enum EpduState : int;       // protocol-decoding state - defined elsewhere

typedef struct {
  char deadZone[256];   // anti-false-sharing padding
  int dataLen;
  char data[8388608];   // 8 meg of data
} SbufferData;

class TdataBuffer {
private:
  TbufferPool *myPool;  // reference to pool used, in case more than one
  EpduState PDUstate;   // enum state variable used to decode protocol
protected:
  SbufferData netData;
public:
  virtual ~TdataBuffer() {}                      // buffers are handled polymorphically via pointers
  virtual void reInit();                         // zeros dataLen, resets PDUstate etc. - call when depooling a buffer
  virtual int loadPDU(char *fromHere, int len);  // loads protocol unit
  void release();                                // pushes 'this' back onto 'myPool'
};

loadPDU gets passed a pointer to, and the length of, the raw network data. It returns either 0, meaning it has not yet completely assembled a PDU, or the number of bytes it consumed from the raw network data to completely assemble a PDU. In the latter case, queue the buffer off, depool another one and call loadPDU() with the unused remainder of the raw data, then continue with the next raw data to come in.
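
A rough sketch of the receive loop that drives loadPDU() - the pool and consumer-queue types are placeholders for whatever containers you actually use:

#include <sys/socket.h>

// Network thread: feed raw bytes into the current buffer and ship each completed PDU.
template <typename Pool, typename Queue>
void receiveLoop(int sock, Pool &pool, Queue &consumers) {
    TdataBuffer *current = pool.pop();            // depool the first buffer
    current->reInit();
    char raw[65536];
    for (;;) {
        ssize_t got = recv(sock, raw, sizeof raw, 0);
        if (got <= 0) break;                      // connection closed or error
        char *p = raw;
        int left = static_cast<int>(got);
        while (left > 0) {
            int used = current->loadPDU(p, left); // 0 => PDU still incomplete, all bytes eaten
            if (used == 0) break;                 // wait for the next recv()
            consumers.push(current);              // complete PDU: queue the POINTER off
            current = pool.pop();                 // depool a fresh buffer
            current->reInit();
            p += used;                            // hand loadPDU the unused remainder
            left -= used;
        }
    }
}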

You can use different pools of different derived buffer classes to serve different protocols, if needed - an array of TbufferPool[Eprotocols]. TbufferPool could just be a BlockingCollection queue. Management becomes almost trivial - the buffers can be sent on queues all round your system, to a GUI to display stats, then perhaps to a logger, as long as, at the end of the chain of queues, something calls release().

Obviously, a 'real' PDU object would have loads more methods, data unions/structs, iterators maybe and a state-engine to operate the protocol, but that's the basic idea anyway. The main thing is easy management, encapsulation and, since no two threads can ever operate on the same buffer instance, no lock/synchro required to parse/access the data.

Oh, yes, and since no queue has to remain locked for longer than required to push/pop one pointer, the chances of actual contention are very low - even conventional blocking queues would hardly ever need to use kernel locking.

Martin James
  • Thanks! Unfortunately I have no control over the message protocol. My code will be connecting to various public exchanges and private vendor ticker plants. Hopefully they use headers/lengths. Unless this is the standard, I will probably have to design the TCP receive code independent of the message protocol. I'm sure I can find a design pattern for this or even look at ACE source? Thanks @Martin James – jaybny Jul 03 '12 at 00:11
  • "Next, add a [cacheline size] 'dead-zone' at the start of each *buffer data member" Im not quite sure I understand the problem this solves. Also, what is a "cachline size"? CPU specific ? 32/64 bits? ty – jaybny Jul 03 '12 at 00:29
  • @jaybny - Google 'false sharing' – Martin James Jul 03 '12 at 00:56
  • Update: the initial project will connect to two different venues. Venue1 uses a sanity-header-with-length protocol. Venue2 uses a null-terminated msgId header followed by null-terminated fields of different types, which ends up being a fixed-size structure based on the msgId. No idea why the nulls are needed; I hope to be able to deduce the length for the msgId. Either way, I will concentrate on Venue1 for the high-performance code. Venue2's protocol is so terrible, it's not worth the effort. – jaybny Jul 03 '12 at 04:12
  • Your anti-false-sharing buffer will needlessly trash cache and increase the working set. Once you trigger a single additional page fault, any thoughts on cache effects are irrelevant. Your buffer zone has twice the size (Itanium and future Haswell processors) or four times the size (most present mainstream chips) of a cache line. It should ideally be just large enough to guarantee that each message starts in a new cache line. This is admittedly hard to get exactly right with variable length messages, but something like 64-sizeof(int) would still be much better. That said... – Damon Jul 03 '12 at 08:38
  • ... the best protocol is not one with a sanity-checkable header and a length field, but one with a fixed-size payload. ATM works that way. It's dead simple and failsafe, and if you don't choose a stupid size (like in ATM, sigh...) then it aligns perfectly and avoids all cache issues automatically. Luckily, something like "a high-speed trading application" should lend itself perfectly to a fixed-size message format. With a fixed size, ideally a small power of two, you can do the "lockfree only with atomic adds" that I mentioned above (it only works with many assumptions, obviously). – Damon Jul 03 '12 at 08:40
  • @Damon - my example was just that - an example. Protocols with small fixed sizes are great - inside ATM switches, i.e. underneath TCP. Directly on top of windowed protocols like TCP, they are not so good. – Martin James Jul 03 '12 at 11:08
  • 256-byte cacheline size? SPARC processor? 64-byte cachelines have been typical for Intel x86 processors for the past 5 years or so. – user1610743 Mar 03 '14 at 15:41
  • `class TdataBuffer: public{` Is the class to inherit from missing? – mike Aug 21 '15 at 07:51

If you are using Windows 8 or Windows Server 2012, Registered I/O can be used, which offers higher bandwidth at lower CPU cost than regular IOCP; it does this by cutting out kernel transitions and enabling zero-copy, amongst other things.

API: http://msdn.microsoft.com/en-us/library/windows/desktop/ms740642%28v=vs.85%29.aspx

Background info: http://www.serverframework.com/asynchronousevents/rio/
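
For orientation only, a minimal sketch (untested, error handling omitted, link with ws2_32.lib) of how the RIO function table is obtained; the registered buffers, completion queue and request queue are then created through the returned function pointers:

#include <winsock2.h>
#include <mswsock.h>   // RIO_EXTENSION_FUNCTION_TABLE, WSAID_MULTIPLE_RIO

// Create a socket flagged for Registered I/O and fetch the RIO function table.
// WSAStartup() must already have been called.
bool initRio(SOCKET &s, RIO_EXTENSION_FUNCTION_TABLE &rio) {
    s = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP,
                  NULL, 0, WSA_FLAG_REGISTERED_IO);
    GUID rioId = WSAID_MULTIPLE_RIO;
    DWORD bytes = 0;
    int rc = WSAIoctl(s, SIO_GET_MULTIPLE_EXTENSION_FUNCTION_POINTERS,
                      &rioId, sizeof(rioId), &rio, sizeof(rio),
                      &bytes, NULL, NULL);
    return rc == 0;
}
// From here, rio.RIORegisterBuffer() pins your (pooled) buffers once, and
// rio.RIOCreateCompletionQueue() / RIOCreateRequestQueue() / RIOReceive()
// post receives that complete directly into those registered buffers.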

Ben Adams