Does Linux have zero-copy? splice or sendfile?

Question

When splice was introduced, it was discussed on the kernel mailing list that sendfile had been re-implemented on top of splice. The splice documentation for SPLICE_F_MOVE states:

Attempt to move pages instead of copying. This is only a hint to the kernel: pages may still be copied if the kernel cannot move the pages from the pipe, or if the pipe buffers don't refer to full pages. The initial implementation of this flag was buggy: therefore starting in Linux 2.6.21 it is a no-op (but is still permitted in a splice() call); in the future, a correct implementation may be restored.

So does that mean Linux has no zero-copy method for writing to sockets? Or was this fixed at some point and nobody updated the documentation for years? Does either of sendfile or splice have a zero copy implementation in any of the latest 3.x kernel versions?

Since Google has no answer to this query, I'm creating a Stack Overflow question for the next poor schmuck who wants to know if there's any benefit to using vmsplice and splice, or sendfile, over plain old write.

old, but perhaps relevant: http://blog.superpat.com/2010/06/01/zero-copy-in-linux-with-sendfile-and-splice/comment-page-1/ — Paul, Jun 17 '14 at 04:46
I don't know much about splice, but if you're interested in zero-copy sockets specifically, you should take a look at memory mapped sockets: https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt — gct, Jun 23 '14 at 01:53
Under "NOTES" the `splice (2)` manpage says "Though we talk of copying, actual copies are generally avoided." So very likely things are zero-copy when possible, but the kernel will not error if it cannot do things zero copy. — Spudd86, Jul 15 '14 at 21:16
@gct packet_mmap is not zero copy, as there is no way to allocate DMA friendly memory in userspace code. There was a discussion about that somewhere on the interwebs, but it's been a long time and there's very little information on zero copy. It may have changed. — Eloff, Aug 08 '14 at 18:18
That doc doesn't say the memory is _allocated_ in userspace, it says it's allocated by the kernel and _mapped_ into userspace. Of course, you may still have to copy the data into that buffer ... — Useless, Aug 12 '14 at 12:56
Here's an explanation of why it's not zero copy: http://yusufonlinux.blogspot.com/2010/11/data-link-access-and-zero-copy.html?showComment=1291517960894#c3884991672834311362 — Eloff, Oct 19 '14 at 17:22
@Eloff: The sparkling-new AF_XDP achieves true zero-copy for raw packets, using ideas borrowed from Infiniband (RDMA) and DPDK. — Nemo, Mar 13 '19 at 21:32

score 21 · Accepted Answer · answered Aug 12 '14 at 13:33

sendfile has always been, and still is, zero-copy (assuming the hardware allows for it, which is usually the case). Being zero-copy was the entire point of having this syscall in the first place. Nowadays sendfile is implemented as a wrapper around splice.

That suggests that splice, too, is zero-copy, and this is indeed the case. At least in theory, and at least in some cases. The problem is figuring out how to correctly use it so it works reliably and so it is zero-copy. The documentation is... sparse, to say the least.

In particular, splice only works zero-copy if the pages were given as a "gift" (SPLICE_F_GIFT), i.e. you don't own them any more (formally, that is; in reality you still do). That is a non-issue if you simply splice a file descriptor onto a socket, but it is a big issue if you want to splice data from your application's address space, or from one pipe to another. It is unclear what to do with the pages afterwards (and when). The documentation states that you may not touch the pages afterwards or do anything with them, never, not ever. So if you follow the letter of the documentation, you must leak the memory.
That's obviously not correct (it can't be), but there is no good way of knowing (for you at least!) when it's safe to reuse or release that memory. The kernel doing a sendfile would know, since as soon as it receives the TCP ACK, it knows that the data is never needed again. The problem is, you don't ever get to see an ACK. All you know when splice has returned is that data has been accepted to be sent (but you have no idea whether it has already been sent or received, nor when this will happen).
Which means you need to figure this out somehow on an application layer, either by doing manual ACKs (comes for free with reliable UDP), or by assuming that if the other side sends an answer to your request, they obviously must have gotten the request.

Another thing you have to manage is the finite pipe space. The default is very small, but even if you increase the size, you can't just naively splice a file of any size. sendfile on the other hand will just let you do that, which is cool.
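
As an illustration of what managing the pipe space means in practice, here is a minimal sketch of the file-to-descriptor case done by hand with splice. The function name `splice_file_to_fd` is hypothetical, and the sketch assumes blocking descriptors and that `in_fd` is positioned where the transfer should start. Each round moves at most one pipe's worth of data, which is why a loop is needed.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: move `count` bytes from in_fd to out_fd through a
 * temporary pipe. Returns 0 on success, -1 on error or premature EOF. */
static int splice_file_to_fd(int in_fd, int out_fd, size_t count)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    int rc = -1;
    while (count > 0) {
        /* Fill the pipe: the kernel takes at most what the pipe can hold. */
        ssize_t in = splice(in_fd, NULL, p[1], NULL, count,
                            SPLICE_F_MOVE | SPLICE_F_MORE);
        if (in < 0) {
            if (errno == EINTR)
                continue;
            goto out;
        }
        if (in == 0)
            goto out; /* input ended before `count` bytes were moved */
        count -= (size_t)in;

        /* Drain everything we just queued before refilling the pipe. */
        while (in > 0) {
            ssize_t n = splice(p[0], NULL, out_fd, NULL, (size_t)in,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (n < 0 && errno == EINTR)
                continue;
            if (n <= 0)
                goto out;
            in -= n;
        }
    }
    rc = 0;
out:
    close(p[0]);
    close(p[1]);
    return rc;
}
```

A larger pipe (for example via `fcntl` with `F_SETPIPE_SZ`) would reduce the number of rounds, but the loop is still needed for files bigger than the pipe.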

All in all, sendfile is nice because it just works, and it works well, and you don't need to care about any of the above details. It's not a panacea, but it sure is a great addition.
I would, personally, stay away from splice and its family until the whole thing is greatly overhauled and until it is 100% clear what you have to do (and when) and what you don't have to do.

The real, effective gains over plain old write are marginal for most applications, anyway. I recall some less than polite comments by Mr. Torvalds a few years ago (when BSD had a form of write that would do some magic with remapping pages to get zero-copy, and Linux didn't) which pointed out that making a copy usually isn't any issue, but playing tricks with pages is [won't repeat that here].

Torvalds: "I claim that Mach people (and apparently FreeBSD) are incompetent idiots. Playing games with VM is bad. memory copies are _also_ bad, but quite frankly, memory copies often have _less_ downside than VM games, and bigger caches will only continue to drive that point home." http://yarchive.net/comp/linux/splice.html — Ben, Aug 12 '14 at 15:18
I've implemented vmsplice and splice with gifting pages and ack (application level ack, not tcp ack) based garbage collection in the past. It's tricky, there's a number of gotchas (that got me) and you can count on it adding about 500 lines of C++ over plain write. The easiest is if you're sending from a ring buffer because your iovec arrays are one element, the memory never needs to be freed, and you need a way of acking data sent. It can only be zero copy if your pages are DMAable (some cards can't DMA every address) and if your network card supports DMA. Otherwise the kernel will copy. — Eloff, Oct 19 '14 at 16:39
The quote from Torvalds is referring to how the BSD guys marked the pages as copy-on-write. This requires a TLB flush, which is very expensive, on the order of 2000 cycles, and if you write to that buffer before the kernel finishes with it, it does the copy anyway, putting you squarely in negative territory. — Eloff, Oct 19 '14 at 16:43
@Eloff, Linus said that with a circular buffer double the kernel buffer size, it can be done without safety checks: http://yarchive.net/comp/linux/splice.html — akostadinov, Dec 22 '15 at 16:24

score 4 · Answer 2 · answered Aug 12 '14 at 12:45

According to the splice man page as of 2014-07-08, I quote:

Though we talk of copying, actual copies are generally avoided. The kernel does this by implementing a pipe buffer as a set of reference-counted pointers to pages of kernel memory. The kernel creates "copies" of pages in a buffer by creating new pointers (for the output buffer) referring to the pages, and increasing the reference counts for the pages: only pointers are copied, not the pages of the buffer.

Therefore, yes, splice is documented to be currently zero copy in most cases.
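
One place where the reference-counted pipe buffers quoted above become visible from userspace is tee, a sibling of splice that duplicates pipe contents without consuming them. A small sketch, with the helper name `duplicate_pipe` being an invention for this example:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: duplicate up to `len` bytes queued in the pipe read
 * end `src_rd` into the pipe write end `dst_wr`. The data stays readable
 * from `src_rd` afterwards. Returns the number of bytes duplicated, or -1. */
static ssize_t duplicate_pipe(int src_rd, int dst_wr, size_t len)
{
    for (;;) {
        ssize_t n = tee(src_rd, dst_wr, len, 0);
        if (n < 0 && errno == EINTR)
            continue; /* interrupted before anything was duplicated */
        return n;
    }
}
```

After a successful call, both pipes can be read independently and yield the same bytes.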
