I'm running Linux 5.1 on a Cyclone V SoC, which is an FPGA with two ARMv7 cores in one chip. My goal is to gather lots of data from an external interface and stream (part of) this data out through a TCP socket. The challenge here is that the data rate is very high and could come close to saturating the GbE interface. I have a working implementation that just uses write() calls to the socket, but it tops out at 55MB/s, roughly half the theoretical GbE limit. I'm now trying to get zero-copy TCP transmission to work to increase the throughput, but I'm hitting a wall.
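For reference, the "theoretical GbE limit" can be estimated with a little arithmetic. The sketch below assumes a standard 1500-byte MTU, IPv4, and 12 bytes of TCP timestamp options per segment; the function name tcp_goodput is made up for illustration. Under those assumptions the ceiling comes out a little under 118 MB/s, so 55MB/s is slightly less than half of it.

```c
/* Usable TCP payload rate on Ethernet, in bytes per second.
 * Assumes standard framing around each MTU-sized IP packet:
 * 8 B preamble + 14 B Ethernet header + 4 B FCS + 12 B inter-frame gap,
 * and 20 B IPv4 + 20 B TCP headers (plus TCP option bytes) inside it. */
double tcp_goodput(double link_bits_per_s, int mtu, int tcp_option_bytes)
{
    const int wire_overhead = 8 + 14 + 4 + 12;
    const int payload = mtu - 20 - 20 - tcp_option_bytes;

    return link_bits_per_s / 8.0 * payload / (mtu + wire_overhead);
}
```

Calling tcp_goodput(1e9, 1500, 12) gives the figure used above; passing 0 for the option bytes gives the slightly higher limit without TCP timestamps.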
To get the data out of the FPGA into Linux user-space, I've written a kernel driver. This driver uses a DMA block in the FPGA to copy a large amount of data from an external interface into DDR3 memory attached to the ARMv7 cores. The driver allocates this memory as a bunch of contiguous 1MB buffers when probed using dma_alloc_coherent() with GFP_USER, and exposes these to the userspace application by implementing mmap() on a file in /dev/ and returning an address to the application using dma_mmap_coherent() on the preallocated buffers.
So far so good; the user-space application is seeing valid data and the throughput is more than enough at >360MB/s with room to spare (the external interface is not fast enough to really see what the upper bound is).
To implement zero-copy TCP networking, my first approach was to use SO_ZEROCOPY on the socket:
sent_bytes = send(fd, buf, len, MSG_ZEROCOPY);
if (sent_bytes < 0) {
    perror("send");
    return -1;
}
However, this results in send: Bad address.
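For context, here is a minimal sketch of what the SO_ZEROCOPY path generally requires according to the kernel's MSG_ZEROCOPY documentation: the socket option has to be enabled before sends with MSG_ZEROCOPY are issued. The helper names enable_zerocopy and send_zerocopy are hypothetical and not taken from the actual program, and the fallback constants are the generic Linux values, which is an assumption about the target's headers.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Fallbacks for older libc headers. These are the asm-generic Linux values
 * (they apply to ARM); a few other architectures use different numbers. */
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Opt the socket in to zero-copy sends. Returns 0, or -1 with errno set. */
int enable_zerocopy(int fd)
{
    int one = 1;

    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/* Queue buf for transmission without copying it into kernel socket buffers.
 * The caller must leave buf untouched until the kernel reports completion
 * on the socket's error queue (recvmsg() with MSG_ERRQUEUE). */
ssize_t send_zerocopy(int fd, const void *buf, size_t len)
{
    return send(fd, buf, len, MSG_ZEROCOPY);
}
```

With ordinary heap or anonymous memory as buf this sequence is expected to work; the failure described here only shows up with the buffer obtained from the driver's mmap().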
After googling for a bit, my second approach was to use a pipe and vmsplice() followed by splice():
ssize_t sent_bytes;
int pipes[2];
struct iovec iov = {
    .iov_base = buf,
    .iov_len = len
};
pipe(pipes);
sent_bytes = vmsplice(pipes[1], &iov, 1, 0);
if (sent_bytes < 0) {
    perror("vmsplice");
    return -1;
}
sent_bytes = splice(pipes[0], NULL, fd, NULL, sent_bytes, SPLICE_F_MOVE);
if (sent_bytes < 0) {
    perror("splice");
    return -1;
}
However, the result is the same: vmsplice: Bad address.
Note that if I replace the call to vmsplice() or send() with a function that just prints the data pointed to by buf (or with a send() without MSG_ZEROCOPY), everything works just fine; so the data is accessible to userspace, but the vmsplice()/send(..., MSG_ZEROCOPY) calls seem unable to handle it.
What am I missing here? Is there any way of using zero-copy TCP sending with a user-space address obtained from a kernel driver through dma_mmap_coherent()? Is there another approach I could use?
UPDATE
So I dove a bit deeper into the sendmsg() MSG_ZEROCOPY path in the kernel, and the call that eventually fails is get_user_pages_fast(). This call returns -EFAULT because check_vma_flags() finds the VM_PFNMAP flag set in the vma. This flag is apparently set when the pages are mapped into user space using remap_pfn_range() or dma_mmap_coherent(). My next approach is to find another way to mmap these pages.
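The VM_PFNMAP flag can also be checked from user space without instrumenting the kernel: /proc/self/smaps lists each mapping's flags on a VmFlags line, where VM_PFNMAP shows up as the two-letter code pf. A minimal sketch, assuming the running kernel exposes VmFlags in smaps; the helper name mapping_has_vmflag is hypothetical:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the mapping containing addr carries the VmFlags code `flag`
 * (e.g. "pf" for VM_PFNMAP), 0 if it does not, and -1 if smaps cannot be
 * read or no mapping with a VmFlags line contains addr. */
int mapping_has_vmflag(const void *addr, const char *flag)
{
    FILE *f = fopen("/proc/self/smaps", "r");
    uintptr_t target = (uintptr_t)addr;
    char line[1024];
    int in_range = 0;
    int result = -1;

    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        unsigned long long start, end;

        /* Mapping header lines start with "start-end" in hex. */
        if (sscanf(line, "%llx-%llx", &start, &end) == 2) {
            in_range = (target >= start && target < end);
        } else if (in_range && strncmp(line, "VmFlags:", 8) == 0) {
            char *tok;

            result = 0;
            for (tok = strtok(line + 8, " \n"); tok; tok = strtok(NULL, " \n")) {
                if (strcmp(tok, flag) == 0) {
                    result = 1;
                    break;
                }
            }
            break;
        }
    }
    fclose(f);
    return result;
}
```

Called as mapping_has_vmflag(buf, "pf") on the pointer returned by the driver's mmap(), this would be expected to return 1 given the findings above, and 0 for ordinary heap or stack memory.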