We need to transfer a lot of data (think Gbps) from Ethernet to Ethernet, modifying it on the fly. The input data (already in RAM) must be gathered from many pieces (payload fragments of many UDP frames) into one contiguous chunk elsewhere in RAM and then sent.
This looks like a great opportunity to use scatter-gather DMA in memory-to-memory mode. But I don't see any way to use a DMA engine from userspace on Linux on x64 CPUs. Am I missing something? Or is the CPU so well optimized for copying data that there is no need for userspace DMA?
We are talking about a generic x64 server architecture with 32 cores.
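For reference, the CPU-only version of the gather step described above could look roughly like the sketch below. It is only an illustration: `gather_copy` and its parameters are hypothetical names, and it assumes the payload fragments have already been located and are described as `struct iovec` entries.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>  /* ssize_t */
#include <sys/uio.h>    /* struct iovec */

/* Hypothetical helper (not from any library): copy each fragment into
 * dst back to back, producing one contiguous chunk.
 * Returns the number of bytes written, or -1 if dst is too small. */
static ssize_t gather_copy(void *dst, size_t dst_cap,
                           const struct iovec *frags, size_t nfrags)
{
    size_t off = 0;

    for (size_t i = 0; i < nfrags; i++) {
        /* off <= dst_cap always holds here, so the subtraction is safe. */
        if (frags[i].iov_len > dst_cap - off)
            return -1;
        memcpy((char *)dst + off, frags[i].iov_base, frags[i].iov_len);
        off += frags[i].iov_len;
    }
    return (ssize_t)off;
}
```

This per-fragment `memcpy` loop is the work a memory-to-memory scatter-gather DMA transfer would take off the CPU.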