
Is there a linux DMA mem-to-mem copy mechanism available to userspace?

I have a Linux application that routinely (50-100 times a second) has to memcpy several megabytes (10+) of data around. Often it's not an issue, but we've begun to see evidence that it may be consuming too much of our CPU bandwidth. Current measurements put the total at something like 1 GB/s being moved around.

I'm aware of the DMA capability in the kernel, and I see a bit of documentation talking about building custom drivers for large memory copies, for this very reason. But it seems someone would have built a generic API for this by now. Am I wrong? Is DMA a kernel-only feature?

I should clarify, this is for Intel X86 architecture, not embedded.

Yeraze
  • It's kernel-only. Being able to perform DMA from user space would be a giant security hole. – nobody May 10 '14 at 11:14
  • @AndrewMedico: How would exposing a DMA memcpy syscall be a giant security hole? Several functions which are readily available (including splice and kernel aio) do DMA on behalf of user space programs. – Damon May 10 '14 at 12:14
  • If you can use memcpy to begin with you clearly have it in the address space of the process you care about - why not just not copy it and side step the whole problem? – Flexo May 10 '14 at 15:12
  • 1
    @Flexo: I have multiple threads each making changes to the space simultaneously, and generating additional copies.. – Yeraze May 11 '14 at 14:08

2 Answers

  • Linux's API for DMA doesn't permit memory-to-memory transfers. It's only for communication between devices and memory. Look in Documentation/DMA-API.txt for more details.

  • At the hardware level, the x86 DMA controller doesn't allow memory-to-memory transfers. It's been discussed here: DMA transfer RAM-to-RAM

  • Given that the memory bus is usually slower than the CPU, what benefit would there be in launching a kernel-driven memory copy? You'd still have to wait for the transfer to finish, and its duration would still be determined by the memory bandwidth, exactly as with a CPU-driven copy.

  • If your program's performance depends solely on memory-to-memory copy performance, it can probably be improved significantly by avoiding copies as much as possible, or by implementing a smarter scheme such as copy-on-write.

Grapsus
  • Thanks, I suspected this was the answer but was having a hard time finding evidence. Thanks! – Yeraze May 11 '14 at 14:09
  • No and no, yes and yes for your four items. DMA Engine may be used for generic memory copy (see the DMA_PRIVATE flag), and on x86 some controllers are capable of doing m2m transfers. But in practice it makes little sense, as you put it in the last items. – 0andriy Dec 18 '21 at 09:24

It sounds like what you're really looking for is copy-on-write semantics. By default no copies are made at all; should any given thread need to change part of the data, a copy of just that page is transparently made at the moment of the write.

Copy-on-write will save you lots if your data is big enough that these memcpy calls are hurting:

  • No duplication of identical data (at the page level) - a reduction to the size of your working set
  • No wasted fetch/store operations until they're actually needed

DMA isn't the solution to that: it's mostly for device-host or device-device communication, and it isn't exposed to ordinary userland processes in a usable way for this.

Instead you can use POSIX shared memory to get this behaviour:

#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>

int main() {
  // Once: create an anonymous shared memory object to hold the master copy
  int fd = shm_open("/cowalloc", O_RDWR|O_CREAT, 0600);
  if (fd == -1) { perror("shm_open"); return 1; }
  shm_unlink("/cowalloc"); // the name is no longer needed; fd keeps the object alive
  if (ftruncate(fd, 1024) == -1) { perror("ftruncate"); return 1; } // This is the size of the COW region
  char *master = mmap(NULL, 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
  if (master == MAP_FAILED) { perror("mmap"); return 1; }

  strcpy(master, "hello world, this is a demonstration of COW behaviour in Linux");

  // Per thread: a private, copy-on-write view of the same object
  char *thread = mmap(NULL, 1024, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_NORESERVE, fd, 0);
  if (thread == MAP_FAILED) { perror("mmap"); return 1; }

  // Demo
  printf("Master: %s\nThread: %s\n", master, thread);
  printf("\nChanging in thread:\n");
  strcpy(thread, "This is a private change");
  printf("Master: %s\nThread: %s\n", master, thread);

  return 0;
}

The basic idea here is that you do all of the global setup of the data (presumably loading from disk/network or a computation) once using MAP_SHARED. Then you can call mmap again with the same file descriptor to make additional, private mappings for every one of your threads that you think might need to write to a local copy.

The use of the MAP_NORESERVE flag here is optional - if you're only changing one page out of thousands in each thread, it might make sense to use it to avoid needlessly reserving lots of swap.

(Note that if you are loading in from disk you can optimise this even further by simply using mmap on the file).

Of course it might be cleaner and more portable to do the COW behaviour at the Object level, for example with a COW smart pointer type.

Flexo