94

I observe substantial speedups in data transfer when I use pinned memory for CUDA data transfers. On linux, the underlying system call for achieving this is mlock. From the man page of mlock, it states that locking the page prevents it from being swapped out:

mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully;

In my tests, I had a fews gigs of free memory on my system so there was never any risk that the memory pages could've been swapped out yet I still observed the speedup. Can anyone explain what's really going on here?, any insight or info is much appreciated.

Gearoid Murphy
  • 11,834
  • 17
  • 68
  • 86

4 Answers4

102

CUDA Driver checks, if the memory range is locked or not and then it will use a different codepath. Locked memory is stored in the physical memory (RAM), so device can fetch it w/o help from CPU (DMA, aka Async copy; device only need list of physical pages). Not-locked memory can generate a page fault on access, and it is stored not only in memory (e.g. it can be in swap), so driver need to access every page of non-locked memory, copy it into pinned buffer and pass it to DMA (Syncronious, page-by-page copy).

As described here http://forums.nvidia.com/index.php?showtopic=164661

host memory used by the asynchronous mem copy call needs to be page locked through cudaMallocHost or cudaHostAlloc.

I can also recommend to check cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. HostAlloc says that cuda driver can detect pinned memory:

The driver tracks the virtual memory ranges allocated with this(cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().

Daniel Galasko
  • 23,617
  • 8
  • 77
  • 97
osgx
  • 90,338
  • 53
  • 357
  • 513
  • 2
    I wonder how much havok you can create by having another thread try to munlock the pages after issuing the asynchronous copy commands? – Zan Lynx Sep 20 '13 at 22:46
  • 1
    Zan Lynx, Interesting question. Why you are wanting to unlock this memory? There may be up to 2-4 GB of memory locked even on 32-bit PC, and more when PCI-express card has access to 64-bit (in real 40 or 48bit) adressing. It is much-much cheaper to buy more memory then to pay for highly expirenced (18k rep! on SO) programmer. As in Linux I think (believe), munlock will be blocked or will return error, and no damage to the system will be inflicked. – osgx Sep 20 '13 at 23:06
  • Can I apply `cudaHostRegister` to the pointer to memory mapped file? – Tomilov Anatoliy May 16 '18 at 12:28
26

CUDA use DMA to transfer pinned memory to GPU. Pageable host memory cannot be used with DMA because they may reside on the disk. If the memory is not pinned (i.e. page-locked), it's first copied to a page-locked "staging" buffer and then copied to GPU through DMA. So using the pinned memory you save the time to copy from pageable host memory to page-locked host memory.

Shen Yang
  • 477
  • 5
  • 10
5

If the memory pages had not been accessed yet, they were probably never swapped in to begin with. In particular, newly allocated pages will be virtual copies of the universal "zero page" and don't have a physical instantiation until they're written to. New maps of files on disk will likewise remain purely on disk until they're read or written.

R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • I think it is not a case (I write about it in early variant of my answer), as it is a real program and the `mlock()` was fast in the program (check comment #2 to the Q). – osgx Apr 20 '11 at 22:59
0

A verbose note on copying non-locked pages to locked pages.

It could be extremely expensive if non-locked pages are swapped out by OS on a busy system with limited CPU RAM. Then page fault will be triggered to load pages into CPU RAM through expensive disk IO operations.

Pinning pages can also cause virtual memory thrashing on a system where CPU RAM is precious. If thrashing happens, the throughput of CPU can be degraded a lot.

Izana
  • 2,537
  • 27
  • 33