
If I use DMA for RAM <-> GPU transfers in CUDA C++, how can I be sure that the memory will be read from the pinned (page-locked) RAM, and not from the CPU cache?

After all, with DMA the CPU does not know that someone else has changed the memory, or that the CPU caches need to be synchronized with RAM. And as far as I know, a C++11 memory fence (std::atomic_thread_fence) does not help with DMA: it will not force a read from RAM, it only enforces ordering between the L1/L2/L3 caches. Furthermore, in general there is no protocol for resolving conflicts between the cache and RAM on the CPU side, only protocols that keep the different CPU cache levels L1/L2/L3 and multiple CPUs in a NUMA system coherent: MOESI / MESIF.

Alex

1 Answer

3

On x86, the CPU snoops bus traffic, so this is not a concern. On Sandy Bridge-class CPUs, the PCI Express controller is integrated into the CPU, so the CPU actually can service GPU reads from its L3 cache, or update its cache based on writes by the GPU.

ArchaeaSoftware
  • And does this happen all the time? Are conflicts always resolved in favor of the latest change: the L3 cache line goes into a Forward (Intel) / Owned (AMD) state and then a memory barrier is initiated to sync L3 with the L1/L2 caches? – Alex Aug 19 '12 at 18:18
  • You can learn about memory transaction ordering and the deadlock avoidance rules on PCI Express from "PCI System Architecture, Fourth Edition" (Addison-Wesley). As far as memory barriers: in CUDA, all pending CPU writes are posted before any work is requested of the GPU (a side effect of writing to the MMIO register to kick off a DMA operation); and if you synchronize with the GPU, all writes pending as of the synchronization request are posted before the wait is resolved. The net result is that things "work as expected." – ArchaeaSoftware Aug 19 '12 at 19:22
  • Thanks for the interesting answer! So CUDA automatically lets me avoid deadlocks and synchronization conflicts, and I don't need to worry? – Alex Aug 19 '12 at 23:05
  • It is not automatic; when using asynchronous memcpy's and kernel launches (esp. kernel launches that operate on mapped pinned memory) you have to be careful to properly synchronize to avoid race conditions. But as far as cache coherency goes, yeah, not much to worry about as long as the host memory is marked as cacheable (i.e. not write-combined). – ArchaeaSoftware Aug 19 '12 at 23:12
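A minimal sketch of the pattern described above, assuming the CUDA toolkit (build with nvcc; the names and sizes here are illustrative, not from the original discussion):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Pinned, cacheable host memory: the GPU can DMA it directly, and on
    // x86 the CPU caches stay coherent with that DMA via bus snooping.
    float *h = nullptr;
    cudaHostAlloc(&h, bytes, cudaHostAllocDefault);
    // cudaHostAllocWriteCombined would bypass the CPU cache instead: fast
    // for CPU-write/GPU-read buffers, but CPU reads from it are very slow.

    float *d = nullptr;
    cudaMalloc(&d, bytes);
    cudaStream_t s;
    cudaStreamCreate(&s);

    for (int i = 0; i < n; ++i) h[i] = float(i);   // CPU writes (cached)

    // Kicking off the DMA posts all pending CPU stores first, so the GPU
    // sees the initialized data; no explicit cache flush is needed.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, s);

    // Race window: until this returns, the GPU may still be writing h.
    cudaStreamSynchronize(s);
    printf("h[1] = %f\n", h[1]);

    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

The synchronization burden is on the host code: the asynchronous copies are only safe to observe (or overwrite) after the stream synchronize.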
  • To simplify as much as possible, would this simple rule work? Create two page-locked buffers in RAM: 1. the CPU only writes to the first buffer and the GPU only reads from it; 2. the GPU only writes to the second buffer and the CPU only reads from it – Alex Aug 19 '12 at 23:57
  • 1
    Sure, that is how CUDA implements pageable memcpy. If you do the synchronization right, you can keep both CPU and GPU as busy as possible, ping-ponging between the producer and consumer buffers. – ArchaeaSoftware Aug 20 '12 at 00:32
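The ping-ponging described above can be sketched as follows, again assuming the CUDA toolkit (a host-to-device upload that double-buffers through two pinned staging buffers; function and variable names are illustrative):

```cpp
#include <cuda_runtime.h>
#include <cstring>

// Double-buffered upload of pageable memory: while the GPU DMA-reads one
// pinned staging buffer, the CPU fills the other. This mirrors the scheme
// CUDA itself uses to implement pageable memcpy.
void upload(const char *src, char *dev, size_t total, size_t chunk) {
    char *stage[2];
    cudaHostAlloc(&stage[0], chunk, cudaHostAllocDefault);
    cudaHostAlloc(&stage[1], chunk, cudaHostAllocDefault);
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    int buf = 0;
    for (size_t off = 0; off < total; off += chunk, buf ^= 1) {
        size_t n = (total - off < chunk) ? total - off : chunk;
        // Wait until the GPU has finished DMA-reading this staging buffer
        // before the CPU overwrites it; this is the race to avoid.
        cudaStreamSynchronize(s[buf]);
        memcpy(stage[buf], src + off, n);            // CPU writes only
        cudaMemcpyAsync(dev + off, stage[buf], n,    // GPU reads only
                        cudaMemcpyHostToDevice, s[buf]);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFreeHost(stage[0]);
    cudaFreeHost(stage[1]);
}
```

Each staging buffer follows the one-writer/one-reader rule from the comment above, with the per-stream synchronize marking the hand-off.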
  • During DMA operations, can data be read from the L3 cache both with and without the translation lookaside buffer (TLB)? Am I right that the GPU and the CPU's last-level cache (LLC/L3) always use physical CPU-RAM addresses, even with CUDA UVA (Unified Virtual Addressing)? – Alex Jan 01 '13 at 12:57
  • 1
    TLBs are special caches designed to aid in address translation, so they don't get involved in servicing bus traffic. (A notable exception is if there is an IOMMU in the system; it does an extra address translation step to protect hypervisors and other privileged memory from being corrupted by DMA. If present, IOMMUs do contain TLBs to aid in the guest physical->host physical address translation.) – ArchaeaSoftware Jan 01 '13 at 18:57
  • Thanks for the interesting knowledge! But if DMA (between RAM and L3) does not use the TLB, do the functions mlock()/mlockall(), besides preventing paging, also arrange all the locked pages strictly sequentially in physical memory? (A single block allocated with new/malloc can be physically fragmented behind its virtual mapping, and without a contiguous layout it could not be used for DMA, could it?) – Alex Jan 02 '13 at 10:45