Intel processors with the server uncore design, starting with Sandy Bridge, support Data Direct I/O (DDIO), which is enabled by default. With DDIO, an inbound PCIe write targeting a system memory location of type WB (write-back) is an allocating write transaction.
For a full write (one that writes an entire cache line), the IIO first obtains ownership of the target cache line by invalidating all copies in the coherence domain, except in the L3 of the NUMA node to which the originating device is attached. If the line is not already present in that L3, an entry is allocated, which may require evicting another line to make space. The write is performed in the L3 and the coherence state of the line becomes M (Modified); the data is not sent to the memory controller to which its address is mapped. Partial writes are buffered in the IIO (which is in the coherence domain) until they are eventually evicted and written into the LLC (allocate or update). In DDIO, reads are never allocating.
Even if DDIO is disabled, PCIe writes can be buffered in the IIO. When cudaMemcpyAsync (or even cudaMemcpy) returns, there is no guarantee that all writes have reached the persistence domain on Intel processors (unless you have Whole System Persistence). In addition, the memory copy is not guaranteed to be persistently atomic, and there is no guarantee of the order in which the bytes will move from the IIO to the target memory controllers. You need a flag to tell you whether the entire data was persisted or not.
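As an illustration of how such a flag is consumed, recovery code should trust the copied data only if the flag was made persistent after the data itself. The record layout and names below are purely hypothetical.

#include <cstdint>
#include <cstddef>

// Hypothetical on-media layout: a size field, a valid flag, and the payload
// following them in the persistent-memory-mapped file.
struct PersistentRecord {
    size_t  nbytes;    // size of the payload that follows
    uint8_t valid;     // written and flushed only after the payload is persistent
};

bool record_is_usable(const PersistentRecord* rec) {
    // If the flag never reached the persistence domain before a crash or power
    // failure, the payload may be partially written, and in an arbitrary order,
    // so it must be discarded.
    return rec->valid == 1;
}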
You can use a barrier (cudaStreamSynchronize() or cudaDeviceSynchronize()) to wait on the host until the data copy operation is complete, and then flush each cache line, followed by writing a flag, in that order.
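Below is a minimal host-side sketch of that sequence, assuming the destination buffer lives in persistent memory mapped into the host address space (for example, a DAX-mapped file) and pinned with cudaHostRegister so the GPU can write to it over PCIe. The names copy_and_persist, flush_to_persistence, pmem_dst, and valid_flag are illustrative; _mm_clwb can be replaced with _mm_clflushopt on CPUs without CLWB support.

#include <cuda_runtime.h>
#include <immintrin.h>   // _mm_clwb, _mm_sfence (compile with -mclwb)
#include <cstdint>
#include <cstddef>

constexpr size_t CACHE_LINE = 64;

// Flush every cache line covering [addr, addr + len), then fence so the
// flushed data reaches the persistence domain before anything ordered
// after the fence.
static void flush_to_persistence(void* addr, size_t len) {
    uintptr_t p   = reinterpret_cast<uintptr_t>(addr) & ~(CACHE_LINE - 1);
    uintptr_t end = reinterpret_cast<uintptr_t>(addr) + len;
    for (; p < end; p += CACHE_LINE)
        _mm_clwb(reinterpret_cast<void*>(p));
    _mm_sfence();
}

void copy_and_persist(void* pmem_dst, const void* gpu_src, size_t nbytes,
                      volatile uint8_t* valid_flag, cudaStream_t stream) {
    // 1. Copy device data into the persistent-memory buffer.
    cudaMemcpyAsync(pmem_dst, gpu_src, nbytes, cudaMemcpyDeviceToHost, stream);

    // 2. Barrier: wait on the host until the copy is complete.
    cudaStreamSynchronize(stream);

    // 3. Flush each cache line of the copied range; the data may still be
    //    sitting in the LLC (or the IIO) rather than in persistent media.
    flush_to_persistence(pmem_dst, nbytes);

    // 4. Only now write the flag, then flush and fence it as well, so the
    //    flag can never become persistent before the data it guards.
    *valid_flag = 1;
    flush_to_persistence(const_cast<uint8_t*>(valid_flag), sizeof(*valid_flag));
}

Keeping the flag write strictly after the data flush is what makes the flag meaningful: a reader that observes the persisted flag can conclude the entire payload reached the persistence domain.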