Unbalanced Memory Read & Write in CUDA

Question

I noticed an unbalanced memory read and write amount when profiling the underneath cuda kernel using ncu.

__global__ void kernel(void* mem, int n) {
    int* ptr = reinterpret_cast<int*>(mem);

    for (int offset = (threadIdx.x + blockIdx.x * blockDim.x)*32; offset < n; offset += blockDim.x * gridDim.x * 32) {
        #pragma unroll
        for (int i = 0; i < 16; i++) {
            ptr[offset + i] = ptr[offset + i + 16];
        }
    }
}

int main() {
    int* mem;

    int N = 1024 * 256 * 256;
    cudaMalloc((void**)&mem, sizeof(int) * N);
    cudaMemset(mem, 0, sizeof(int) * N);

    kernel<<<8192, 256>>>(mem, N);

    cudaFree(mem);

    return 0;
}

In ncu, it tells me that memory read is 305 MB while memory write is 1.07GB. I understand that there is global memory coalescing, but shouldn’t the memory read and write both be equal to approximately 1GB, instead of only 305 MB memory read? And even if there is no global memory coalescing for memory read, shouldn’t the memory read amount be equal to around 128MB?

Thanks.

crosspost: https://forums.developer.nvidia.com/t/unbalanced-memory-read-write/258248 — Abator Abetor, Jun 29 '23 at 10:30

score 1 · Answer 1 · answered Jul 01 '23 at 20:53

shouldn’t the memory read and write both be equal to approximately 1GB, instead of only 305 MB memory read?

The traffic you have identified:

it tells me that memory read is 305 MB while memory write is 1.07GB.

is actually traffic between the L1 and L2 cache.

The GPU L1 cache is typically described as "write-through" (e.g. slide 43). This can result in a significant “imbalance” in L1<->L2 traffic for a “balanced” read/write code: writes have the potential to trigger traffic to the L2 on each write, reads have the potential to hit in L1, therefore not generating corresponding traffic to the L2.

shouldn’t the memory read amount be equal to around 128MB?

The traffic from L1 to L2 is higher than the actual memory traffic because the L1 cache is relatively small, and cannot contain the entire memory footprint of your code. Your code has a dynamic footprint much higher than necessary to do the actual work you are doing, because of the uncoalesced access pattern and inefficient usage of memory resources. Therefore the L1 to L2 traffic can be much higher than 128MB.

With respect to the L2 to memory traffic, depending on your GPU, the L2 may also be smaller than 128MB. In this case, again, having a larger than necessary dynamic footprint (the memory being touched based on the warps in flight) coupled with inefficient memory usage means that effectively you can thrash the L2 as well, resulting in higher than necessary traffic to memory.

Unbalanced Memory Read & Write in CUDA

1 Answers1