
Can anyone help me understand the performance difference between the memCopy2dA and memCopy2dB kernels?

They are both supposed to copy 2D data of size xLen × yLen from one buffer to another, but they use different strategies:

  • when memCopy2dA is used, blocks/threads cover the whole 2D space, since each thread in this kernel is supposed to copy only one data point

  • when memCopy2dB is used, blocks/threads are created only for one whole X row, and then each thread loops over the Y direction to copy all the data.

According to the profiler (nvvp), in both cases the global memory access pattern is 100% efficient, and the X dimension is big enough to saturate the device for the "B" kernel (Titan X, 24 SMs). Unfortunately the "B" kernel is slower, and on my machine the result is:

GB/s: 270.715
GB/s: 224.405

Additional question: Is it even possible to get close to the theoretical memory bandwidth limit, which is 336.48 GB/s (3505 MHz * 384 bits * 2 / 8)? At least my tests always show a maximum of around 271-272 GB/s.
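For clarity, here is the arithmetic behind that 336.48 GB/s figure as a small standalone check (the 3505 MHz memory clock and 384-bit bus are the Titan X figures quoted above; this is just the calculation, not a measurement):

#include <cstdio>

int main() {
    double memClockHz   = 3505.0e6; // memory clock
    double busWidthBits = 384.0;    // memory bus width
    double ddrFactor    = 2.0;      // two transfers per clock (DDR)
    double theoreticalBytesPerSec = memClockHz * busWidthBits * ddrFactor / 8.0;
    std::printf("theoretical GB/s: %.2f\n", theoreticalBytesPerSec / 1e9); // prints 336.48
    return 0;
}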

Test code:

#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <iostream>
#include <chrono>
#include <cstdio>   // printf in waitForCuda

template<typename T>
__global__ void memCopy2dA(T *in, T *out, size_t xLen, size_t yLen) {
    int xi = blockIdx.x * blockDim.x + threadIdx.x;
    int yi = blockIdx.y * blockDim.y + threadIdx.y;
    if (xi < xLen && yi < yLen) {
        out[yi * xLen + xi] = in[yi * xLen + xi];
    }
}

template<typename T>
__global__ void memCopy2dB(T *in, T *out, size_t xLen, size_t yLen) {
    int xi = blockIdx.x * blockDim.x + threadIdx.x;
    if (xi < xLen) {
        size_t idx = xi;
        for (int y = 0; y < yLen; ++y) {
            out[idx] = in[idx];
            idx += xLen;
        }
    }
}

static void waitForCuda() {
    cudaDeviceSynchronize();
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(err));
}

int main() {
    typedef float T;

    size_t xLen = 24 * 32 * 64; //49152
    size_t yLen = 1024;
    size_t dataSize = xLen * yLen * sizeof(T);

    T *dInput;
    cudaMalloc(&dInput, dataSize);
    T *dOutput;
    cudaMalloc(&dOutput, dataSize);

    const int numOfRepetitions = 100;
    double gigabyte = 1000 * 1000 * 1000;
    {
        dim3 threadsPerBlock(64, 1);
        dim3 numBlocks((xLen + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (yLen + threadsPerBlock.y - 1) / threadsPerBlock.y);

        auto startTime = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < numOfRepetitions; ++i) {
            memCopy2dA <<< numBlocks, threadsPerBlock >>> (dInput, dOutput, xLen, yLen);
            waitForCuda();
        }
        auto stopTime = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed = stopTime - startTime;
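        // 2 * dataSize: each iteration reads dataSize bytes and writes dataSize bytes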
        std::cout << "GB/s: " << (2 * dataSize * numOfRepetitions) / elapsed.count() / gigabyte << std::endl;
    }
    {
        dim3 threadsPerBlock(64);
        dim3 numBlocks((xLen + threadsPerBlock.x - 1) / threadsPerBlock.x);

        auto startTime = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < numOfRepetitions; ++i) {
            memCopy2dB <<< numBlocks, threadsPerBlock >>> (dInput, dOutput, xLen, yLen);
            waitForCuda();
        }
        auto stopTime = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed = stopTime - startTime;
        std::cout << "GB/s: " << ((2 * dataSize * numOfRepetitions) / elapsed.count()) / gigabyte << std::endl;
    }

    cudaFree(dInput);
    cudaFree(dOutput);

    return 0;
}

compiled with:

nvcc -std=c++11 memTest.cu -o memTest
Krzysztof
  • The first method most likely has better cache locality. The GPU most likely sees that you're accessing sequential memory addresses and can pre-load the memory cache with the needed values. Since the second kernel fetches memory from all over, the cache preloading does not have as great an effect. – Paul Belanger Apr 24 '18 at 12:20
  • According to my understanding, both kernels are doing nicely aligned, coalesced memory reads/writes. In kernel "B" each warp has 32 threads working with 4-byte (float) elements, doing 128 bytes of RD/WR at a time. Also, the test data dimensions are "nice" - everything fits perfectly without idle threads or so. That is why I'm confused. – Krzysztof Apr 24 '18 at 13:43
  • There could be a level of memory prefetching on top of the coalesced writes. The GPU can assume that since you've just requested these bytes, you're probably going to want the ones next to them pretty soon, so it can preload the cache for you. – Paul Belanger Apr 24 '18 at 15:10
  • what CUDA version is this? regarding the "additional question": the `bandwidthTest` sample code is usually a good proxy for achievable bandwidth for a copy kernel like this. there are at least 2 limiters to achieving peak theoretical bandwidth. 1. The data transfer has overhead - some bits/DRAM cycles are transferred across the bus that are not actually user data. 2. A copy kernel like this is "turning the bus around" frequently, from reads to writes. This turnaround time subtracts from achievable bandwidth. – Robert Crovella Apr 24 '18 at 15:58
  • You may want to verify and try using gld128 instructions to better saturate the bus; 32-bit loads, even coalesced, may not saturate it. Also using the restrict keyword to enforce (or explicitly using) ldg instructions may help saturate bandwidth, especially with concurrent writes (see the sketch after these comments). – Florent DUGUET Apr 24 '18 at 17:03
  • @RobertCrovella: CUDA 9.1.85, bandwidthTest shows 266688 MB/s for device-to-device bandwidth. Well, changing the direction of operations on the bus sounds interesting - nevertheless I have never found any information about it (or the impact it may cause). The same applies to Paul's thoughts - I understand that it 'could' cause some impact if GPUs have some hardware streaming prefetchers (predicting what data could be read next) but... I have never heard about them - all the info I have read is about aligned and coalesced memory operations. Thanks for some ideas what to search for! – Krzysztof Apr 24 '18 at 17:07
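For reference, here is a minimal sketch of what Florent's suggestion could look like; it is only an illustration, assuming xLen is divisible by 4 and the buffers are 16-byte aligned (the vectorized kernel name, the __restrict__ qualifiers, and the __ldg call are not part of the original test code):

// Variant of memCopy2dB that moves one float4 (128 bits) per thread per row.
__global__ void memCopy2dB_vec4(const float4 *__restrict__ in,
                                float4 *__restrict__ out,
                                size_t xLen4, size_t yLen) {
    int xi = blockIdx.x * blockDim.x + threadIdx.x;
    if (xi < xLen4) {
        size_t idx = xi;
        for (int y = 0; y < yLen; ++y) {
            out[idx] = __ldg(&in[idx]);  // 128-bit load via the read-only data cache
            idx += xLen4;
        }
    }
}

It would be launched like memCopy2dB, but with xLen / 4 passed as xLen4 and the float pointers reinterpret_cast to float4*. Note that __ldg requires building for compute capability 3.5 or later (e.g. nvcc -arch=sm_52 ...).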

1 Answer


I found a way to speed up the memCopy2dB kernel. Here are tests performed on a 1080 Ti (the TITAN X is not available to me anymore). The code from the question yields the following results:

GB/s: 365.423
GB/s: 296.678

which is more or less the same percentage difference as observed earlier on the Titan X. The modified memCopy2dB kernel now looks like this:

template<typename T>
__global__ void memCopy2dB(T *in, T *out, size_t xLen, size_t yLen) {
    int xi = blockIdx.x * blockDim.x + threadIdx.x;
    if (xi < xLen) {
        size_t idx = xi;
        for (int y = 0; y < yLen; ++y) {
            __syncthreads();  // <------ this line added
            out[idx] = in[idx];
            idx += xLen;
        }
    }
}

There is a lot of information about how important coalesced memory operations are at the warp level, where all threads in a warp should access the same aligned segments of memory. But it seems that synchronizing the warps in a block makes coalescing possible at the inter-warp level as well, perhaps making better use of the memory bus width on different GPUs <- this is just my "explanation" of this problem, since I could not find any literature on it.

Anyway, adding this one unneeded line (since from the code logic I do not need to synchronize the warps) gives me the following results for both kernels:

GB/s: 365.255
GB/s: 352.026

So even though the code execution is slowed down by the synchronization, we get much better results. I have tried this technique on some of my own code which processes data in the memCopy2dB access pattern, and it gave me a nice speedup.

Krzysztof