Accessing global memory in CUDA is slow

Question

I have a CUDA kernel doing some computation on a local variable (in register), and after it gets computed, its value gets written into a global array p:

__global__ void dd( float* p, int dimX, int dimY, int dimZ  )
{
    int 
        i = blockIdx.x*blockDim.x + threadIdx.x,
        j = blockIdx.y*blockDim.y + threadIdx.y,
        k = blockIdx.z*blockDim.z + threadIdx.z,
        idx = j*dimX*dimY + j*dimX +i;   

    if (i >= dimX || j >= dimY || k >= dimZ)
    {
        return;
    }   

    float val = 0;

    val = SomeComputationOnVal();

     p[idx ]=  val;
    __syncthreads();    

}

Unfortunately, this function executes very slow.

However, it runs very fast if I do this:

  __global__ void dd(   float* p, int dimX, int dimY, int dimZ  )
    {
        int 
            i = blockIdx.x*blockDim.x + threadIdx.x,
            j = blockIdx.y*blockDim.y + threadIdx.y,
            k = blockIdx.z*blockDim.z + threadIdx.z,
            idx = j*dimX*dimY + j*dimX +i;   

        if (i >= dimX || j >= dimY || k >= dimZ)
        {
            return;
        }   

        float val = 0;

        //val = SomeComputationOnVal();

         p[idx ]=  val;
        __syncthreads();    

    }

It also runs very fast if I do this:

__global__ void dd( float* p, int dimX, int dimY, int dimZ  )
{
    int 
        i = blockIdx.x*blockDim.x + threadIdx.x,
        j = blockIdx.y*blockDim.y + threadIdx.y,
        k = blockIdx.z*blockDim.z + threadIdx.z,
        idx = j*dimX*dimY + j*dimX +i;   

    if (i >= dimX || j >= dimY || k >= dimZ)
    {
        return;
    }   

    float val = 0;

    val = SomeComputationOnVal();

  //   p[idx ]=  val;
    __syncthreads();    

}

So I am confused, and have no idea how to solve this problem. I have used NSight step in, and did not find access violations.

Here is how I launch the kernel (dimX:924; dimY: 16: dimZ: 1120):

dim3 
      blockSize(8,16,2),
      gridSize(dimX/blockSize.x+1,dimY/blockSize.y, dimZ/blockSize.z);
float* dev_p;       cudaMalloc((void**)&dev_p, dimX*dimY*dimZ*sizeof(float));

dd<<<gridSize, blockSize>>>(     dev_p,dimX,dimY,dimZ);

Could anyone please gives some pointers? Because it does not make much sense to me. All computation of val is fast, and the final step is to move val into p. p never gets involved in the computation, and it only shows up once. So why is it so slow?

The computations are basically a loop over a 512 X 512 matrix. It is pretty fair amount of computation I'd say.

Are you sure it is the global memory access that takes "so long" (you don't specify times) ? By commenting the store in the global array, val is never used after setting it to 0 and then calling your "computation". The compiler might determine it is useless to do this work and might ignore this while compiling your sources. — Taro, May 24 '16 at 15:35
@Taro Really? Then how do I know if val really gets calculated regard less writing of global memory is called or not? Sorry, I did not time it, but so long means tens of minutes at least. I did not even finish it. — Nick X Tsui, May 24 '16 at 15:38
Yes. It depends on a lot of factors, but this is a "classic" optimization for most compilers. See my answer here on how to view generated PTX/SASS code and compare it along with the sources : http://stackoverflow.com/a/36477199/6172231 — Taro, May 24 '16 at 15:40
It depends on the amount of work you do in `SomeComputationOnVal`. In your last example, since val is not even used and `SomeComputationOnVal` has no parameter (no boundary effect probably), it gets optimized out. So your last sample does not measure `SomeComputationOnVal`. BTW, could you provide the contents of that function ? — Florent DUGUET, May 24 '16 at 15:56
@Taro I did what you wrote in the thread, but cannot find the report. I am working on a dll, and did not see any report open. — Nick X Tsui, May 24 '16 at 16:04

score 0 · Accepted Answer · answered May 24 '16 at 17:54

The computations you perform in the SomeComputationOnVal are extremely expensive. Each thread reads at least 1MB of data which is off cache (or in L2 at best for a small part should k vary in a small range) which totals for your run about 16 TB of data. Even on a high end gpu, it would take about 2 minutes to run, at the minimum. Not to mention everything that could slow this down.

Your function does not write any data in global memory and has no boundary effect. The compiler may decide to optimize out the method call should you not use the output.

Hence cases two and three not doing calculation are very fast. Writing 64 MB on gpu memory, with coesced threads is very fast (milliseconds range).

You can verify the generated ptx to see if code gets optimized out. Use the --keep option in nvcc and search for ptx files.

Accessing global memory in CUDA is slow

1 Answers1