
I'm a student learning CUDA, and I would like to optimize the execution time of my kernel function. To that end, I wrote a short program that computes the difference between two pictures, and I compared the execution time of a classic CPU implementation in C against a GPU implementation in CUDA C.

Here you can find the code I'm talking about:

int *imgresult_data = (int *) malloc(width*height*sizeof(int));
int size = width*height;

switch(computing_type)
{

    case GPU:

    HANDLE_ERROR(cudaMalloc((void**)&dev_data1, size*sizeof(unsigned char)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_data2, size*sizeof(unsigned char)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_data_res, size*sizeof(int)));

    HANDLE_ERROR(cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice)); 
    HANDLE_ERROR(cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_data_res, imgresult_data, size*sizeof(int), cudaMemcpyHostToDevice));

    float time;
    cudaEvent_t start, stop;

    HANDLE_ERROR( cudaEventCreate(&start) );
    HANDLE_ERROR( cudaEventCreate(&stop) );
    HANDLE_ERROR( cudaEventRecord(start, 0) );

    for(int m = 0; m < nb_loops ; m++)
    {
        diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);
    }

    HANDLE_ERROR( cudaEventRecord(stop, 0) );
    HANDLE_ERROR( cudaEventSynchronize(stop) );
    HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );

    HANDLE_ERROR(cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost));

    printf("Time to generate:  %4.4f ms \n", time/nb_loops);

    break;

    case CPU:

    clock_t begin = clock(), diff;

    for (int z=0; z<nb_loops; z++)
    {
        // Apply the difference between 2 images
        for (int i = 0; i < height; i++)
        {
            tmp = i*imgresult_pitch;
            for (int j = 0; j < width; j++)
            {
                imgresult_data[j + tmp] = (int) img2_data[j + tmp] - (int) img1_data[j + tmp];
            }
        }
    }
    diff = clock() - begin;

    float msec = diff*1000/CLOCKS_PER_SEC;
    msec = msec/nb_loops;
    printf("Time taken %4.4f milliseconds", msec);

    break;
}

And here is my kernel function:

__global__ void diff(unsigned char *data1 ,unsigned char *data2, int *data_res)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row*blockDim.x;

    if (row < MAX_H && col < MAX_W)
    {
        data_res[v] = (int) data2[v] - (int) data1[v];
    }
}

I obtained these execution times for each one:

  • CPU: 1.3210 ms
  • GPU: 0.3229 ms

I wonder why the GPU result is not as low as it should be. I am a beginner in CUDA, so please be understanding if I've made some classic errors.

EDIT1: Thank you for your feedback. I tried removing the 'if' condition from the kernel, but it didn't change my program's execution time much.

However, after installing the CUDA profiler, it told me that my threads weren't running concurrently. I don't understand why I get this kind of message, but it seems true, because my application is only 5 or 6 times faster with the GPU than with the CPU. This ratio should be greater, because each thread is supposed to process one pixel concurrently with all the others. If you have an idea of what I am doing wrong, it would be helpful...

Flow.


3 Answers


Here are a few things you could do which may improve the performance of your diff kernel:

1. Let each thread do more work

In your kernel, each thread handles just a single element; but having a thread do anything at all already carries a bunch of overhead, at the block and the thread level, including obtaining the parameters, checking the condition and doing address arithmetic. Now, you could say "Oh, but the reads and writes take much more time than that; this overhead is negligible" - but you would be ignoring the fact that the latency of these reads and writes is hidden by the presence of many other warps which may be scheduled to do their work.

So, let each thread process more than a single element. Say, 4, as each thread can easily read 4 bytes at once into a register. Or even 8 or 16; experiment with it. Of course you'll need to adjust your grid and block parameters accordingly.
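As a minimal sketch, a 4-elements-per-thread variant might look like the following (the kernel name diff4, the num_pixels parameter, and the assumption that the pixel count is a multiple of 4 are mine, not from the question):

__global__ void diff4(const unsigned char *data1, const unsigned char *data2,
                      int *data_res, int num_pixels)
{
    // Each thread covers 4 consecutive pixels. Launch with e.g.
    // diff4<<<(num_pixels/4 + 127)/128, 128>>>(...) so the whole image is covered.
    int base = 4 * (threadIdx.x + blockIdx.x * blockDim.x);
    if (base + 3 < num_pixels)
    {
        #pragma unroll
        for (int k = 0; k < 4; ++k)
            data_res[base + k] = (int) data2[base + k] - (int) data1[base + k];
    }
}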

2. "Restrict" your pointers

__restrict__ is not part of standard C++, but it is supported in CUDA. It tells the compiler that accesses through the different pointers passed to the function never overlap; see the description of the __restrict__ keyword in the CUDA C Programming Guide.

Using it allows the CUDA compiler to apply additional optimizations, e.g. loading or storing data via the non-coherent cache. Indeed, this applies to your kernel, although I haven't measured the effects.
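As a sketch, your kernel with const-qualified, restricted parameters would look like this (same body as before; MAX_H and MAX_W are your original constants):

__global__ void diff(const unsigned char * __restrict__ data1,
                     const unsigned char * __restrict__ data2,
                     int * __restrict__ data_res)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row*blockDim.x;

    if (row < MAX_H && col < MAX_W)
    {
        // With const + __restrict__, the compiler may route these loads
        // through the read-only data cache.
        data_res[v] = (int) data2[v] - (int) data1[v];
    }
}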

3. Consider using a "SIMD" instruction

CUDA offers this intrinsic:

__device__ unsigned int __vsubss4 ( unsigned int a, unsigned int b )

which subtracts each signed byte value in b from its corresponding one in a, with signed saturation. If you can "live" with that packed result, rather than expecting a larger int per pixel, it could save you some work - and it goes very well with increasing the number of elements per thread. In fact, it might let you increase it even further to get to the optimum.
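Here is a minimal sketch of how that might combine with the elements-per-thread idea above. Note the change of types (an assumption on my part): the buffers are read as packed 32-bit words, and each output word holds four saturated signed-byte differences rather than one int per pixel; the kernel name and the num_words parameter are also mine.

__global__ void diff_simd(const unsigned int * __restrict__ data1,
                          const unsigned int * __restrict__ data2,
                          unsigned int * __restrict__ data_res,
                          int num_words)
{
    // Each 32-bit word packs 4 pixels; assumes the pixel count is a
    // multiple of 4 and the buffers are 4-byte aligned.
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < num_words)
    {
        // Per-byte (data2 - data1) with signed saturation.
        data_res[i] = __vsubss4(data2[i], data1[i]);
    }
}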

einpoklum

I don't think you are measuring times correctly; memory copy is a time-consuming step on the GPU that you should take into account when measuring your time.

I see some details that you can test:

  1. I suppose you are using MAX_H and MAX_W as constants; you may consider defining them in constant memory with cudaMemcpyToSymbol().

  2. Remember to sync your threads using __syncthreads(), so you don't get issues between each loop iteration.

  3. CUDA works with warps, so the block count and the number of threads per block work better as multiples of 8, but not larger than 512 threads per block unless your hardware supports it. Here is an example using 128 threads per block (see the sketch after this list): <<<(cols*rows+127)/128,128>>>.

  4. Remember as well to free the memory you allocated on the GPU and to destroy the timing events you created.

  5. In your kernel function you can use a single index variable: int v = threadIdx.x + blockIdx.x * blockDim.x.

  6. Have you tested, besides the execution time, that your result is correct? I think you should use cudaMallocPitch() and cudaMemcpy2D() while working with arrays, due to padding.
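As a rough sketch of points 3 and 5 together, a flattened 1D launch might look like this (the diff_1d name and the num_pixels parameter are mine, not from the question):

__global__ void diff_1d(const unsigned char *data1, const unsigned char *data2,
                        int *data_res, int num_pixels)
{
    // Single flattened index over all pixels, with a bounds check for the
    // last, partially filled block.
    int v = threadIdx.x + blockIdx.x * blockDim.x;
    if (v < num_pixels)
        data_res[v] = (int) data2[v] - (int) data1[v];
}

// Host side, with 128 threads per block as in point 3:
// diff_1d<<<(cols*rows + 127)/128, 128>>>(dev_data1, dev_data2, dev_data_res, cols*rows);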

C. Jaraque
  • 1. Compiler constants are almost always better than using constant memory. 2. There are no loops in the kernel, and no situation where it would make sense to use `__syncthreads()` 3. All current CUDA hardware (CUDA 7.0 and CUDA 7.5) supports 1024 threads per block, and the threads per block should be a multiple of **32**, not **8**. 4. It's certainly good practice to free memory and destroy events, but it has no bearing on the issue in this question. 5. the compiler will figure all this out, and optimize it. 6. pitched allocations rarely show benefit on current (cc2.0 and higher) hardware. – Robert Crovella Jul 19 '16 at 19:37

Probably there are other issues with the code, but here's what I see. The following lines in __global__ void diff are not optimal:

if (row < MAX_H && col < MAX_W)
{
    data_res[v] = (int) data2[v] - (int) data1[v];
}

Conditional operators inside a kernel result in warp divergence. It means that the if and else parts within a warp are executed in sequence, not in parallel. Also, as you might have realized, the if evaluates to false only at the borders. To avoid the divergence and the needless computation, split your image in two parts:

  1. Central part where row < MAX_H && col < MAX_W is always true. Create an additional kernel for this area. if is unnecessary here.

  2. Border areas that will use your diff kernel.

Obviously you'll have to modify the code that calls the kernels.
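As a minimal sketch of what the interior kernel might look like (the diff_interior name is mine; the border strips would still be handled by your original diff kernel):

__global__ void diff_interior(const unsigned char *data1,
                              const unsigned char *data2,
                              int *data_res)
{
    // Launched over a grid that covers only the region where the bounds
    // condition is guaranteed to hold, so no 'if' is needed.
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row*blockDim.x;
    data_res[v] = (int) data2[v] - (int) data1[v];
}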


And on a separate note:

  1. The GPU has a throughput-oriented architecture, not a latency-oriented one like the CPU. This means the CPU may be faster than CUDA when it comes to processing small amounts of data. Have you tried larger data sets?

  2. The CUDA profiler is a very handy tool that will tell you where your code is not optimal.

u354356007
  • Eliminating divergence altogether is not a useful pursuit. Minimizing it is more appropriate, e.g. by appropriately distributing work among warps and threads. – einpoklum May 29 '22 at 21:45