CUDA : thread divergence optimization

Question

I am trying to understand thread divergence, and I have a few different questions.

Is there any performance benefit to disabling a thread when it doesn't need to do the computation? For example:

__global__ void kernel_1()
{
    int i = f();

    // We know that if this condition is false, i is less than g()
    if(threadIdx.x < 5)
    {
        i = min(g(), i);
    }
}

__global__ void kernel_2()
{
    int i = f();
    i = min(g(), i);
}

Which kernel is better?

Does CUDA define "thread divergence" by considering only the source code path? For example:

__global__ void kernel_3()
{
    if(threadIdx.x < 5)
    {
        int i = g();
        printf("hello\n");
    }
    else
    {
        int i = g();
        printf("hello\n");
    }
}

In this code, both branches contain exactly the same code. So does the warp diverge or not?

score 2 · Accepted Answer · answered Dec 06 '22 at 23:03

Which kernel is better?

My expectation is that the first kernel is better, but there may be little or no measurable difference in performance.

Since you haven't given the definition of g(), it's possible that g() either does something that has limited throughput or generates memory traffic. Either of those is worth avoiding for threads that don't need the result. An example of a "limited throughput" operation would be any operation for which the arithmetic instruction throughput table in the programming guide lists fewer than 32 results per clock cycle per multiprocessor. If g() does neither of these things, then it's likely that there is no measurable difference between the two cases.
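To make the memory-traffic case concrete, here is a minimal sketch. The question does not define f() or g(), so the bodies below are hypothetical stand-ins: g() is assumed to be a global-memory load and f() a cheap register-only computation, with the table filled so that the question's precondition holds (for threads 5 and up, i is already less than g()). The kernel names `guarded` and `unguarded` are also illustrative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for g(): a global-memory load (generates memory traffic).
__device__ int g(const int *table) { return table[threadIdx.x]; }

// Hypothetical stand-in for f(): cheap, register-only arithmetic.
__device__ int f() { return 100 + threadIdx.x; }

// Shape of kernel_1: only lanes 0..4 call g().
__global__ void guarded(const int *table, int *out)
{
    int i = f();
    if (threadIdx.x < 5)
    {
        i = min(g(table), i);
    }
    out[threadIdx.x] = i;
}

// Shape of kernel_2: every lane calls g().
__global__ void unguarded(const int *table, int *out)
{
    int i = f();
    i = min(g(table), i);
    out[threadIdx.x] = i;
}

int main()
{
    const int n = 32;  // one warp
    int h_table[n], h_a[n], h_b[n];
    // For t >= 5 the table value (1000) exceeds f(), so min() leaves i unchanged.
    for (int t = 0; t < n; ++t) h_table[t] = (t < 5) ? t : 1000;

    int *d_table, *d_a, *d_b;
    cudaMalloc(&d_table, n * sizeof(int));
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, n * sizeof(int));
    cudaMemcpy(d_table, h_table, n * sizeof(int), cudaMemcpyHostToDevice);

    guarded<<<1, n>>>(d_table, d_a);
    unguarded<<<1, n>>>(d_table, d_b);

    // cudaMemcpy on the default stream waits for the kernels to finish.
    cudaError_t err = cudaMemcpy(h_a, d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_b, d_b, n * sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
    {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    bool same = true;
    for (int t = 0; t < n; ++t) same = same && (h_a[t] == h_b[t]);
    printf("%s\n", same ? "results match" : "results differ");

    cudaFree(d_table);
    cudaFree(d_a);
    cudaFree(d_b);
    return same ? 0 : 1;
}
```

Both kernels should compute the same values under that precondition; the difference is only in how many lanes issue the load. With a single warp and a tiny table the timing difference is likely to be lost in the noise, so this is meant for inspecting with a profiler rather than as a benchmark.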

So does the warp diverge or not?

The warp diverges. You can verify this by inspecting the generated machine code (SASS) with the CUDA binary utilities, for example `cuobjdump -sass`.
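As a rough run-time complement to reading the SASS, the sketch below records `__activemask()` in each branch of a kernel shaped like kernel_3. The kernel name `record_masks` and the output buffer are assumptions made for illustration. The result depends on the compiler, optimization level, and architecture, and `__activemask()` only reports which lanes happen to be executing together at that instruction, so treat it as a hint and the disassembly as the authoritative check.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same structure as kernel_3: both branches contain identical code.
// Each lane stores the mask of lanes that were active with it in its branch.
__global__ void record_masks(unsigned *mask)
{
    if (threadIdx.x < 5)
    {
        mask[threadIdx.x] = __activemask();
    }
    else
    {
        mask[threadIdx.x] = __activemask();
    }
}

int main()
{
    const int n = 32;  // one warp
    unsigned h_mask[n];
    unsigned *d_mask;
    cudaMalloc(&d_mask, n * sizeof(unsigned));

    record_masks<<<1, n>>>(d_mask);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_mask, d_mask, n * sizeof(unsigned),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
    {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Lane 0 took the "if" branch, lane 5 took the "else" branch.
    printf("lane 0 saw mask 0x%08x\n", h_mask[0]);
    printf("lane 5 saw mask 0x%08x\n", h_mask[5]);

    cudaFree(d_mask);
    return 0;
}
```

If the two lanes report different masks that each cover only their own branch, the warp was split at that point. If both report a full mask, the compiler may have merged the identical branches in this particular build, which says nothing about the printf version in the question.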

Robert Crovella


Another related question: the [blog](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/) says `the data exchange primitives can be used in thread-divergent branches` and gives an example with `__shfl_sync()`. If the warp diverges, how can this work? As I understand it, some of the threads are disabled in each branch, which should cause a deadlock. Does the compiler detect these simple conditional constructs and treat them specially? – rafoo Dec 06 '22 at 23:20
This is a rather unusual aspect of these shuffle sync primitives in the cc7.0+ case that I describe [here](https://stackoverflow.com/questions/71152284/cuda-independent-thread-scheduling/71156298#71156298). That is as much description as I will offer. – Robert Crovella Dec 07 '22 at 02:25
