Avoiding thread divergence in a CUDA kernel with conditional assignments

Question

I'm trying to find ways to avoid thread divergence (branching or warp divergence) in my CUDA kernel.

For instance, I have the following conditional assignment (a and b are char values, x and y are unsigned int values):

if (a == b) { ++x; }
else        { ++y; }

Or, alternatively:

if (a == b) { ++x; }
if (a != b) { ++y; }

How can the above operations be re-written to avoid branching?

I've looked in the type casting intrinsics, but there is no casting available from bool to int. I'm thinking there might be some trick with min, max and absolute values (e.g., __sad) to obtain the appropriate integer result to add for each case (i.e., 1, 0 or 0, 1).

There doesn't seem to be a regular int absolute value function, but what I do see is:

Calculate | x − y | + z , the sum of absolute difference.
__device__  unsigned int __sad ( int  x, int  y, unsigned int  z )

Which I suppose I could provide a z = 0 argument to, in order to get a normal absolute value. Maybe something along the lines of:

const unsigned int mu = __sad(a, b, 1);
const unsigned int mv = __sad(a, b, 0);
const int u = __nv_min(1, mu);
const int v = __nv_min(1, mv);
x += u;
y += v;

However, there is no min function (see related question).

`x+=(a==b); y+=(a!=b);` see [here](https://stackoverflow.com/questions/5369770/bool-to-int-conversion) However the compiler does this for you (avoidance of branching) using something called predication, so this is probably a waste of your valuable programming time and energy — Robert Crovella, Sep 19 '19 at 01:22
@RobertCrovella Thank you - I confirmed this works in this case. I didn't want to make assumptions about the compiler optimizations supported by `nvcc`. Is there documentation on the optimizations supported for that compiler? This is the base case for some dynamic programming that is four conditionals deep in if/then/else statements, and I want to make sure I've eliminated all branching. — Parker, Sep 19 '19 at 01:35
I'm not aware of any such documentation. But given that there is no documentation and indeed no specification for how the compiler will convert your C++ source code to SASS (the assembly language executed by the GPU), its not really clear to me what sort of guarantee you have at all, based on source code. You can come up with any formulation you want. The compiler is free to use any method to generate executable code to realize your formulation. Just because you have no obvious signs of branching in your code does not mean that the compiler won't use branching. Anyway, good luck! — Robert Crovella, Sep 19 '19 at 01:43
@RobertCrovella I sincerely appreciate the info. I will have a look at the assembly files generated by the compiler. As a newcomer to CUDA, I wish examples like this were part of the [CUDA C Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html). I feel like I'm fumbling in the dark with this API, and it would save a lot of guesswork. — Parker, Sep 19 '19 at 01:49
@vallismortis unofrtunately its an API that is quite hard to learn, specially at that level.... In general, I suggest you code as it comes naturally, and once completed, you profile and then if needed improve (based on profiler information). E.g. this branching may not have impact in the overall execution if your bottleneck is memory. Optimize after you have a working code. — Ander Biguri, Sep 19 '19 at 08:44
@AnderBiguri I agree with you. I do have a working kernel (successfully ported from C++), and I'm currently in the optimization stage. — Parker, Sep 19 '19 at 13:40

score 5 · Accepted Answer · edited Feb 06 '21 at 03:32

tl;dr: Consider avoiding such supposed-micro-optimizations.

Let's see if we can determine what differences there are (if any) from the original formulation suggested in the question:

if (a == b) { ++x; }
else        { ++y; }

and the formulation suggested in another answer:

x += (a == b);
y += (a != b);

we'll use this test code:

$ cat t1513.cu
__global__ void k(char a, char b, unsigned int *dx, unsigned int *dy){

    unsigned int x=*dx;
    unsigned int y=*dy;
#ifndef USE_OPT
    if (a == b)
{
    ++x;
} else {
    ++y;
}
#else
x += (a == b);
y += (a != b);
#endif

    *dy = y;
    *dx = x;
}


$ nvcc -c t1513.cu
$ cuobjdump -sass t1513.o >out1.sass
$ nvcc -c t1513.cu -DUSE_OPT
$ cuobjdump -sass t1513.o >out2.sass
$ diff out1.sass out2.sass
28,29c28,29
<         /*0078*/                   BFE R7, R7, 0x1000;          /* 0x7000c0400071dc23 */
<                                                                 /* 0x22e04283f2828287 */
---
>         /*0078*/                   BFE R9, R7, 0x1000;          /* 0x7000c04000725c23 */
>                                                                 /* 0x22804283f2804287 */
31,41c31,41
<         /*0090*/                   ISET.EQ.AND R7, R8, R7, PT;  /* 0x110e00001c81dc23 */
<         /*0098*/                   LOP32I.AND R7, R7, 0x1;      /* 0x380000000471dc02 */
<         /*00a0*/                   LOP32I.XOR R8, R7, 0x1;      /* 0x3800000004721c82 */
<         /*00a8*/                   IADD R8, R6, R8;             /* 0x4800000020621c03 */
<         /*00b0*/                   IADD R7, R0, R7;             /* 0x480000001c01dc03 */
<         /*00b8*/                   ST.E [R4], R8;               /* 0x9400000000421c85 */
<                                                                 /* 0x200000000002f047 */
<         /*00c8*/                   ST.E [R2], R7;               /* 0x940000000021dc85 */
<         /*00d0*/                   EXIT;                        /* 0x8000000000001de7 */
<         /*00d8*/                   BRA 0xd8;                    /* 0x4003ffffe0001de7 */
<         /*00e0*/                   NOP;                         /* 0x4000000000001de4 */
---
>         /*0090*/                   ISET.NE.AND R7, R8, R9, PT;  /* 0x128e00002481dc23 */
>         /*0098*/                   ISET.EQ.AND R8, R8, R9, PT;  /* 0x110e000024821c23 */
>         /*00a0*/                   LOP32I.AND R7, R7, 0x1;      /* 0x380000000471dc02 */
>         /*00a8*/                   IADD R7, R6, R7;             /* 0x480000001c61dc03 */
>         /*00b0*/                   LOP32I.AND R6, R8, 0x1;      /* 0x3800000004819c02 */
>         /*00b8*/                   ST.E [R4], R7;               /* 0x940000000041dc85 */
>                                                                 /* 0x2000000002f04287 */
>         /*00c8*/                   IADD R6, R0, R6;             /* 0x4800000018019c03 */
>         /*00d0*/                   ST.E [R2], R6;               /* 0x9400000000219c85 */
>         /*00d8*/                   EXIT;                        /* 0x8000000000001de7 */
>         /*00e0*/                   BRA 0xe0;                    /* 0x4003ffffe0001de7 */
$

Studying the above diff output we see:

There is no branching (and indeed not even any predication) in either realization.
The supposedly "optimized" case is nearly identical, except that it is 1 instruction longer than the if/else case.

Yes, I understand this is not "your code". I can only work with what is presented.

This gives me the intuition that these types of transformations:

Require effort (potentially wasted time)
May not yield any improvement in performance
May obfuscate the code, making maintenance more difficult

Proceed as you wish, of course.

This is incredibly helpful - you just opened up a whole new world for me here. Ultimately, I was able to reduce (not eliminate) branching by simplifying the conditionals through pure boolean expressions, which eliminated the need for many intermediate variables and tests. This debugging technique with the `sass` files was also an important missing piece for me. — Parker, Sep 19 '19 at 19:22
@vallismortis: Robert beat me to this answer. I was going to write a one-liner "Don't bother with this stuff before you've seen what the compiler produces". Remember NVIDIA has compiler engineers for this kind of optimizations. Not that you can always just trust them, but your example is a really trivial and common case. — einpoklum, Sep 19 '19 at 19:30
@einpoklum This exercise has removed several of the black boxes around the kernel and `nvcc`, and I've made more progress on understanding what is going on under the hood in half a day than in the entire previous week. This has been an excellent learning experience. — Parker, Sep 19 '19 at 19:41

Parker · Answer 2 · 2019-09-19T19:34:06.897

1

As a helpful comment pointed out, I was overthinking the problem. The following works, and uses a simple bool to int conversion:

x += (a == b);
y += (a != b);

Examining the PTX assembly file before and after this change (several places in the kernel), the number of branches was reduced from 39 to 9, so this made a significant change. The nvcc compiler did not optimize these out on its own, particularly in cases where there were if/then/else statements two or three levels deep, as in:

bool ag = (ca == '.');
bool bg = (cb == '.');

bool agx = ag && apg;
bool bgx = bg && bpg;
bool gx = agx || bgx;

if (ag || bg)
{
    if (ag && bg)
    {
        // ignore
    } else {
        if (!gx)
        {
            ++gs;
            ++ps;
        }
        apg = ag;
        bpg = bg;
    }
} else {
    if (ca == cb)
    {
        ++ms;
        ++ps;
    } else {
        ++ns;
        ++ps;
    }
    apg = false;
    bpg = false;
}

Once I was able to reduce all of the assignments to boolean expressions (here are two out of the six assignments after conversion from the original kernel):

apg = (apg && !!(ag && bg)) || ((ag || bg) && !(ag && bg) && ag)
bpg = (bpg && !!(ag && bg)) || ((ag || bg) && !(ag && bg) && bg)

I was able to simplify those expressions:

apg  = (ag && !bg) || (ag && apg)
bpg  = (!ag && bg) || (bg && bpg)

And in two cases I was able to combine several expressions (multiple assignments) into a single boolean expression. Ultimately, the full set of conditionals reduced to:

ps += ((ca != '.') && (cb != '.')) || ((ca != '.') && !bpg) || ((cb != '.') && !apg);
ms += (ca == cb) && (ca != '.') && (cb != '.');
apg = ((ca == '.') && (cb != '.')) || ((ca == '.') && apg);
bpg = ((ca != '.') && (cb == '.')) || ((cb == '.') && bpg);

Based on the method from this answer, I found that the number of real branches in my kernel were ultimately reduced from 39 to 12:

cuobjdump -sass kernel_original.o > kernel_original.sass
grep BRA kernel.sass | wc -l
39

cuobjdump -sass kernel_simplified.o > kernel_simplified.sass
grep BRA kernel_opt.sass | wc -l
12

edited Sep 19 '19 at 19:34

answered Sep 19 '19 at 01:41

Parker

7,244
12
70
92

1

PTX analysis is less useful for making these determinations. It's not a reliable predictor of SASS code. PTX gets converted to SASS by the `ptxas` sub-tool inside of `nvcc`, and it is an optimizing compiler in its own right. – Robert Crovella Sep 19 '19 at 16:28
@RobertCrovella You're right, I just discovered that as well. After eliminating all of the conditionals from my kernel, I am back to 35 branches in the PTX file. – Parker Sep 19 '19 at 16:53
@RobertCrovella Interestingly, after simplifying the boolean expressions, I am back down to 9 branches in the PTX file. Some of the changes I made definitely made a difference, but I could easily undo it by over-complicating the expressions. – Parker Sep 20 '19 at 02:08
1

Combining a bunch of boolean expressions via boolean algebra in my opinion is different than the question you posed. It doesn't surprise me that the compiler is not able to do all forms of boolean reduction that you can come up with. But your original question was not about that, and it appears to me that the basic substitution you proposed in your question was just the first step in the process to getting to more efficient code. That is certainly a valid approach, but not at all what I read in your original question. You've provided new info in your answer. How could anyone know all that? – Robert Crovella Sep 20 '19 at 02:11
1

And to beat a dead horse, no one knows whether or not this is a performance limiter to your code. There is a lot of conventional wisdom that says you should do analysis driven optimization - use a tool to define the performance limiters in your code, then go after those. It's possible that warp divergence is a limiter in your code, but I see no evidence that you've ascertained that, and it's rare in my experience. Again, reduction of complex boolean arithmetic via boolean algebra is not the same as reducing warp divergence. – Robert Crovella Sep 20 '19 at 02:14
@RobertCrovella I can't disagree with you on this. I selected your answer as __the answer__ because it specifically and completely answered my question. I provided _this answer_ because this is where _your answer_ ultimately led me, and I hope it will be helpful to other newcomers to CUDA (this has been enlightening for me and I don't want others to miss out on what I've learned here). I will do some performance testing and will report back here for completeness. – Parker Sep 20 '19 at 02:15

tera · Answer 3 · 2019-09-20T11:20:15.797

tl;dr: Consider the larger picture first before applying such supposed-micro-optimizations.

Looking at Robert's example code, my first thought was

++*( (a==b) ? &x : &y);

However I was on my mobile phone and could not check the disassembly of this myself.

Robert was kind enough to insert it into his test kernel and posted the SASS diff of this idea vs. the original if/else code posted in the question:

$ cuobjdump -sass t1513.o >out3.sass
$ diff out1.sass out3.sass
13,44c13,52
<                                                                 /* 0x2230427042004307 */
<         /*0008*/                   MOV R1, c[0x0][0x44];        /* 0x2800400110005de4 */
<         /*0010*/                   MOV R4, c[0x0][0x150];       /* 0x2800400540011de4 */
<         /*0018*/                   MOV R5, c[0x0][0x154];       /* 0x2800400550015de4 */
<         /*0020*/                   MOV R2, c[0x0][0x148];       /* 0x2800400520009de4 */
<         /*0028*/                   MOV R3, c[0x0][0x14c];       /* 0x280040053000dde4 */
<         /*0030*/                   LD.E R6, [R4];               /* 0x8400000000419c85 */
<         /*0038*/                   LDC.U8 R7, c[0x0][0x141];    /* 0x1400000507f1dc06 */
<                                                                 /* 0x2272028042824047 */
<         /*0048*/                   LD.E R0, [R2];               /* 0x8400000000201c85 */
<         /*0050*/                   LDC.U8 R8, c[0x0][0x140];    /* 0x1400000503f21c06 */
<         /*0058*/                   I2I.S16.S8 R7, R7;           /* 0x1c0000001c11de84 */
<         /*0060*/                   I2I.S16.S8 R8, R8;           /* 0x1c00000020121e84 */
<         /*0068*/                   LOP32I.AND R7, R7, 0xff;     /* 0x38000003fc71dc02 */
<         /*0070*/                   LOP32I.AND R8, R8, 0xff;     /* 0x38000003fc821c02 */
<         /*0078*/                   BFE R7, R7, 0x1000;          /* 0x7000c0400071dc23 */
<                                                                 /* 0x22e04283f2828287 */
<         /*0088*/                   BFE R8, R8, 0x1000;          /* 0x7000c04000821c23 */
<         /*0090*/                   ISET.EQ.AND R7, R8, R7, PT;  /* 0x110e00001c81dc23 */
<         /*0098*/                   LOP32I.AND R7, R7, 0x1;      /* 0x380000000471dc02 */
<         /*00a0*/                   LOP32I.XOR R8, R7, 0x1;      /* 0x3800000004721c82 */
<         /*00a8*/                   IADD R8, R6, R8;             /* 0x4800000020621c03 */
<         /*00b0*/                   IADD R7, R0, R7;             /* 0x480000001c01dc03 */
<         /*00b8*/                   ST.E [R4], R8;               /* 0x9400000000421c85 */
<                                                                 /* 0x200000000002f047 */
<         /*00c8*/                   ST.E [R2], R7;               /* 0x940000000021dc85 */
<         /*00d0*/                   EXIT;                        /* 0x8000000000001de7 */
<         /*00d8*/                   BRA 0xd8;                    /* 0x4003ffffe0001de7 */
<         /*00e0*/                   NOP;                         /* 0x4000000000001de4 */
<         /*00e8*/                   NOP;                         /* 0x4000000000001de4 */
<         /*00f0*/                   NOP;                         /* 0x4000000000001de4 */
<         /*00f8*/                   NOP;                         /* 0x4000000000001de4 */
---
>                                                                      /* 0x2270420042304307 */
>         /*0008*/                   MOV R1, c[0x0][0x44];             /* 0x2800400110005de4 */
>         /*0010*/                   MOV R10, c[0x0][0x148];           /* 0x2800400520029de4 */
>         /*0018*/                   IADD32I R1, R1, -0x8;             /* 0x0bffffffe0105c02 */
>         /*0020*/                   MOV R11, c[0x0][0x14c];           /* 0x280040053002dde4 */
>         /*0028*/                   LDC.U8 R0, c[0x0][0x141];         /* 0x1400000507f01c06 */
>         /*0030*/                   MOV R8, c[0x0][0x150];            /* 0x2800400540021de4 */
>         /*0038*/                   MOV R9, c[0x0][0x154];            /* 0x2800400550025de4 */
>                                                                      /* 0x2232423240423047 */
>         /*0048*/                   LD.E R4, [R10];                   /* 0x8400000000a11c85 */
>         /*0050*/                   I2I.S16.S8 R0, R0;                /* 0x1c00000000101e84 */
>         /*0058*/                   LD.E R5, [R8];                    /* 0x8400000000815c85 */
>         /*0060*/                   LDC.U8 R2, c[0x0][0x140];         /* 0x1400000503f09c06 */
>         /*0068*/                   LOP32I.AND R0, R0, 0xff;          /* 0x38000003fc001c02 */
>         /*0070*/                   I2I.S16.S8 R2, R2;                /* 0x1c00000008109e84 */
>         /*0078*/                   BFE R0, R0, 0x1000;               /* 0x7000c04000001c23 */
>                                                                      /* 0x2283f282b2028287 */
>         /*0088*/                   LOP32I.AND R2, R2, 0xff;          /* 0x38000003fc209c02 */
>         /*0090*/                   BFE R3, R2, 0x1000;               /* 0x7000c0400020dc23 */
>         /*0098*/                   ISETP.NE.AND P0, PT, R3, R0, PT;  /* 0x1a8e00000031dc23 */
>         /*00a0*/                   LOP.OR R3, R1, c[0x0][0x24];      /* 0x680040009010dc43 */
>         /*00a8*/               @P0 IADD32I R3, R3, 0x4;              /* 0x080000001030c002 */
>         /*00b0*/                   LOP32I.AND R3, R3, 0xffffff;      /* 0x3803fffffc30dc02 */
>         /*00b8*/                   SEL R0, R4, R5, !P0;              /* 0x2010000014401c04 */
>                                                                      /* 0x22f042e3f2e28047 */
>         /*00c8*/                   STL.64 [R1], R4;                  /* 0xc800000000111ca5 */
>         /*00d0*/                   IADD32I R0, R0, 0x1;              /* 0x0800000004001c02 */
>         /*00d8*/                   STL [R3], R0;                     /* 0xc800000000301c85 */
>         /*00e0*/                   LDL.64 R6, [R1];                  /* 0xc000000000119ca5 */
>         /*00e8*/                   ST.E [R8], R7;                    /* 0x940000000081dc85 */
>         /*00f0*/                   ST.E [R10], R6;                   /* 0x9400000000a19c85 */
>         /*00f8*/                   EXIT;                             /* 0x8000000000001de7 */
>         /*0100*/                   BRA 0x100;                        /* 0x4003ffffe0001de7 */
>         /*0108*/                   NOP;                              /* 0x4000000000001de4 */
>         /*0110*/                   NOP;                              /* 0x4000000000001de4 */
>         /*0118*/                   NOP;                              /* 0x4000000000001de4 */
>         /*0120*/                   NOP;                              /* 0x4000000000001de4 */
>         /*0128*/                   NOP;                              /* 0x4000000000001de4 */
>         /*0130*/                   NOP;                              /* 0x4000000000001de4 */
>         /*0138*/                   NOP;                              /* 0x4000000000001de4 */
$

Robert concluded that the compiler chose to use predication in this case.

The disassembly seemed to make no sense to me, until I realised that Robert inserted my one-liner in a different way than I expected. In trying to stay close to the (most likely accurately) presumed intentions of the questioner, he dereferenced the pointers into automatic variables, then inserted my one-liner (which really makes little sense in that case because taking the address of automatic variables forces them into local memory), and wrote the content of the the automatic variables back to global memory.

My thought however was to just replace the entire body of the test case with my ++*( (a==b) ? dx : dy); one-liner, which would have led to better looking assembly:

        /*0008*/                   MOV R1, c[0x0][0x44];             /* 0x2800400110005de4 */
        /*0010*/                   LDC.U8 R0, c[0x0][0x141];         /* 0x1400000507f01c06 */
        /*0018*/                   LDC.U8 R2, c[0x0][0x140];         /* 0x1400000503f09c06 */
        /*0020*/                   I2I.S16.S8 R0, R0;                /* 0x1c00000000101e84 */
        /*0028*/                   I2I.S16.S8 R2, R2;                /* 0x1c00000008109e84 */
        /*0030*/                   LOP32I.AND R0, R0, 0xff;          /* 0x38000003fc001c02 */
        /*0038*/                   LOP32I.AND R2, R2, 0xff;          /* 0x38000003fc209c02 */
                                                                     /* 0x228202c042804237 */
        /*0048*/                   BFE R0, R0, 0x1000;               /* 0x7000c04000001c23 */
        /*0050*/                   BFE R3, R2, 0x1000;               /* 0x7000c0400020dc23 */
        /*0058*/                   MOV R2, c[0x0][0x148];            /* 0x2800400520009de4 */
        /*0060*/                   ISETP.NE.AND P0, PT, R3, R0, PT;  /* 0x1a8e00000031dc23 */
        /*0068*/                   MOV R0, c[0x0][0x14c];            /* 0x2800400530001de4 */
        /*0070*/                   SEL R2, R2, c[0x0][0x150], !P0;   /* 0x2010400540209c04 */
        /*0078*/                   SEL R3, R0, c[0x0][0x154], !P0;   /* 0x201040055000dc04 */
                                                                     /* 0x20000002f04283f7 */
        /*0088*/                   LD.E R0, [R2];                    /* 0x8400000000201c85 */
        /*0090*/                   IADD32I R4, R0, 0x1;              /* 0x0800000004011c02 */
        /*0098*/                   ST.E [R2], R4;                    /* 0x9400000000211c85 */
        /*00a0*/                   EXIT;                             /* 0x8000000000001de7 */
        /*00a8*/                   BRA 0xa8;                         /* 0x4003ffffe0001de7 */
        /*00b0*/                   NOP;                              /* 0x4000000000001de4 */
        /*00b8*/                   NOP;                              /* 0x4000000000001de4 */

This code looks better to me than Robert's testcase (by itself). But it probably is of no use to vallismortis, because in his case the variables will not be in addressable memory.

Of course, Robert's other comment about premature optimisation also applies here, even if this should actually result in faster code.

Thank you Robert for adding the disassembly, and for not using the word "stupid" anywhere. ;) — tera, Sep 20 '19 at 11:20

Avoiding thread divergence in a CUDA kernel with conditional assignments

3 Answers3

tl;dr: Consider avoiding such supposed-micro-optimizations.

tl;dr: Consider the larger picture first before applying such supposed-micro-optimizations.