
My default assumption when I think I've found an error in the compiler is:

No you didn't. Look again.

But this really looks like a compiler optimization bug. Let's jump straight to the code:

#include <stdint.h>
#include <cuda_runtime.h>
#include <stdio.h>

__declspec(noinline)
__device__ uint32_t Part1(uint64_t n)
{
    uint32_t result = 0;

#pragma unroll 31
    for (int x = 0; x < 33; x++)
    {
        result <<= 1;
        result |= 1;

        uint64_t Tmp = (uint64_t)result * result;
        uint64_t Tmp2 = n >> (64 - (x * 2));

        bool b = Tmp2 < Tmp;
        if (b)
        {
            result ^= 1;
        }
    }

    return result;
}

__global__ void Kernel(void)
{
    for (uint64_t x = 1234; x < 30000; x += 3333)
    {
        uint64_t res = Part1(x);
        printf("sqrt(%llu) = %llu\n", x, res);
    }
}

int main()
{
    cudaError_t ce = cudaSetDevice(0);

    ce = cudaLaunchKernel(Kernel, 1, 1, nullptr, 0, 0);
    ce = cudaDeviceSynchronize();
}

As written, the code works correctly.

sqrt(1234) = 35
sqrt(4567) = 67
sqrt(7900) = 88
sqrt(11233) = 105
sqrt(14566) = 120
sqrt(17899) = 133
sqrt(21232) = 145
sqrt(24565) = 156
sqrt(27898) = 167

Comment out the #pragma, and the routine always returns 1.

sqrt(1234) = 1
sqrt(4567) = 1
sqrt(7900) = 1
sqrt(11233) = 1
sqrt(14566) = 1
sqrt(17899) = 1
sqrt(21232) = 1
sqrt(24565) = 1
sqrt(27898) = 1

Any unroll value from 1-31 works correctly, but 32 (or omitting the line completely) fails. It also works correctly if optimizations are disabled.

While my build environment is Visual Studio 2022, it might be easier to examine this issue on Godbolt. There you can easily switch between CUDA Toolkit 11.8 (which is what I'm using) and 12.0.1 (the latest Godbolt supports). All versions show the same incorrect behavior.

As a pretty simple clue about what's going wrong, here's the PTX for Part1() from the optimized build:

.func (.param .b32 func_retval0) _Z5Part1y(
.param .b64 _Z5Part1y_param_0
)
{
.reg .pred %p<2>;
.reg .b32 %r<2>;
.reg .b64 %rd<2>;

ld.param.u64 %rd1, [_Z5Part1y_param_0];
setp.ne.s64 %p1, %rd1, 0;
selp.u32 %r1, 1, 0, %p1;
st.param.b32 [func_retval0+0], %r1;
ret;

}

For whatever reason, nvcc has optimized almost the entire routine away, in a way that alters the functionality of the code. Oops.

So I've got two questions:

  1. Have I really found a compiler optimization bug? Or have I missed something obvious here?
  2. I've had mixed success reporting bugs to the CUDA team (seems I'm a poor communicator). Maybe someone who's more of a CUDA expert can explain this problem to them in a way they'll understand?

To answer the obvious "Why don't you just...", this code is extracted from a larger routine where the more obvious solutions aren't as viable.

For completeness, here's my build information on Windows:

I'm on Windows 10, building x64 with Visual Studio 2022 (17.6.3), using CUDA Toolkit v11.8, targeting cc 7.5.

Here's the build command (wrapped to (slightly) improve readability):

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\nvcc.exe"
-gencode=arch=compute_75,code=\"sm_75,compute_75\" 
--use-local-env 
-ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64" 
-x cu   
-I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include" 
-I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include"     
--keep-dir x64\Release  
-maxrregcount=0  --machine 64 --compile -cudart static    
-DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS 
-Xcompiler "/EHsc /W3 /nologo /O2 /Fdx64\Release\vc143.pdb /FS   /MD " 
-o C:\vss\CudaBug\x64\Release\kernel.cu.obj 
"C:\vss\CudaBug\kernel.cu"
David Wohlferd
    You don't need [CUDA](https://godbolt.org/z/xzEq83KzW) to duplicate the issue. – PaulMcKenzie Jun 22 '23 at 19:34
  • 1
    It appears paleonix and njuffa have put their finger on it. If I modify the original code to avoid shifting by 64, the problem is resolved. So, not really an optimization bug, just pretty standard UB. – David Wohlferd Jun 22 '23 at 21:18
  • A common trick is to mask the shift count in the source, like `n >> (count & 63)`. Since x86-64 already masks 64-bit scalar (not SIMD) shift counts that way, it's free unless the code auto-vectorizes. (Compilers do actually know how shifts work on the targets they compile for and optimize this away, as in [Best practices for circular shift (rotate) operations in C++](https://stackoverflow.com/q/776508)) – Peter Cordes Jun 23 '23 at 00:08

1 Answer


The reason for the behavior observed is that the code contains an instance of undefined behavior (UB). CUDA is a dialect of C++, so the ISO-C++ standard is relevant for determining whether UB is encountered.

When x is 0, uint64_t Tmp2 = n >> (64 - (x * 2)); effectively computes n >> 64. But, according to section 5.8 of the 2011 ISO C++ standard (the latest I have readily available, but this has not changed in decades), the following applies to shift operators:

The behavior is undefined if the right operand is negative, or greater than or equal to the length in bits of the promoted left operand.

Undefined behavior is just that: anything could happen. This includes, but is by no means limited to, the behavior a programmer might expect. The jocular claim is that UB could cause demons to fly out of the programmer's nose ("nasal demons").

njuffa