My default assumption when I think I've found an error in the compiler is:
No, you didn't. Look again.
But this really looks like a compiler optimization bug. Let's jump straight to the code:
#include <stdint.h>
#include <cuda_runtime.h>
#include <stdio.h>

// Bit-by-bit integer square root: each iteration tentatively sets the next
// result bit, then clears it again if the square would exceed the top bits of n.
__declspec(noinline)
__device__ uint32_t Part1(uint64_t n)
{
    uint32_t result = 0;
    #pragma unroll 31
    for (int x = 0; x < 33; x++)
    {
        result <<= 1;
        result |= 1;                                // tentatively set the low bit
        uint64_t Tmp = (uint64_t)result * result;   // candidate squared
        uint64_t Tmp2 = n >> (64 - (x * 2));        // top 2*x bits of n
        bool b = Tmp2 < Tmp;
        if (b)
        {
            result ^= 1;                            // too big: clear the bit again
        }
    }
    return result;
}

__global__ void Kernel(void)
{
    // Print a handful of test values.
    for (uint64_t x = 1234; x < 30000; x += 3333)
    {
        uint64_t res = Part1(x);
        printf("sqrt(%llu) = %llu\n", x, res);
    }
}

int main()
{
    cudaError_t ce = cudaSetDevice(0);
    ce = cudaLaunchKernel(Kernel, 1, 1, nullptr, 0, 0);
    ce = cudaDeviceSynchronize();
}
As written, the code works correctly.
sqrt(1234) = 35
sqrt(4567) = 67
sqrt(7900) = 88
sqrt(11233) = 105
sqrt(14566) = 120
sqrt(17899) = 133
sqrt(21232) = 145
sqrt(24565) = 156
sqrt(27898) = 167
Comment out the #pragma, and the routine always returns 1.
sqrt(1234) = 1
sqrt(4567) = 1
sqrt(7900) = 1
sqrt(11233) = 1
sqrt(14566) = 1
sqrt(17899) = 1
sqrt(21232) = 1
sqrt(24565) = 1
sqrt(27898) = 1
Any unroll value from 1 to 31 works correctly, but 32 (or omitting the line completely) fails. The code also works correctly if optimizations are disabled.
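For anyone who wants to try the unoptimized case from a bare command line rather than the Visual Studio Debug configuration, nvcc's device-debug switch turns off device code optimization; something along these lines should do it (a sketch only, and not necessarily the exact flag set the Debug configuration passes):

    nvcc -G -arch=sm_75 -o kernel kernel.cu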
While my build environment is Visual Studio 2022, it might be easier to examine this issue on Godbolt. There you can easily switch between CUDA Toolkit 11.8 (which is what I'm using) and 12.0.1 (the latest Godbolt supports). All versions show the same incorrect behavior.
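If it's easier to look at this locally instead, dumping the PTX with a bare nvcc invocation should show the same thing (a sketch, assuming the source above is saved as kernel.cu; device optimizations are on by default):

    nvcc -arch=sm_75 -ptx kernel.cu -o kernel.ptx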
As a pretty simple clue about what's going wrong, here's the PTX for Part1() from the optimized build:
.func (.param .b32 func_retval0) _Z5Part1y(
    .param .b64 _Z5Part1y_param_0
)
{
    .reg .pred %p<2>;
    .reg .b32 %r<2>;
    .reg .b64 %rd<2>;

    ld.param.u64 %rd1, [_Z5Part1y_param_0];
    setp.ne.s64 %p1, %rd1, 0;
    selp.u32 %r1, 1, 0, %p1;
    st.param.b32 [func_retval0+0], %r1;
    ret;
}
For whatever reason, nvcc has optimized almost the entire routine away, in a way that alters the functionality of the code. Oops.
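Reading the setp.ne.s64 / selp.u32 pair, the whole function has been reduced to the equivalent of the snippet below (my own paraphrase of the PTX, not compiler output; the name is just for illustration):

    // C-level equivalent of the optimized PTX above:
    // the parameter is only tested against zero, and the loop is gone.
    __device__ uint32_t Part1_as_optimized(uint64_t n)
    {
        return (n != 0) ? 1u : 0u;   // setp.ne.s64 + selp.u32 1, 0
    }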
So I've got two questions:
- Have I really found a compiler optimization bug? Or have I missed something obvious here?
- I've had mixed success reporting bugs to the CUDA team (seems I'm a poor communicator). Maybe someone who's more of a CUDA expert can explain this problem to them in a way they'll understand?
To answer the obvious "Why don't you just...", this code is extracted from a larger routine where the more obvious solutions aren't as viable.
For completeness, here's my build information on Windows:
I'm on Windows 10, building x64 with Visual Studio 2022 (17.6.3), using CUDA Toolkit v11.8, targeting cc 7.5.
Here's the build command (wrapped to (slightly) improve readability):
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\nvcc.exe"
-gencode=arch=compute_75,code=\"sm_75,compute_75\"
--use-local-env
-ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64"
-x cu
-I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include"
-I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include"
--keep-dir x64\Release
-maxrregcount=0 --machine 64 --compile -cudart static
-DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS
-Xcompiler "/EHsc /W3 /nologo /O2 /Fdx64\Release\vc143.pdb /FS /MD "
-o C:\vss\CudaBug\x64\Release\kernel.cu.obj
"C:\vss\CudaBug\kernel.cu"