CUDA - big number and memory allocation

Question

I have a very strange bug in my program. I spent many hours on it but have not found a solution. I wrote a simple program to reproduce the issue; maybe someone can help me. I tried cuda-memcheck and the approach from "What is the canonical way to check for errors using the CUDA runtime API?", but I don't get any errors.

Details:

nvcc version - V6.0.1

gcc version - 4.8.1

Full code:

#include <stdio.h>

__constant__ unsigned long long int bigNumber = 83934243334343;
__device__ bool isFound = false;
__global__ void kernel(int *dev_number) {

    unsigned long long int id = threadIdx.x + (blockIdx.x * blockDim.x);
    while (id < bigNumber && isFound==false) {

        if(id == 10) {
            *dev_number = 4;
            isFound=true;
        }
        id++;
    }
}

int main(int argc, char *argv[]) {
    int number = 0;
    int *dev_number;

    printf("Number: %d\n", number);

    return 0;
}

Compilation and run:

nvcc myprogram.cu
./myprogram

When I run this program, it does not return. But when bigNumber has a smaller value, or when I don't use cudaMalloc and cudaMemcpy, it works (meaning return 0 is reached). What does allocating memory for another variable have to do with the constant bigNumber? What's the problem?

shouldn't `839299365868340224` be `839299365868340224ULL`? otherwise it is an int. — mch, Apr 13 '15 at 14:06

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

Now that you've modified the code to something more sensible, I get an immediate result with the following modification:

__device__ volatile bool isFound = false;

The volatile qualifier forces the compiler to omit any optimizations that would prevent each thread from reading the global copy of the variable.

From the documentation:

The compiler is free to optimize reads and writes to global or shared memory (for example, by caching global reads into registers or L1 cache) as long as it respects the memory ordering semantics of memory fence functions (Memory Fence Functions) and memory visibility semantics of synchronization functions (Synchronization Functions).

These optimizations can be disabled using the volatile keyword: If a variable located in global or shared memory is declared as volatile, the compiler assumes that its value can be changed or used at any time by another thread and therefore any reference to this variable compiles to an actual memory read or write instruction.

If you omit the volatile qualifier, only one thread takes the early-exit condition (isFound), and all the others must loop for a very long time until their id value exceeds bigNumber.
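For reference, here is a minimal sketch of the whole program with the volatile change applied. The question's listing does not show the host-side calls, so the cudaMalloc/cudaMemcpy sequence, the launch configuration (2 blocks of 64 threads), the CUDA_CHECK macro, and the runSearch helper are all assumptions for illustration, not the asker's actual code.

```cuda
#include <cstdio>
#include <cstdlib>

// Hypothetical error-checking helper (not part of the original question).
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

__constant__ unsigned long long int bigNumber = 83934243334343ULL;

// volatile: each loop iteration re-reads the flag from memory instead of
// reusing a value the compiler cached in a register.
__device__ volatile bool isFound = false;

__global__ void kernel(int *dev_number) {
    unsigned long long int id = threadIdx.x + (blockIdx.x * blockDim.x);
    while (id < bigNumber && !isFound) {
        if (id == 10) {
            *dev_number = 4;
            isFound = true;  // lets every other thread leave its loop
        }
        id++;
    }
}

// Hypothetical host wrapper. The flag is never reset, so this is meant to
// be called once per process.
int runSearch() {
    int number = 0;
    int *dev_number = NULL;

    CUDA_CHECK(cudaMalloc(&dev_number, sizeof(int)));
    CUDA_CHECK(cudaMemcpy(dev_number, &number, sizeof(int),
                          cudaMemcpyHostToDevice));

    kernel<<<2, 64>>>(dev_number);  // assumed launch configuration
    CUDA_CHECK(cudaGetLastError());

    // This blocking copy also waits for the kernel to finish.
    CUDA_CHECK(cudaMemcpy(&number, dev_number, sizeof(int),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev_number));
    return number;
}

int main() {
    printf("Number: %d\n", runSearch());
    return 0;
}
```

Without volatile on isFound, the same program can appear to hang, because the threads that never reach id == 10 keep counting toward bigNumber.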
