atomicAdd causing error Unable to Launch/Execute Kernel

Question

I have the following CUDA C code:

 int i = threadIdx.x + blockIdx.x*blockDim.x;
 int stride = blockDim.x*gridDim.x;
 while(i < size)
 {
     atomicAdd(&(histo_private[buffer[i]]),1);
     i+=stride;
 }

which causes my program to crash with the error: "unable to launch/execute kernel"

Here buffer is an input array of integers to this function of size elements and histo_private is an array of integers in shared memory of histo_size elements. I know this isn't an index out of bounds error because when I use the code:

int i = threadIdx.x + blockIdx.x*blockDim.x;
int stride = blockDim.x*gridDim.x;
while(i < size)
{
     int a = histo_private[buffer[i]];
     i+=stride;
}

So I gather that there is something wrong with the atomicAdd function and/or the memory address of this 32-bit int array.

The kernel.cu file contains the following code:

// Define your kernels in this file you may use more than one kernel if you
// need to

// INSERT KERNEL(S) HERE

__global__ void histo_kernel(unsigned int* buffer, unsigned int size, int* histo, unsigned int histo_size)
{
    extern __shared__ int histo_private[];
    if(threadIdx.x < histo_size)
        histo_private[threadIdx.x] = 0;
    __syncthreads();

    // compute block's histogram
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    int stride = blockDim.x*gridDim.x;
    while(i < size)
    {
        //int a = histo_private[buffer[i]];
        atomicAdd(&(histo_private[buffer[i]]),1);
        i+=stride;
    }

    // store to global histogram
    __syncthreads();
    //if(threadIdx.x < histo_size)
    //  atomicAdd(&(histo[threadIdx.x]),histo_private[threadIdx.x]);
}   

// ensures that no bins contains more than 255 elements
__global__ void enforce_saturation(int* histo, unsigned int histo_size)
{
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if(i < histo_size)
    {
        if(histo[i] > 255)  // this will be necessary to prevent data loss
            histo[i] = 255; // when converting from int to uint8_t                          
    }
}

__global__ void construct_histo(uint8_t* histo_unpacked, int* histo, unsigned int histo_size)
{
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if(i < histo_size)
        histo_unpacked[i] = histo[i];
}

// unpacks the input array into an output array with 'spaces'
__global__ void unpack(uint8_t* in, uint8_t* out, unsigned int size)
{
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if(i < size)
    {
        out[4*i] = in[i];
        out[4*i+1] = 0;
        out[4*i+2] = 0;
        out[4*i+3] = 0;
    }
}

// converts the input uint8_t array to an int array
__global__ void convert(uint8_t* in, int* out, unsigned int size)
{
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if(i < size)
    {
        out[i] = (int) in[4*i];
    }
}

// converts the input int array to a uint8_t array 
__global__ void convert_back(int* in, uint8_t* out, unsigned int size)
{
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if(i < size)
    {
        out[i] = (uint8_t) in[i];
    }
}



void histogram(unsigned int* input, uint8_t* bins, unsigned int num_elements, unsigned int num_bins) 
{

    int BLOCK_SIZE = (int) num_bins;
    BLOCK_SIZE = 512;
    dim3 dim_grid, dim_block;
    dim_block.x = BLOCK_SIZE; dim_block.y = dim_block.z = 1;
        dim_grid.x = 1+(num_elements-1)/BLOCK_SIZE; dim_grid.y = dim_grid.z = 1;

    // create an array of uint8_t to be converted into an array of int
    uint8_t* bins_unpacked;
    cudaMalloc((void**)&bins_unpacked, 4 * num_bins * sizeof(uint8_t));

    // unpack the input uint8_t array
    unpack<<<dim_grid,dim_block>>>(bins, bins_unpacked, num_bins);

    // need an int version of bins_d
    int* bins_int_d;
    cudaMalloc((void**)&bins_int_d, num_bins * sizeof(int));

    // convert the uint8_t array to an int array
    convert<<<dim_grid,dim_block>>>(bins_unpacked, bins_int_d, num_bins);   

    // run kernel and enforce saturation requirements
    int histo_private_size = num_bins;
    histo_kernel<<<dim_grid,dim_block,histo_private_size>>>(input, num_elements, bins_int_d, num_bins);
    enforce_saturation<<<dim_grid,dim_block>>>(bins_int_d,num_bins);

    // convert the int array back to uint8_t
    convert_back<<<dim_grid,dim_block>>>(bins_int_d, bins, num_bins);
}

While the function that calls this last histogram function is in main.cu (I did NOT make this second file--it was provided to me--also, I have been testing this on consistent data by compiling via make test-mode):

#include <stdio.h>
#include <stdint.h>

#include "support.h"
#include "kernel.cu"

int main(int argc, char* argv[])
{
    Timer timer;

    // Initialize host variables ----------------------------------------------

    #if TEST_MODE
    printf("\n***Running in test mode***\n"); fflush(stdout);
    #endif

    printf("\nSetting up the problem..."); fflush(stdout);
    startTime(&timer);

    unsigned int *in_h;
    uint8_t* bins_h;
    unsigned int *in_d;
    uint8_t* bins_d;
    unsigned int num_elements, num_bins;
    cudaError_t cuda_ret;

    if(argc == 1) {
        num_elements = 1000000;
        num_bins = 4096;
    } else if(argc == 2) {
        num_elements = atoi(argv[1]);
        num_bins = 4096;
    } else if(argc == 3) {
        num_elements = atoi(argv[1]);
        num_bins = atoi(argv[2]);
    } else {
        printf("\n    Invalid input parameters!"
           "\n    Usage: ./histogram            # Input: 1,000,000, Bins: 4,096"
           "\n    Usage: ./histogram <m>        # Input: m, Bins: 4,096"
           "\n    Usage: ./histogram <m> <n>    # Input: m, Bins: n"
           "\n");
        exit(0);
    }
    initVector(&in_h, num_elements, num_bins);
    bins_h = (uint8_t*) malloc(num_bins*sizeof(uint8_t));

    // TESTING
    for(unsigned int i = 0; i < num_bins; ++i) 
    {
        bins_h[i] = i;
        //printf("uint8_t Element %u: is %u \n", i, bins_h[i]);
    }



    stopTime(&timer); printf("%f s\n", elapsedTime(timer));
    printf("    Input size = %u\n    Number of bins = %u\n", num_elements,
        num_bins);

    // Allocate device variables ----------------------------------------------

    printf("Allocating device variables..."); fflush(stdout);
    startTime(&timer);

    cuda_ret = cudaMalloc((void**)&in_d, num_elements * sizeof(unsigned int));
    if(cuda_ret != cudaSuccess) FATAL("Unable to allocate device memory");
    cuda_ret = cudaMalloc((void**)&bins_d, num_bins * sizeof(uint8_t));
    if(cuda_ret != cudaSuccess) FATAL("Unable to allocate device memory");

    cudaDeviceSynchronize();
    stopTime(&timer); printf("%f s\n", elapsedTime(timer));

    // Copy host variables to device ------------------------------------------

    printf("Copying data from host to device..."); fflush(stdout);
    startTime(&timer);

    cuda_ret = cudaMemcpy(in_d, in_h, num_elements * sizeof(unsigned int),
        cudaMemcpyHostToDevice);
    if(cuda_ret != cudaSuccess) FATAL("Unable to copy memory to the device");

    cuda_ret = cudaMemset(bins_d, 0, num_bins * sizeof(uint8_t));
    if(cuda_ret != cudaSuccess) FATAL("Unable to set device memory");

    // TESTING
    //cuda_ret = cudaMemcpy(bins_d, bins_h, num_bins * sizeof(uint8_t),
    //    cudaMemcpyHostToDevice);
    //if(cuda_ret != cudaSuccess) FATAL("Unable to copy memory to the device");



    cudaDeviceSynchronize();
    stopTime(&timer); printf("%f s\n", elapsedTime(timer));

    // Launch kernel ----------------------------------------------------------
    printf("Launching kernel..."); fflush(stdout);
    startTime(&timer);

    histogram(in_d, bins_d, num_elements, num_bins);
    cuda_ret = cudaDeviceSynchronize();
    if(cuda_ret != cudaSuccess) FATAL("Unable to launch/execute kernel");

    stopTime(&timer); printf("%f s\n", elapsedTime(timer));

    // Copy device variables from host ----------------------------------------

    printf("Copying data from device to host..."); fflush(stdout);
    startTime(&timer);

    cuda_ret = cudaMemcpy(bins_h, bins_d, num_bins * sizeof(uint8_t),
        cudaMemcpyDeviceToHost);
    if(cuda_ret != cudaSuccess) FATAL("Unable to copy memory to host");

    cudaDeviceSynchronize();
    stopTime(&timer); printf("%f s\n", elapsedTime(timer));

    #if TEST_MODE
    printf("\nResult:\n");
    for(unsigned int binIdx = 0; binIdx < num_bins; ++binIdx) {
       printf("Bin %u: %u elements\n", binIdx, bins_h[binIdx]);
    }

    printf("\nElements Vec:\n");
    for(unsigned int i = 0; i < num_elements; ++i) {
        printf("Element %u: %u  is \n", i, in_h[i]);
    }



    #endif

    // Verify correctness -----------------------------------------------------

    printf("Verifying results..."); fflush(stdout);

    verify(in_h, bins_h, num_elements, num_bins);

    // Free memory ------------------------------------------------------------

    cudaFree(in_d); cudaFree(bins_d);
    free(in_h); free(bins_h);

    return 0;
}

You don't know it isn't an index out of bounds error. Your test code permits the compiler to assume the index isn't out of bounds with no consequences if it's not while your real code will crash if the index is out of bounds. — David Schwartz, Nov 07 '13 at 04:19
What GPU are you running this on? "UNABLE TO LAUNCH/EXECUTE KERNEL" must be a message you created in your program. Can we see the code that generates that message and the associated error checking? Can you provide the `nvcc` compile command line you are using to compile your program? There's nothing obviously wrong with the code snippet you've provided, which gets me back into the discussion about SSCCE.org code. Is there some reason you can't write a simple program around the code you've shown here, and post the whole program? — Robert Crovella, Nov 07 '13 at 04:22
@DavidSchwartz Thanks--wouldn't accessing histo_private[buffer[i]] crash the kernel if the index was out of bounds? — Erroldactyl, Nov 07 '13 at 04:25
What sort of variable is `histo_private`? I'm guessing it's `int` but can't really be sure. How about `buffer` ? Do we really need to play 20 questions to cover all the things you haven't shown? What's your objection to posting a complete program, exactly? — Robert Crovella, Nov 07 '13 at 04:25
@Erroldactyl Yes, but your code doesn't *require* an access therefore you can't be sure there is one. (The compiler is permitted to assume every access is within bounds. That line of code does nothing if the access is within bounds. Therefore the compiler is permitted to assume that line of code does nothing. Therefore it doesn't *require* an access. If you want to test if an access causes a crash, you must *ensure* there is an access by writing code that *requires* an access.) — David Schwartz, Nov 07 '13 at 04:27
Running your code with `cuda-memcheck` will help to discover if you have an out-of-bounds access. — Robert Crovella, Nov 07 '13 at 04:28
@RobertCrovella I apologize for not being able to post everything. This is part of an assignment and I don't want to post the entire .cu file that calls my kernel, but I will try to post some of it. — Erroldactyl, Nov 07 '13 at 04:30
I'm not asking you to post your work. Create a simple program around what you have shown. If you do, you'll likely discover what the issue is. — Robert Crovella, Nov 07 '13 at 04:35
@RobertCrovella I just posted most of my code. It's not my work that I'm reluctant to post but the code provided to me by my instructors. I have posted some of this. The error is generated when cudaDeviceSynchronize() != cudaSuccess — Erroldactyl, Nov 07 '13 at 04:39
@DavidSchwartz Ok--I guess I am just unsure why my code would access a particular element in the array in one scenario but not in the next when all I change is the result of that access. I have posted more code to clarify any confusion about what it does. — Erroldactyl, Nov 07 '13 at 04:44
@Erroldactyl Read my explanation over a few times until you understand it. It's *vital* to understand if you want to be a competent C programmer. This is the key part: "The compiler is permitted to assume every access is within bounds. That line of code does nothing if the access is within bounds. Therefore the compiler is permitted to assume that line of code does nothing. Therefore it doesn't require an access." — David Schwartz, Nov 07 '13 at 04:47
@DavidSchwartz OK. I think I understand what you mean now. Basically the compiler ignores this line to some extent because I am not doing anything with the variable a nor am I changing buffer or histo_private. I just replaced the atomicAdd line with histo_private[buffer[i]] = 1; and it caused an error so I assume it was in fact an index out of bounds error. — Erroldactyl, Nov 07 '13 at 04:54
David is saying that since your code as posted does nothing with `a` the compiler is in fact allowed to optimize that line of code by eliminating it -- and eliminating the access. As a result your code basically does nothing, and doesn't prove anything about your access patterns. What happens when you run your code with `cuda-memcheck` ? — Robert Crovella, Nov 07 '13 at 04:57
You still haven't posted a self-contained complete code. I don't know what's in `support.h` for example. You're not grasping the idea of SSCCE.org code. You also haven't put any [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) in the `histogram` function you wrote. — Robert Crovella, Nov 07 '13 at 05:02
Shouldn't this `int histo_private_size = num_bins;` be this: `int histo_private_size = num_bins * sizeof(int);`? — sgarizvi, Nov 07 '13 at 05:33

score 1 · Answer 1 · answered Nov 07 '13 at 05:13

1

Turns out that this was just an index out of bounds error. The element buffer[i] was greater than the length of histo_private. As another poster mentioned, this was not obvious due to the following artifact of the c compiler:

The compiler is permitted to assume every access is within bounds. That line of my test code did nothing if the access is within bounds and therefore the compiler is permitted to assume that line of code does nothing. Thus it didn't require an access so the successful run of the test code was misleading. Once that line was changed to where the variable hist_private was modified at buffer[i], runtime errors came about.

answered Nov 07 '13 at 05:13

Erroldactyl

383
1
7
17

How could the compiler have done otherwise? All of the memory in question is dynamically allocated, so the compiler couldn't know the size and enforce bounds, even if such code checking facilities existed. – talonmies Nov 07 '13 at 06:44
1

I'm going to beat a dead horse because this is a hot button for me in case you hadn't noticed. There's no way anyone could have discovered this out-of-bounds issue based on your initial posting. That is a strong motivation to suggest that folks who ask questions like these provide a complete reproducer, i.e. SSCCE.org, which is exactly the stated expectation of SO. – Robert Crovella Nov 07 '13 at 22:15
@talonmies It could easily do so at run time, which was presumably what the OP was expecting. (And there are ways to make C and C++ code do so.) – David Schwartz Nov 08 '13 at 01:44

atomicAdd causing error Unable to Launch/Execute Kernel

1 Answers1