I'm new to CUDA, and so far it is driving me mad. Running the following code produces "CUDA error: unspecified launch failure". I can't find the cause of the error; the only thing I have noticed is that if I lower the iteration count of the main loop by, say, two orders of magnitude, it runs fine. Here is the example (don't look for logic behind these operations; it is heavily simplified code, made purely for demonstration).

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

__global__ void Test()
{
    int table0[256];
    int table1[256];
    int table2[256];
    int table3[256];

    for (int i = 0; i < 256; i++)
    {
        table0[i] = i;
        table1[i] = i;
        table2[i] = i;
        table3[i] = i;
    }

    int input[4];
    for (int i = 0; i < 4; i++)
        input[i] = i;

    int res0, res1, res2, res3;
    for (int i = 0; i < 10000000; i++)
    {
        res0 = table0[(unsigned char)(input[0] >> 24)] ^ table1[(unsigned char)(input[1] >> 16)] ^ table2[(unsigned char)(input[2] >> 8)] ^ table3[(unsigned char)(input[3])];
        res1 = table0[(unsigned char)(input[1] >> 24)] ^ table1[(unsigned char)(input[2] >> 16)] ^ table2[(unsigned char)(input[3] >> 8)] ^ table3[(unsigned char)(input[0])];
        res2 = table0[(unsigned char)(input[2] >> 24)] ^ table1[(unsigned char)(input[3] >> 16)] ^ table2[(unsigned char)(input[0] >> 8)] ^ table3[(unsigned char)(input[1])];
        res3 = table0[(unsigned char)(input[3] >> 24)] ^ table1[(unsigned char)(input[0] >> 16)] ^ table2[(unsigned char)(input[1] >> 8)] ^ table3[(unsigned char)(input[2])];

        input[0] = res0;
        input[1] = res1;
        input[2] = res2;
        input[3] = res3;
    }
}

cudaError_t TestWithCUDA()
{
    cudaError_t cudaStatus;
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }
    Test<<<1,1>>>();
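    // Launch-time errors are reported by cudaGetLastError(); errors raised while the
    // kernel runs are returned by cudaDeviceSynchronize(). Keep both in cudaStatus.
    cudaStatus = cudaGetLastError();
    if (cudaStatus == cudaSuccess)
        cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(cudaStatus));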

Error:
    cudaDeviceReset();
    return cudaStatus;
}

int main()
{
    cudaError_t cudaStatus = TestWithCUDA();

    if (cudaStatus != cudaSuccess)
        printf("The test has failed!\n");
    else
        printf("Done!\n");
    return 0;
}

Does anyone have any ideas?

Eugene
  • Unspecified launch failure is basically a segfault: you are trying to access data outside of allocated memory. If you are on Linux, compile your code with the -G option and run it through cuda-memcheck. It will give you the line where the error occurs. – X3liF Dec 15 '16 at 12:35
  • @X3liF Well, there is literally NO PLACE where the code could run past the allocated boundaries. All arrays are statically allocated and have 256 elements, and the array indexes are of unsigned char type, so their values can't get past the array boundaries. – Eugene Dec 15 '16 at 12:41
  • What OS are you running this on? Which GPU? Which CUDA version? How are you compiling it (i.e. what command line settings)? Are you running with cuda-memcheck? This kernel will take a very, very long time to run in debug mode. My guess is that you are running on Windows and compiling a debug project, and that you are hitting the Windows WDDM TDR timeout. – Robert Crovella Dec 15 '16 at 13:46
  • @Eugene My comment wasn't about whether or not you have a segfault, but to point out that you can use memcheck to find the line which is generating the error. – X3liF Dec 15 '16 at 14:41
  • The kernel modifies no global state. So in release mode, the compiler will optimize it down to an empty kernel. In debug mode this does not happen, and the kernel takes a very long time to execute, but there are no execution errors according to my test. Therefore I think there is no actual "line which is generating the error" in the kernel code, and the only detectable error occurs at the error check after the kernel launch. In some cases, a TDR timeout will show up as an unspecified launch failure. This is also consistent with "if I lower iterations count of the main loop ...,then it runs well." – Robert Crovella Dec 15 '16 at 14:51
  • @Robert Crovella Indeed, it was the WDDM TDR. I didn't even know about that thing. Thanks a lot. – Eugene Dec 15 '16 at 15:19
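
For anyone hitting the same timeout, here is a minimal sketch of one possible workaround, not a definitive fix. It first asks the runtime whether the device enforces a run time limit on kernels (the `kernelExecTimeoutEnabled` field of `cudaDeviceProp`), then splits the long loop into many short launches so that no single kernel runs long enough to trip the watchdog. The names `TestChunk`, `launchCount` and `perLaunch` are hypothetical, the chunk size is an arbitrary assumption that would need tuning for a given GPU and build mode, and the mixing step is only a stand-in for the table lookups in the question. The state is kept in global memory, so the work is observable and carries over between launches.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of launches needed to cover `total` iterations.
static int launchCount(long long total, int perLaunch)
{
    return (int)((total + perLaunch - 1) / perLaunch);
}

// Runs `iters` rounds of a stand-in mixing step (NOT the asker's table lookups).
// The state is read from and written back to global memory.
__global__ void TestChunk(unsigned int *state, int iters)
{
    unsigned int s0 = state[0], s1 = state[1], s2 = state[2], s3 = state[3];
    for (int i = 0; i < iters; i++)
    {
        unsigned int t0 = (s0 >> 24) ^ (s1 >> 16) ^ (s2 >> 8) ^ s3;
        s0 = s1; s1 = s2; s2 = s3; s3 = t0 + 1u;
    }
    state[0] = s0; state[1] = s1; state[2] = s2; state[3] = s3;
}

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Non-zero means the driver may abort kernels that run for too long.
    printf("Run time limit on kernels: %s\n", prop.kernelExecTimeoutEnabled ? "yes" : "no");

    unsigned int host[4] = {0u, 1u, 2u, 3u};
    unsigned int *dev = NULL;
    err = cudaMalloc((void **)&dev, sizeof(host));
    if (err == cudaSuccess)
        err = cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);

    const long long total = 10000000; // iteration count from the question
    const int perLaunch = 100000;     // assumed chunk size, tune as needed
    long long remaining = total;
    const int launches = launchCount(total, perLaunch);

    for (int k = 0; k < launches && err == cudaSuccess; k++)
    {
        // The last chunk may be shorter than perLaunch.
        int n = remaining < perLaunch ? (int)remaining : perLaunch;
        TestChunk<<<1, 1>>>(dev, n);
        err = cudaGetLastError();          // launch-time errors
        if (err == cudaSuccess)
            err = cudaDeviceSynchronize(); // errors during execution
        remaining -= n;
    }

    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);

    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    else
        printf("Final state: %u %u %u %u\n", host[0], host[1], host[2], host[3]);

    cudaFree(dev);
    cudaDeviceReset();
    return err == cudaSuccess ? 0 : 1;
}
```

Synchronizing after every launch keeps the error check close to the launch that caused it, at the cost of some launch overhead. The other options are running the compute job on a GPU that is not driving a display, or raising the TDR delay in Windows.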
