CUDA Kernel is crashing without any reason with 20k+ threads

Question

I am working on a CUDA project and have run into a serious problem that I can't find a solution for.

I implemented the project, and on my PC (pA), which has an NVIDIA Quadro K2000M, it works. But when I deploy the project on a cluster with an NVIDIA Tesla GPU, or on another PC (pB) with an NVIDIA GTX 960M, it won't execute!

The interesting thing is that when I use the Nsight debugger in Visual Studio on pB (the second PC), the program runs and does not show the error: Unspecified launch failure

This is the code of the first kernel:

__global__ void calcKernel(float *dev_calcMatrix,

                        int *documentarray,
                        int *documentTermArray,
                        int *distincttermsarray,
                        int *distinctclassarray,
                        int *startingPointOfClassDoc,
                        int *endingPOintOfClassDoc,
                        int sizeOfDistinctClassarray,
                        int sizeOfTerms)
{

 int index = blockIdx.x * blockDim.x + threadIdx.x;

int term = distincttermsarray[index];

if (index <= sizeOfTerms) {

    for (int i = 0; i < sizeOfDistinctClassarray; i++)
    {
        int save = (index * sizeOfDistinctClassarray) + i;
        bool test = false;
        for (int j = startingPointOfClassDoc[i]; j <= endingPOintOfClassDoc[i]; j++)
        {
            if (term == documentarray[j])
            {
                printf("%i \t", index);
                dev_calcMatrix[save] = dev_calcMatrix[save] + documentTermArray[j];

                //printf("TermArray: documentTermArray[j] %d\n", dev_calcMatrix[save], documentTermArray[j]);

                test = true;
            }
        }

        if (!test) dev_calcMatrix[save] = 0;


    }
}
}

This is the code I am using to create the threads and blocks:

float blockNotFinal = data.sizeOfDistinctTerms / 1024;
int threads = 0;
int  blocks = (int)floor(blockNotFinal);

dim3 dimGrid((blocks + 1), 1, 1);
if (data.sizeOfDistinctTerms < 1024)
{
    threads = data.sizeOfDistinctTerms;
}
else
{
    threads = 1024;
}
dim3 dimBlock(threads, 1, 1);

So, I need to create 23,652 threads. What I am doing is 23,652 / 1024 = 23.09. I round 23.09 down to 23 and add 1, which gives 24 blocks. So I am creating 24 blocks * 1024 threads = 24,576 threads.

I know that some threads will be created even though they won't be used, and that's why I added this if statement at the beginning of the kernel:
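
The block count described above can also be computed with integer arithmetic only. The following is a minimal sketch, not the code from the project: `blocksFor` is a hypothetical helper name, and 1024 threads per block is taken from the launch code above. Unlike "round down and add 1", it does not add an extra, fully idle block when the element count is an exact multiple of the block size.

```cpp
// Hypothetical helper (not in the original code): number of blocks needed to
// cover n elements with threadsPerBlock threads each, using integer
// ceiling division instead of float math and floor().
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

For the sizes in this question, `blocksFor(23652, 1024)` gives 24 blocks, the same as the float-based calculation, while `blocksFor(2048, 1024)` gives 2 rather than 3.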

int index = blockIdx.x * blockDim.x + threadIdx.x;

if (index <= sizeOfTerms (23,652 is the size)) { .... }

The problem is that I added some printf() calls before the if statement and inside it.

Before the if statement, the maximum thread index printed before the crash was 24479. Inside the if statement, the maximum thread index printed before the crash was 23487.

So, from the information above, the thread indices never reach the maximum. Also, on the cluster I get a different error: Illegal memory access encountered. I know this error can mean an out-of-bounds index, but I am allocating the arrays with the same size as the number of threads.
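
Since the kernel's inner loop reads documentarray[j] for every j from startingPointOfClassDoc[i] to endingPOintOfClassDoc[i] inclusive, one thing that could be checked on the host before the launch is whether each of those ranges fits inside the sizeOfTotalTermsDocsFreq elements that were allocated. The following is only a sketch under that assumption: `firstBadClassRange` is a hypothetical helper that does not exist in the project.

```cpp
// Hypothetical host-side check (not part of the original program).
// Returns the index of the first class whose inclusive range
// [start[i], end[i]] would read outside an array of numEntries elements,
// or -1 if every range is in bounds.
int firstBadClassRange(const int* start, const int* end,
                       int numClasses, int numEntries) {
    for (int i = 0; i < numClasses; ++i) {
        // An empty range (start > end) never executes the kernel's inner loop.
        if (start[i] > end[i]) continue;
        // The loop bound is inclusive, so end[i] itself must be a valid index.
        if (start[i] < 0 || end[i] >= numEntries) return i;
    }
    return -1;
}
```

It would be called with the host copies of the two arrays, data.sizeOfDistinctClassarray as numClasses, and data.sizeOfTotalTermsDocsFreq as numEntries.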

Here is the code where I allocate the memory in the GPU:

cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
    goto Error;
}

cout << "\n Allocated GPU buffers";
// Allocate GPU buffers for input and output vectors
cudaStatus = cudaMalloc((void**)&dev_calcMatrix, data.sizeOfDistinctTerms * data.sizeOfDistinctClassarray * sizeof(float));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    goto Error;
}
cudaStatus = cudaMalloc((void**)&dev_probMatrix, data.sizeOfDistinctTerms * data.sizeOfDistinctClassarray * sizeof(float));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    goto Error;
}

cudaStatus = cudaMalloc((void**)&classSummationTerms, data.sizeOfDistinctClassarray * sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    goto Error;
}

cudaStatus = cudaMalloc((void**)&documentarray, data.sizeOfTotalTermsDocsFreq * sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    goto Error;
}

cudaStatus = cudaMalloc((void**)&documentTermArray, data.sizeOfTotalTermsDocsFreq * sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    goto Error;
}

cudaStatus = cudaMalloc((void**)&distincttermsarray, data.sizeOfDistinctTerms * sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    goto Error;
}

cudaStatus = cudaMalloc((void**)&distinctclassarray, data.sizeOfDistinctClassarray * sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    goto Error;
}

cudaStatus = cudaMalloc((void**)&startingPointOfClassDoc, data.sizeOfDistinctClassarray * sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    goto Error;
}

cudaStatus = cudaMalloc((void**)&endingPOintOfClassDoc, data.sizeOfDistinctClassarray * sizeof(int));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    goto Error;
}

cout << "\n Copied input vectors from host to GPU";
// Copy input vectors from host memory to GPU buffers.
cudaStatus = cudaMemcpy(documentarray, data.documentarray, data.sizeOfTotalTermsDocsFreq * sizeof(int), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed!");
    goto Error;
}

cudaStatus = cudaMemcpy(documentTermArray, data.documentTermArray, data.sizeOfTotalTermsDocsFreq * sizeof(int), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed!");
    goto Error;
}

cudaStatus = cudaMemcpy(distincttermsarray, data.distincttermsarray, data.sizeOfDistinctTerms * sizeof(int), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed!");
    goto Error;
}

cudaStatus = cudaMemcpy(classSummationTerms, data.classSummationTerms, data.sizeOfDistinctClassarray * sizeof(int), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed!");
    goto Error;
}

cudaStatus = cudaMemcpy(distinctclassarray, data.distinctclassarray, data.sizeOfDistinctClassarray * sizeof(int), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed!");
    goto Error;
}

cudaStatus = cudaMemcpy(startingPointOfClassDoc, data.startingPointOfClassDoc, data.sizeOfDistinctClassarray * sizeof(int), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed!");
    goto Error;
}

cudaStatus = cudaMemcpy(endingPOintOfClassDoc, data.endingPOintOfClassDoc, data.sizeOfDistinctClassarray * sizeof(int), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed!");
    goto Error;
}


cout << "\n Now we call the CALCKERNL()";
// Launch a kernel on the GPU with one thread for each element.
calcKernel <<<dimGrid, dimBlock >>>(dev_calcMatrix,
                            documentarray, 
                            documentTermArray, 
                            distincttermsarray, 
                            distinctclassarray, 
                            startingPointOfClassDoc, 
                            endingPOintOfClassDoc,
                            sizi,
                            sizeOfTerms);

//// cudaDeviceSynchronize waits for the kernel to finish, and returns
//// any errors encountered during the launch.
//cudaStatus = cudaDeviceSynchronize();
//if (cudaStatus != cudaSuccess) {
//  fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
//  goto Error;
//}

cudaStatus = cudaStreamSynchronize(0);
if (cudaStatus != cudaSuccess) {
    //fprintf(stderr, "calcKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
    cout << "\n Synchronization failed: " << cudaGetErrorString(cudaStatus);
    goto Error;
}
// Check for any errors launching the kernel
cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "calcKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
    goto Error;
}

Any idea why this is happening?

Nope, 23,652 would have been ok. The problem is, he is running 23,653 threads. — tera, Dec 08 '16 at 20:29
I think you're going to have a very hard time manufacturing a [mcve]. — user4581301, Dec 08 '16 at 20:38
Is the number of threads limited on the GPU? Is this limit different for different GPUs? — Ripi2, Dec 08 '16 at 20:41
You can use the method described [here](http://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218#27278218) to localize the illegal memory access error down to a single line of code. If necessary, you can then use in-kernel `printf` or other methods (e.g. debugger) to help understand why that line of code is generating the illegal access. — Robert Crovella, Dec 08 '16 at 22:20

score 1 · Answer 1 · edited May 23 '17 at 12:01


Without a Minimal, Complete, and Verifiable example, or even complete code, it is impossible to answer. But already the beginning of your kernel has two bugs probably leading to out-of-bounds memory accesses:

    int index = blockIdx.x * blockDim.x + threadIdx.x;

    int term = distincttermsarray[index];

    if (index <= sizeOfTerms) {

First, using index as an array index is unsafe before checking it is within the desired range. Second, the check needs to be for index < sizeOfTerms (not <=) if sizeOfTerms is the number of array elements.
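
The effect of the second bug can be counted on the host. The sketch below is an illustration only: it emulates the index calculation of a 1-D launch in plain C++ (real kernels run on the GPU), and `threadsPassingGuard` is a hypothetical name. Note that in the kernel itself the check also has to come before the read of distincttermsarray[index].

```cpp
// Host-side emulation (an assumption for illustration, not GPU code).
// Counts how many threads of a 1-D launch pass the bounds check.
int threadsPassingGuard(int gridDimX, int blockDimX, int n, bool inclusive) {
    int passing = 0;
    for (int block = 0; block < gridDimX; ++block) {
        for (int thread = 0; thread < blockDimX; ++thread) {
            int index = block * blockDimX + thread;  // same formula as the kernel
            // inclusive == true mimics "index <= n", false mimics "index < n".
            bool inRange = inclusive ? (index <= n) : (index < n);
            if (inRange) ++passing;
        }
    }
    return passing;
}
```

With 24 blocks of 1024 threads and n = 23652, the `<=` form lets 23,653 threads through, one more than the number of array elements, while the `<` form lets exactly 23,652 through.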

answered Dec 08 '16 at 21:51 by tera · edited May 23 '17 at 12:01 by Community

Wow, I really made some bad mistakes there :(. I fixed what you pointed out, but I still have the same problem :( I did a clean in Visual Studio, because it sometimes gives me really hard problems, but it still didn't work. I compiled it and ran it on the cluster that I mentioned, but still the same problem :( – user3774470 Dec 09 '16 at 00:09

Run your code under cuda-memcheck and fix the problems pointed out. Prepare an MCVE. Then we'll see. – tera Dec 09 '16 at 02:34

score 0 · Answer 2 · answered Aug 24 '17 at 21:34

An easy way to find and fix this error is to turn cuda-memcheck on, as suggested by @tera, and run the code under the CUDA debugger without a single breakpoint. The debugger should stop at the very instant the error occurs.

My suggestion is Nsight + Visual Studio with TDR (Timeout Detection and Recovery) off, so it won't be a problem if the illegal access takes some time to happen.
