CUDA illegal memory access was encountered

Question

I wrote a CUDA program on my own laptop which has Nvidia GTX 960M. The code works without any problems. I also implemented the error check which can be found in this thread: What is the canonical way to check for errors using the CUDA runtime API?

and also tested the code using the cuda-memcheck, which has reported 0 errors.

I want to test my code on a server which has an Nvidia Titan X. However cudaPeekAtLastError() throws the error:

illegal memory access was encountered

For both my laptop and the server I am using the following heap allocation

cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024 * 1024 * 1024);

and running the following threads and blocks:

int blockSize = 128;
int numBlocks = (nPossibilities + blockSize - 1) / blockSize;

GTX 960M has a compute capability of 5, while Titan X has 6.1 but both according the the compute capability table (Wikipedia) have a maximum of 32 active blocks and a maximum of 2048 threads per multiprocessor.

I ran the cuda-memcheck on the server and the problem of illegal memory access is due to a null pointer.

In order to solve the problem, I have increased the heap memory size allocation from 1GB to 2GB using the following lines and the problem was solved:

const size_t malloc_limit = size_t(2048) * size_t(2048) * size_t(2048);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, malloc_limit);

My question is why does this problem arises on the Titan X but doesn't occur on 960M? Why do I need to increase the heap memory size allocated for Titan X but not for 960M?

I can post my code if requested, but it's a big code with several function calls inside the kernel.

The error after cuda-memcheck is below:

GPUassert: unspecified launch failure all.cu 779
========= CUDA-MEMCHECK
========= Invalid __global__ write of size 8
=========     at 0x00001130 in /home/osa/cuda/all.cu:186:fun(double*, double*, double*, double*, double*, double*, int, int, int)
=========     by thread (125,0,0) in block (193,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2fe) [0x282a4e]
=========     Host Frame:./all [0x1dac1]
=========     Host Frame:./all [0x382d3]
=========     Host Frame:./all [0x9508]
=========     Host Frame:./all [0x93c0]
=========     Host Frame:./all [0x942d]
=========     Host Frame:./all [0x8d7a]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
=========     Host Frame:./all [0x2999]
=========
========= Invalid __global__ write of size 8
=========     at 0x00001130 in /home/osa/cuda/all.cu:186:fun(double*, double*, double*, double*, double*, double*, int, int, int)
=========     by thread (124,0,0) in block (193,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2fe) [0x282a4e]
=========     Host Frame:./all [0x1dac1]
=========     Host Frame:./all [0x382d3]
=========     Host Frame:./all [0x9508]
=========     Host Frame:./all [0x93c0]
=========     Host Frame:./all [0x942d]
=========     Host Frame:./all [0x8d7a]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
=========     Host Frame:./all [0x2999]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 [0x391b13]
=========     Host Frame:./all [0x3c2c6]
=========     Host Frame:./all [0x8d83]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
=========     Host Frame:./all [0x2999]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaPeekAtLastError.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 [0x391b13]
=========     Host Frame:./all [0x39b93]
=========     Host Frame:./all [0x8d88]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
=========     Host Frame:./all [0x2999]
=========
========= ERROR SUMMARY: 4 errors

In my code the total number of possibilities for a combination of at most 19 digit number is calculated. This number determines the total number of threads. The possibilities are calculated by (2^n)-1 so if I choose 9 digits it will be 511, so the process will execute 511 threads in total.

Although for kernel configuration I choose the blocksize to be 128, I also give the number of possibilities (nPossibilities) as a parameter and inside the kernel I do the following:

if (idx > 0 && idx < nPossibilities)
{
 //Do something
}

On the server the code works up till 15 digits which corresponds to 32,767. 16 and above results in the error posted in the question. For 16 it will be 65,536. Does that mean for the Titan Xp ~32,000 threads in flight require 1GB of heap and above that I need to allocate more? But for 19 digits I will need 524,287 threads in total! Which is a lot! So how is 1GB enough for ~32,000 threads, while 2GB are enough for ~524,000 threads?

The size of the variables I allocate using new inside the Kernel, depends also on the number of digits. I roughly calculated the size of the allocated variables and for 15 digits it is 0.032MB, for 16 0.034MB and for 19 0.0415MB

Robert Crovella · Accepted Answer · 2020-09-08T20:50:08.617

Because the Titan Xp supports more threads "in flight" than a 960M.

Presumably in your CUDA device code, you are doing something like malloc or new (and hopefully also free or delete). These allocate out of the device heap, the size of which is controlled by the CUDA runtime API call you are using for this: cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024 * 1024 * 1024);.

The maximum number of threads that can be running at a given time on either of these GPUs is given by 2048*number of SMs. Even if the occupancy number for your particular code is less than 2048 per SM, the number (max occupancy threads per SM) is probably the same whether on 960M or Titan Xp.

Therefore the total number of threads in flight is determined by the number of SMs. The 960M has 5 SMs, so it can have at most 2048x5 = ~10,000 threads in flight (i.e. in some stage of execution). The Titan Xp has 30 SMs, so it can have 2048x30 = ~60,000 threads in flight. That means that if each thread did a malloc of a particular size, and then later a free, you could have 10,000 of those allocations outstanding at any point in time on a 960M, but 60,000 of those outstanding at any point in time on a Titan Xp. More allocations outstanding = more demand on (device heap) memory.

So it's quite possible that you will need more available space for the device heap, on Titan Xp vs. 960M.

Thanks for your answer! Your explanation makes sense. However, based on your answer, I feel there is some missing information on my part, so I edited my question. Would be helpful if you read that and possibly edit your answer based on the new information, if your feel like anything else is needed. Also, since Titan Xp can at most run 60,000 threads in flight, should I expect it to be 6 times faster than the 960M, if both run their maximum number of threads? — ValSe, Sep 08 '20 at 22:04

CUDA illegal memory access was encountered

1 Answers1