I wrote a CUDA program on my own laptop, which has an Nvidia GTX 960M, and the code works without any problems. I implemented the error check described in this thread: What is the canonical way to check for errors using the CUDA runtime API? I also tested the code with cuda-memcheck, which reported 0 errors.
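For reference, the check I use is essentially the macro from that thread (reproduced here as a sketch from memory; it is what produces the GPUassert line in the output further down):

#include <cstdio>
#include <cstdlib>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    // Print the CUDA error string with file/line context and optionally abort
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}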
I now want to test my code on a server that has an Nvidia Titan X. There, however, cudaPeekAtLastError() reports the error:

an illegal memory access was encountered
On both my laptop and the server I set the heap size with the following call:
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024 * 1024 * 1024);
and launch the kernel with the following block and grid configuration:
int blockSize = 128;
int numBlocks = (nPossibilities + blockSize - 1) / blockSize;
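The launch itself looks roughly like this; the kernel name fun and its signature are taken from the cuda-memcheck output below, but the argument names here are placeholders rather than my real variable names:

// Sketch of the launch; argument names are placeholders
fun<<<numBlocks, blockSize>>>(d_a, d_b, d_c, d_d, d_e, d_f, n1, n2, nPossibilities);
cudaDeviceSynchronize();
gpuErrchk(cudaPeekAtLastError()); // this is the call that reports the illegal memory access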
The GTX 960M has compute capability 5.0, while the Titan X has 6.1, but according to the compute capability table (Wikipedia) both have a maximum of 32 active blocks and a maximum of 2048 threads per multiprocessor.
I ran cuda-memcheck on the server, and it shows that the illegal memory access is caused by a null pointer.
To solve the problem, I increased the heap size allocation from 1 GB to 2048^3 bytes (which is actually 8 GB, not the 2 GB I originally wrote) using the following lines, and the problem was solved:
const size_t malloc_limit = size_t(2048) * size_t(2048) * size_t(2048);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, malloc_limit);
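As a sanity check, the limit that was actually set can be read back with cudaDeviceGetLimit (a minimal sketch, reusing the gpuErrchk macro from above):

size_t heapSize = 0;
// Query the current device-side malloc/new heap limit
gpuErrchk(cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize));
printf("cudaLimitMallocHeapSize = %zu bytes\n", heapSize);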
My question is: why does this problem arise on the Titan X but not on the 960M? Why do I need to increase the allocated heap size for the Titan X but not for the 960M?
I can post my code if requested, but it is a large program with several function calls inside the kernel.
The cuda-memcheck output is below:
GPUassert: unspecified launch failure all.cu 779
========= CUDA-MEMCHECK
========= Invalid __global__ write of size 8
========= at 0x00001130 in /home/osa/cuda/all.cu:186:fun(double*, double*, double*, double*, double*, double*, int, int, int)
========= by thread (125,0,0) in block (193,0,0)
========= Address 0x00000000 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2fe) [0x282a4e]
========= Host Frame:./all [0x1dac1]
========= Host Frame:./all [0x382d3]
========= Host Frame:./all [0x9508]
========= Host Frame:./all [0x93c0]
========= Host Frame:./all [0x942d]
========= Host Frame:./all [0x8d7a]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
========= Host Frame:./all [0x2999]
=========
========= Invalid __global__ write of size 8
========= at 0x00001130 in /home/osa/cuda/all.cu:186:fun(double*, double*, double*, double*, double*, double*, int, int, int)
========= by thread (124,0,0) in block (193,0,0)
========= Address 0x00000000 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2fe) [0x282a4e]
========= Host Frame:./all [0x1dac1]
========= Host Frame:./all [0x382d3]
========= Host Frame:./all [0x9508]
========= Host Frame:./all [0x93c0]
========= Host Frame:./all [0x942d]
========= Host Frame:./all [0x8d7a]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
========= Host Frame:./all [0x2999]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 [0x391b13]
========= Host Frame:./all [0x3c2c6]
========= Host Frame:./all [0x8d83]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
========= Host Frame:./all [0x2999]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaPeekAtLastError.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 [0x391b13]
========= Host Frame:./all [0x39b93]
========= Host Frame:./all [0x8d88]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
========= Host Frame:./all [0x2999]
=========
========= ERROR SUMMARY: 4 errors
My code computes the total number of possible combinations for a number of at most 19 digits. This count determines the total number of threads. The number of possibilities is given by (2^n) - 1, so for 9 digits it is 511 and the program executes 511 threads in total.
Although I choose a blockSize of 128 for the kernel configuration, I also pass the number of possibilities (nPossibilities) as a parameter, and inside the kernel I guard the work as follows (idx is the usual global thread index):
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx > 0 && idx < nPossibilities)
{
    // Do something
}
On the server, the code works up to 15 digits, which corresponds to 32,767 threads; 16 digits and above produce the error posted in the question. For 16 digits it is 65,535. Does that mean that on the Titan X ~32,000 threads in flight require 1 GB of heap, and above that I need to allocate more? But for 19 digits I will need 524,287 threads in total, which is a lot! So how is 1 GB enough for ~32,000 threads, while 8 GB is enough for ~524,000 threads?
The size of the variables I allocate using new inside the kernel also depends on the number of digits. I roughly calculated the size of the allocated variables per thread: for 15 digits it is about 0.032 MB, for 16 digits 0.034 MB, and for 19 digits 0.0415 MB.
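The out-of-bounds address in the cuda-memcheck output is 0x00000000, which matches device-side new returning nullptr once the heap is exhausted. A minimal sketch of the allocation pattern inside the kernel (the signature is taken from the log above, but the body, names, and allocation size are placeholders, not my actual code):

__global__ void fun(double *a, double *b, double *c, double *d, double *e, double *f,
                    int n1, int n2, int nPossibilities)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx > 0 && idx < nPossibilities)
    {
        // Device-side new draws from the cudaLimitMallocHeapSize heap and
        // returns nullptr (it does not throw) when that heap is exhausted
        double *buf = new double[64]; // placeholder size; mine depends on the digit count
        if (buf == nullptr)
            return; // writing through the null pointer is the reported invalid write at 0x0

        // ... work with buf ...

        delete[] buf; // free the per-thread allocation when done
    }
}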