CUDA runtime error: unspecified launch failure & Out-of-range Shared or Local Address

Question

I was running this program on one GPU with 1GB global memory. It gave the following error:

Fatal error: cudaMemcpy1 error (unspecified launch failure at CheckDevice.cu:27)
*** FAILED - ABORTING
========= Out-of-range Shared or Local Address
=========     at 0x000006a8 in grid::SetSubgridMarker(grid*, grid*)
=========     by thread (0,0,0) in block (0,0,0)
=========     Device Frame:SetAllFlags_dev(param_t*, grid*) (SetAllFlags_dev(param_t*, grid*) : 0x108)
=========     Device Frame:SetAllFlags(param_t*, grid*) (SetAllFlags(param_t*, grid*) : 0x38)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x3dc) [0xc9edc]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 [0xa18a]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 (cudaLaunch + 0x17f) [0x2f4cf]
=========     Host Frame:Transport [0xd395]
=========     Host Frame:Transport [0xd7bd]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:Transport [0x17bd]
=========
========= Program hit error 4 on CUDA API call to cudaMemcpy 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26a180]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 (cudaMemcpy + 0x271) [0x348e1]
=========     Host Frame:Transport [0x2cea]
=========     Host Frame:Transport [0x3769]
=========     Host Frame:Transport [0xd7ee]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:Transport [0x17bd]
=========
========= Program hit error 4 on CUDA API call to cudaGetLastError 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x26a180]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.5.0 (cudaGetLastError + 0x1e6) [0x2a046]
=========     Host Frame:Transport [0x2cef]
=========     Host Frame:Transport [0x3769]
=========     Host Frame:Transport [0xd7ee]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:Transport [0x17bd]
=========
========= ERROR SUMMARY: 3 errors

For the unspecified launch failure error, the relevant code is a cudaMemcpy call:

cudaMemcpy(CurrentGrid, Grid_dev, sizeof(grid), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMemcpy1 error");

The cuda-memcheck output also reports Out-of-range Shared or Local Address at 0x000006a8 in grid::SetSubgridMarker(grid*, grid*). Is this caused by running out of global memory on the device? Is there a way to query the memory usage on the device?

In the source code, CheckDevice.cu is executed after grid::SetSubgridMarker, and checkDevice does not use much device memory, so I'm guessing (without much confidence) that grid::SetSubgridMarker exhausts the memory, leaving no room for the cudaMemcpy operation. Any suggestions? Thanks very much!

Robert Crovella · Accepted Answer · 2017-07-02T00:54:16.960

The unspecified launch failure is not due to the cudaMemcpy operation. It is a "left over" error from the kernel launch that immediately preceded that operation.

The kernel is probably failing because of the out-of-bounds memory access that cuda-memcheck is reporting.

You should inspect your kernel code in grid::SetSubgridMarker for an invalid access to shared or local memory.

None of this means you are running out of global memory on the device.
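On the question of how to query device memory usage: the runtime call cudaMemGetInfo reports the free and total global memory of the current device. A minimal sketch follows; the helper name reportDeviceMemory is made up for illustration and is not part of the original program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: fetches free and total global memory (in bytes)
// for the current device. Returns true on success.
bool reportDeviceMemory(size_t *free_bytes, size_t *total_bytes) {
    cudaError_t err = cudaMemGetInfo(free_bytes, total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    if (!reportDeviceMemory(&free_bytes, &total_bytes)) return 1;
    // Shift by 20 bits to convert bytes to MiB.
    printf("device memory: %zu MiB free of %zu MiB\n",
           free_bytes >> 20, total_bytes >> 20);
    return 0;
}
```

Calling something like this before and after the suspect kernel would show whether free memory actually changes, which it should not if the fault is an out-of-range access.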

If I have an array in C like this:

int C[5];

And then I try to access an element like this:

int temp = C[6];

That is an out-of-bounds access. You are accessing beyond the end of your defined variable storage. It does not mean you are "running out of memory".

Something like this is going on in your grid::SetSubgridMarker code. You need to find out what that is and fix it. cuda-memcheck is also giving you a clue by telling you that thread (0,0,0) in block (0,0,0) is making this illegal access. By looking carefully at how this thread is indexing into data stored in local or shared memory, you should be able to discover the error.
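As a rough illustration of the kind of indexing to look for, here is a sketch of a kernel that reads a shared array at a computed index. Everything in it (readTile, runReadTile, TILE, the input values) is hypothetical and only mirrors the int C[5] example; the real SetSubgridMarker code is not shown in the question. The guard on src is what keeps the access inside the array; without it, the last thread would read past the end.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 5  // hypothetical array size, mirroring int C[5]

// Each thread reads tile[threadIdx.x + offset]. The range check prevents
// an out-of-range shared memory access when the index leaves [0, TILE).
__global__ void readTile(const int *in, int *out, int offset) {
    __shared__ int tile[TILE];
    int idx = threadIdx.x;
    if (idx < TILE) tile[idx] = in[idx];
    __syncthreads();
    if (idx < TILE) {
        int src = idx + offset;
        out[idx] = (src >= 0 && src < TILE) ? tile[src] : -1;  // -1 marks "out of range"
    }
}

// Hypothetical host wrapper: runs the kernel on the fixed input {10,20,30,40,50}.
bool runReadTile(int offset, int *host_out) {
    const int host_in[TILE] = {10, 20, 30, 40, 50};
    int *d_in = NULL, *d_out = NULL;
    bool ok = cudaMalloc(&d_in, sizeof(host_in)) == cudaSuccess &&
              cudaMalloc(&d_out, sizeof(host_in)) == cudaSuccess &&
              cudaMemcpy(d_in, host_in, sizeof(host_in),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        readTile<<<1, TILE>>>(d_in, d_out, offset);
        // Synchronize so that any fault raised inside the kernel surfaces here.
        ok = cudaDeviceSynchronize() == cudaSuccess &&
             cudaMemcpy(host_out, d_out, sizeof(host_in),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main() {
    int out[TILE];
    if (!runReadTile(1, out)) return 1;
    for (int i = 0; i < TILE; ++i) printf("%d ", out[i]);
    printf("\n");  // prints: 20 30 40 50 -1
    return 0;
}
```

Building with nvcc's -lineinfo option lets cuda-memcheck attribute a fault like this to a source line rather than only to an instruction address.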

You can also use a method such as described here to have cuda-memcheck identify the specific line of kernel code that is generating the fault.

Hi Robert, thanks for the answer. Now I'm trying to use cuda-gdb to debug the code, but it keeps complaining "all cuda devices are used for display and cannot be used while debugging", although I already stopped kdm with "sudo service kdm stop" and logged in from tty1. I've been searching for hours but haven't found an answer. What's the problem? My OS is kubuntu 12.04 and GPU is GeForce GTX 550i. Thanks again - Stone, Mar 13 '13 at 00:28
Ubuntu 12.04 is not an [officially supported OS](http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#linux) for CUDA 5, so there may be difficulties with it. You can try booting your OS into runlevel 3 instead of runlevel 5, [to prevent X from starting](http://docs.nvidia.com/cuda/cuda-gdb/index.html#single-gpu-debugging). That may help. - Robert Crovella, Mar 13 '13 at 00:35

score 2 · Answer 2 · answered Mar 12 '13 at 20:06

It is an out-of-range exception, not an out-of-memory one. This means that you are accessing memory that is not part of any valid (i.e. statically or dynamically allocated) memory range. The most common causes are an array offset that is too large or negative, or a pointer that has not been properly initialized.

As the message states, the error is caused inside grid::SetSubgridMarker(). However, since kernel launches are asynchronous, the error is not reported until the next CUDA call, which happens to be cudaMemcpy().
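One way to make the error show up at the kernel that caused it, rather than at a later cudaMemcpy, is to check for errors right after the launch and again after a synchronization. The sketch below assumes a stand-in kernel (fill) and a made-up helper (launchAndCheck); neither comes from the original program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real one.
__global__ void fill(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Hypothetical helper: returns the first error from either the launch
// itself or from the kernel's execution.
cudaError_t launchAndCheck(int *d_data, int n, int value) {
    fill<<<(n + 127) / 128, 128>>>(d_data, n, value);
    cudaError_t err = cudaGetLastError();  // catches invalid launch configurations
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();        // catches faults raised while the kernel ran
}

int main() {
    const int n = 256;
    int *d_data = NULL;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) return 1;

    cudaError_t err = launchAndCheck(d_data, n, 7);
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_data);
        return 1;
    }

    int host[n];
    err = cudaMemcpy(host, d_data, sizeof(host), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_data);
        return 1;
    }
    printf("host[0] = %d, host[%d] = %d\n", host[0], n - 1, host[n - 1]);
    cudaFree(d_data);
    return 0;
}
```

The extra synchronization costs performance, so it is usually something to enable while debugging rather than leave in production code.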
