
In my employer's codebase, I'm trying to debug an "invalid memory access" error from cudaMemcpyAsync.

The function call is

CHECK_CUDA( cudaMemcpyAsync(A, B, sizeof(B), cudaMemcpyDeviceToHost, stream) )

where A and B are both int*, but B is allocated on the device with

cudaMalloc((void**) &B, sizeof(B))

When it says invalid memory access, what is it trying to access that is invalid? How can I find out what is being inappropriately accessed?

roulette01
  • `sizeof B` seems like a bad choice. That's the size of a pointer, but `cudaMalloc`, like CPU `malloc`, needs the size of the thing (which will be) pointed-to. – Ben Voigt Oct 06 '22 at 19:11
  • @BenVoigt yeah that seemed odd to me too, and I'm not sure why it was set like that. I changed it to `sizeof(int)` (think this gave the same result as `sizeof(int*)` on my system though) and the error still persists – roulette01 Oct 06 '22 at 19:23
  • It's doubtful that `sizeof(int)` and `sizeof(int*)` are the same. That would imply a 32-bit OS, which is "unlikely" but I cannot completely rule out the possibility. Neither of those choices however are responding in a proper way to the suggestion given. If `B` is a pointer, you want to allocate for how many objects of type of `B`, that `B` is (or should be) pointing to. As indicated, that is usually something like `sizeof(B[0])*number_of_items`. – Robert Crovella Oct 06 '22 at 19:31
  • @RobertCrovella oops, you're right. I was printing the wrong thing. they are, in fact, not the same on my system (64-bit). I don't actually see where `B` is allocated. I only see `cudaMalloc((void**) &B, sizeof(B))`. The codebase is actually based off of this nvidia example: https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuSPARSELt/spmma/spmma_example.cpp where `d_valid` in this example is what I refer to as `B` and `is_valid` is `A` – roulette01 Oct 06 '22 at 19:41
  • The example's trash, at least with respect to how it uses `sizeof`. (`sizeof` "some pointer") is wrong in both `malloc` and `memcpy` usage. – Ben Voigt Oct 06 '22 at 20:09
  • @BenVoigt do you know what it should be in the cudaMalloc call? – roulette01 Oct 06 '22 at 20:59
  • Based on https://docs.nvidia.com/cuda/cusparselt/functions.html#cusparseltspmmaprunecheck it seems to be a single scalar value, so it should be `(sizeof *d_valid)`. But I don't know if your code is using `B` in the same way. – Ben Voigt Oct 06 '22 at 21:10
  • Thanks for reporting the error in the sample code. I have filed an internal bug 3823512 at NVIDIA to have it looked at. Meanwhile: 1. you've already been given instructions for how to fix it 2. I don't think it's the source of the "invalid memory access" error, for the reasons I state in my answer. Naturally, since you have not provided a [mcve], my position here is tentative. That is my considered opinion of the situation, nothing more. – Robert Crovella Oct 06 '22 at 22:26
  • Also note that all usages of `sizeof(d_valid)` in that code are suspect, and should be rectified, at the same time, in a similar way. Replace all of them with, for example, `sizeof(*d_valid)`. There are multiple instances of this error in that code, all must be changed. – Robert Crovella Oct 06 '22 at 22:28
  • `sizeof(*d_valid)` has to be multiplied with the number of elements, too, if you want to allocate more than one element, see Robert's first comment for an example. – Sebastian Oct 07 '22 at 06:50
  • the incorrect usage around `d_valid` [here](https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuSPARSELt/spmma/spmma_example.cpp) has been rectified. – Robert Crovella Oct 13 '22 at 20:20

2 Answers


The invalid memory access error does not actually refer to the cudaMemcpyAsync operation itself, so studying that call alone is unlikely to yield anything useful.

CUDA uses an asynchronous reporting mechanism to report device code execution errors "at the next opportunity" via the host API. So the error you are seeing could refer to any kernel execution that took place prior to that call.
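A minimal sketch of this behavior (hypothetical names; requires a CUDA-capable GPU to run): the kernel launch below corrupts memory, but the launch itself returns immediately, so the error surfaces at the next runtime API call.

```cuda
#include <cstdio>

// Kernel deliberately writes past a 1-int allocation (threads 1..255).
__global__ void oob(int *p) { p[threadIdx.x] = threadIdx.x; }

int main() {
    int *d = nullptr;
    cudaMalloc(&d, sizeof(int));   // room for one int only
    oob<<<1, 256>>>(d);            // launch returns immediately: no error visible yet

    // This copy is itself well-formed, but it is the next API call after the
    // faulting kernel, so it is the call that ends up reporting the sticky
    // illegal/invalid memory access error.
    int h = 0;
    cudaError_t err = cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("cudaMemcpy: %s\n", cudaGetErrorString(err));
}
```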

To help localize the error, you can try setting the environment variable `CUDA_LAUNCH_BLOCKING=1` when you run your code, which forces each kernel launch to complete before control returns to the host. How useful this is will depend on exactly how the code is written, and whether any sort of error checking is being done after CUDA kernel launches. If you compile your code with `-lineinfo`, or even if you don't, you can get additional localization information about the problem using the method indicated here.
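For example (a sketch in the style of the session transcript below; `my_app` is a placeholder for your executable):

```shell
# Force each kernel launch to complete before the host continues, so the
# error is reported at the launch that caused it:
CUDA_LAUNCH_BLOCKING=1 ./my_app

# Build with device line info so debugging tools can map the fault back to a
# source line:
nvcc -lineinfo -o my_app my_app.cu
compute-sanitizer ./my_app
```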

The observation in the comment is a good one, and is perhaps an important clue to coding defects. I will note that:

  • albeit curious, as posted the transfer size (`sizeof(B)`) is consistent with the allocation size (also `sizeof(B)`), so the copy operation itself is unlikely to be throwing an error for that reason
  • based on my experience with CUDA error reporting (i.e. familiarity with error codes and their text translations) the "invalid memory access" error is attributable to a device code execution error. If the CUDA runtime can determine that a given transfer size is inconsistent with an allocation size, the error given will be "invalid argument".

You can take a look at section 12 in this online training series to get a more in-depth treatment of CUDA error reporting, as well as debugging suggestions.

Robert Crovella
  • ahh I see. I'm not sure how this is setup, but my codebase also tells me what line number the error originated from, and the line number given corresponds to the `cudaMemcpyAsync` call. does this change what you said about "The invalid memory access error does not actually refer to the cudaMemcpyAsync operation" ? – roulette01 Oct 06 '22 at 19:24
  • No it does not. That API call **is actually reporting the error** and so the line number is not in any way false or incorrect. But the error is not as a result of any operation performed by `cudaMemcpyAsync`. It is a report from a previous operation. Yes, CUDA error reporting can be confusing. It is partly arising from the asynchronous nature of host and device code execution. – Robert Crovella Oct 06 '22 at 19:26

I recommend against debugging a CUDA program using CUDA error codes due to the asynchronous nature of CUDA (see Robert's answer).

NVIDIA provides users with a tool called compute-sanitizer (shipped as part of the CUDA toolkit) that locates these memory issues for you. Example below:

$ cat test.cu
__global__ void kernel(int *ptr)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    ptr[i] = i;
}

int main(void)
{
    int* ptr;
    cudaMalloc(&ptr, 63 * sizeof(int)); // Correct size should be `64 * sizeof(int)`
    kernel<<<1, 64>>>(ptr);
}
$ nvcc -o test test.cu
$ compute-sanitizer --show-backtrace=device ./test
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 4 bytes
=========     at 0x70 in kernel(int *)
=========     by thread (63,0,0) in block (0,0,0)
=========     Address 0x7fc8efe000fc is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7fc8efe00000 of size 252 bytes
=========
========= ERROR SUMMARY: 1 error
Anis Ladram