cudaMemcpy returns cudaErrorInvalidValue when I try to copy from device

Question

I'm a beginner in both C++ and CUDA. I'm trying to write a program that calculates prime numbers. The algorithm itself works, but I can't get the results back from the GPU: after the kernel finishes, I try to copy the data back to the host with cudaMemcpy, and it returns cudaErrorInvalidValue.

What I do: because I don't know in advance how many prime numbers will be returned, I wrote a struct that allocates the device memory, stores the pointers to the data and to the used size, and provides an Add function for the kernel to call.

template <typename T>
struct KernelArray
{
    T*  array = nullptr;
    int* size = nullptr;
    const int capacity;

    KernelArray(const int &capacity)
        : capacity(capacity)
    {
        checkCudaErrors(cudaMalloc(&array, capacity * sizeof(T))); // this all works
        checkCudaErrors(cudaMalloc(&size, sizeof(int)));
        checkCudaErrors(cudaDeviceSynchronize());
    }

    ~KernelArray()
    {
        checkCudaErrors(cudaFree(array));
        checkCudaErrors(cudaFree(size));
    }

    void CopyToDevice(const T* arr, const int &size) // this works too
    {
        if (size > capacity) throw std::invalid_argument("argument 'size' is bigger than allocated memory size");
        checkCudaErrors(cudaMemcpy(array, arr, size * sizeof(T), cudaMemcpyHostToDevice));
        checkCudaErrors(cudaMemcpy(this->size, &size, sizeof(int), cudaMemcpyHostToDevice));
        checkCudaErrors(cudaDeviceSynchronize());
    }

    int GetSizeFromDevice() const
    {
        int* toReturn = (int*)malloc(sizeof(int));
        checkCudaErrors(cudaMemcpy(toReturn, size, sizeof(int), cudaMemcpyDeviceToHost)); // there is an error
        checkCudaErrors(cudaDeviceSynchronize());
        int ret = *toReturn;
        free(toReturn);
        return ret;
    }

    /* ... */

    __device__ void dAdd(const T &a)
    {
        if (*size == capacity) return;
        array[(*size)++] = a;
    }
};

In main I precalculate some prime numbers, push them to the GPU, and pass them to the kernel inside the struct.

KernelArray<int> devPrimes(N / 4);
devPrimes.CopyToDevice(firstPrimes.data(), firstPrimes.size()); //from vector

findPrimesKernel <<<1, 1>>> (nSqrt, N, devPrimes);
auto cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
}
checkCudaErrors(cudaDeviceSynchronize());

int arrivedSize = devPrimes.GetSizeFromDevice(); // error when it steps inside here
int* arrivedArray = (int*)malloc(arrivedSize*sizeof(int));
devPrimes.CopyFromDevice(arrivedArray);

printf("last device primes: %d %d %d", arrivedArray[arrivedSize-3], arrivedArray[arrivedSize - 2], arrivedArray[arrivedSize - 1]);

The kernel checks the numbers, but it modifies the primes array only through the KernelArray::dAdd function. The kernel completes without errors.

So the code that causes the error is this (it is the only error that occurs):

int GetSizeFromDevice() const
{
    int* toReturn = (int*)malloc(sizeof(int));
    checkCudaErrors(cudaMemcpy(toReturn, size, sizeof(int), cudaMemcpyDeviceToHost)); // this line
    // CUDA error at .../kernel.cu:46 code=11(cudaErrorInvalidValue) "cudaMemcpy(toReturn, size, sizeof(int), cudaMemcpyDeviceToHost)"
    checkCudaErrors(cudaDeviceSynchronize());
    int ret = *toReturn;
    free(toReturn);
    return ret;
}

Kernel function:

__global__ void findPrimesKernel(int from, int to, KernelArray<int> primes)
{
    printf("Start..\n");
    for (int i = from+threadIdx.x; i <= to; i+=blockDim.x) 
    {
        for (int p = 0; p < *primes.size; ++p) 
        {
            auto prime = primes.array[p];
            if (i % prime == 0) {
                break; 
            }

            if (i < prime*prime)  
            {
                printf("[%d] found prime %d (size: %d, prime^2: %d^2)\n",threadIdx.x, i, *primes.size, prime);
                primes.dAdd(i);
                break;
            }

        }
    }
    printf("Done from %d to %d)\n", from, to);
}

Any ideas why it doesn't work, or what I should fix to make it work?

Thanks in advance!

How do you call `dAdd` without a race? Can you show the kernel function (`findPrimesKernel`)? — SRhm, Oct 13 '18 at 21:12
Have you defined a destructor for `KernelArray`? If so, does it do a `cudaFree` operation on `size`? It's better to provide a [mcve], which is expected for questions like this. See item 1 [here](https://stackoverflow.com/help/on-topic), note usage of the word "must" — Robert Crovella, Oct 13 '18 at 21:17
For now I use only 1 block and 1 thread. Yes, the struct has a destructor, with `cudaFree` and the error-check macro on both pointers. I've updated the question. Oh, is it because the copy of the struct freed the memory? I'll check it now — Dmitry Frolov, Oct 13 '18 at 21:28
When you pass the `devPrimes` object to the kernel, it is as if you passed it to a C++ function. The function uses pass-by-value to provide a **copy** of the object for use within the function (kernel). When the function (kernel) completes, the copy of the object goes out-of-scope, and so the destructor is called on the object-copy. Since your destructor frees `size`, it is no longer usable (even in the original object) by `cudaMemcpy`, so you get an invalid argument. This is mostly a misunderstanding of C++, not unique/specific to CUDA. This is a duplicate of other questions here. — Robert Crovella, Oct 13 '18 at 21:32
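
The lifetime problem described in the last comment can be reproduced without a GPU. The following is only a minimal host-side sketch, not the asker's code: `fakeMalloc`, `fakeFree`, `fakeCopyToHost`, `Owner`, `View`, `kernelByValue` and `kernelByView` are all hypothetical names, and "device memory" is simulated with a map of integer handles so that a freed allocation can be detected safely. `Owner` plays the role of `KernelArray` (it frees in its destructor and has the implicit shallow copy), while `View` is one possible fix, assuming the kernel only needs the handles and never owns them.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-ins for cudaMalloc / cudaFree / cudaMemcpy. "Device
// memory" is a map from integer handles to values, so this runs on the host.
static std::map<int, int> g_device;
static int g_nextHandle = 1;

int fakeMalloc() { g_device[g_nextHandle] = 0; return g_nextHandle++; }
void fakeFree(int handle) { g_device.erase(handle); }

// Returns false for a handle that was already freed, standing in for the
// cudaErrorInvalidValue seen in the question.
bool fakeCopyToHost(int handle, int* out) {
    assert(out != nullptr);
    auto it = g_device.find(handle);
    if (it == g_device.end()) return false;
    *out = it->second;
    return true;
}

// Owning wrapper with the implicit shallow copy constructor, like KernelArray.
struct Owner {
    int size;                       // handle to the "device" counter
    Owner() : size(fakeMalloc()) {}
    ~Owner() { fakeFree(size); }    // also runs for every by-value copy
};

// Non-owning view: just the handle, no destructor, safe to pass by value.
struct View { int size; };

// Stand-ins for the kernel: each writes a value through the handle.
void kernelByValue(Owner primes) { g_device[primes.size] = 42; }
void kernelByView(View primes)   { g_device[primes.size] = 42; }

// Pass the owner by value: the copy's destructor frees the shared handle,
// so the later copy back to the host fails.
bool readAfterByValue(int* out) {
    Owner o;
    kernelByValue(o);
    return fakeCopyToHost(o.size, out);
}

// Pass a view instead: only the original owner frees, at the end of scope.
bool readAfterByView(int* out) {
    Owner o;
    kernelByView(View{o.size});
    return fakeCopyToHost(o.size, out);
}
```

Calling `readAfterByValue(&v)` returns `false`, because the copy made for the call has already released the handle, while `readAfterByView(&v)` returns `true` and leaves 42 in `v`. In the real program the analogous change would probably be to pass the kernel a small struct holding only the raw `array` and `size` pointers and the capacity, or to move the `cudaFree` calls out of the destructor into an explicit cleanup function; either way, deleting the copy constructor of the owning type would turn this mistake into a compile error.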
