
I'm testing some code in which a kernel is meant to perform a simple arithmetic operation (here, a subtraction) on two values stored through pointers.

After a call to the kernel `add`, I can no longer copy the pointers' data from host to device and back to the host, even though the kernel performs no operation on those pointers. But when I comment out the statement that calls the kernel, I get the correct results. Here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void add(int *a, int *b, int *c)
{
    *c = *a - *b;
}

int main(void)
{
    int result, x_val, y_val;     // Store data copied from device to host in these vars.
    int *x_host, *y_host;         // Pointers in host
    int *tempGPU, *x_dev, *y_dev; // Pointers in device

    x_host = (int *)malloc(sizeof(int));
    y_host = (int *)malloc(sizeof(int));

    *x_host = 8;
    *y_host = 4;

    x_val = -5;
    y_val = -10;

    printf("\n x = %d, y = %d\n", *x_host, *y_host);

    cudaMalloc( (void **)&tempGPU, sizeof(int) );

    // It's wrong to pass these arguments to the kernel. The problem is in this statement.
    add<<<1,1>>> (x_host, y_host, tempGPU);

    cudaMemcpy(&result, tempGPU, sizeof(int), cudaMemcpyDeviceToHost);

    printf("\n x_host - y_host = %d\n", result);

    cudaMalloc( (void **)&x_dev, sizeof(int) );
    cudaMalloc( (void **)&y_dev, sizeof(int) );

    *x_host = 6;
    *y_host = 20;

    cudaMemcpy(x_dev, x_host, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(y_dev, y_host, sizeof(int), cudaMemcpyHostToDevice);

    cudaMemcpy(&x_val, x_dev, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&y_val, y_dev, sizeof(int), cudaMemcpyDeviceToHost);

    printf("\n x_host = %d, y_host = %d\n", *x_host, *y_host);
    printf("\n x_val = %d, y_val = %d\n", x_val, y_val);

    cudaFree( tempGPU );

    printf( "\nCUDA: %s\n", cudaGetErrorString(cudaGetLastError()) );

    return 0;
}

I know that the kernel expects pointers allocated on the device, but why does such a mistake prevent me from using cudaMemcpy properly afterwards? And why, when I comment out the line:

add<<<1,1>>> (x_host, y_host, tempGPU);

do I get the correct results? Thanks.

Granados
  • Your problem is that `x_host` and `y_host` are pointers to host memory spaces. The `__global__ add` function expects pointers to device memory space. As you have constructed your code, add will wrongly interpret `x_host` and `y_host` as device memory pointers. – Vitality Feb 08 '14 at 07:56
  • You don't check for errors, and that incorrect reasoning originates from there. Your `add` kernel fails to run properly, but the show goes on because the error is not caught until `cudaMemcpy`. Please have a look at [this](http://stackoverflow.com/q/14038589/2386951). – Farzad Feb 08 '14 at 08:01

1 Answer


Your problem is that `x_host` and `y_host` are pointers to host memory space. The `__global__ add` function expects pointers to device memory space. As you have constructed your code, `add` will wrongly interpret `x_host` and `y_host` as device memory pointers.
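
To make the distinction concrete, here is a minimal sketch of the data flow the kernel requires (the names `h_a`, `h_b`, `h_c`, `d_a`, `d_b`, `d_c` are just illustrative): every pointer handed to `add` must come from `cudaMalloc`, and the values must be staged with `cudaMemcpy`:

    int h_a = 8, h_b = 4, h_c;               // values living in host memory
    int *d_a, *d_b, *d_c;                    // pointers into device memory
    cudaMalloc((void **)&d_a, sizeof(int));
    cudaMalloc((void **)&d_b, sizeof(int));
    cudaMalloc((void **)&d_c, sizeof(int));
    cudaMemcpy(d_a, &h_a, sizeof(int), cudaMemcpyHostToDevice);   // stage inputs on the device
    cudaMemcpy(d_b, &h_b, sizeof(int), cudaMemcpyHostToDevice);
    add<<<1,1>>>(d_a, d_b, d_c);             // all three arguments are device pointers
    cudaMemcpy(&h_c, d_c, sizeof(int), cudaMemcpyDeviceToHost);   // bring the result back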

As Farzad noted, you could have spotted the mistake yourself by adding proper CUDA error checking, in the sense of What is the canonical way to check for errors using the CUDA runtime API?.

Below is your code fixed with proper CUDA error checking.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess) 
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) { getchar(); exit(code); }
    }
}

__global__ void add(int *a, int *b, int *c)
{
    *c = *a - *b;
}

int main(void)
{
    int* x_host = (int*)malloc(sizeof(int));
    int* y_host = (int*)malloc(sizeof(int));

    *x_host = 8;
    *y_host = 4;

    int* tempGPU;   gpuErrchk(cudaMalloc((void**)&tempGPU,sizeof(int)));
    int* x_dev;     gpuErrchk(cudaMalloc((void**)&x_dev,  sizeof(int)));
    int* y_dev;     gpuErrchk(cudaMalloc((void**)&y_dev,  sizeof(int)));

    gpuErrchk(cudaMemcpy(x_dev, x_host, sizeof(int), cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(y_dev, y_host, sizeof(int), cudaMemcpyHostToDevice));

    int result; 

    add<<<1,1>>> (x_dev, y_dev, tempGPU);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(&result, tempGPU, sizeof(int), cudaMemcpyDeviceToHost));

    printf("\n x_host - y_host = %d\n", result);

    gpuErrchk(cudaFree(x_dev));
    gpuErrchk(cudaFree(y_dev));
    gpuErrchk(cudaFree(tempGPU));

    getchar();

    return 0;

}
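
As a side note, and only as a sketch to illustrate the point: if you keep the original, incorrect launch with host pointers but wrap it with the same checks, the failure is reported at the synchronization point instead of silently propagating into the later `cudaMemcpy` calls (the exact error string depends on the GPU and driver):

    add<<<1,1>>> (x_host, y_host, tempGPU);   // wrong: x_host and y_host are host pointers
    gpuErrchk(cudaPeekAtLastError());         // would catch launch-time errors
    gpuErrchk(cudaDeviceSynchronize());       // the kernel's illegal memory access is reported here
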
Vitality
  • Yes, I intentionally called `add` with pointers to host memory space, let's say, to see what happens. So, **without checking for CUDA errors**, an error in the CUDA runtime API results in a "disabling" or "corruption", so to speak, of subsequent calls to CUDA functions? And that is why, in my code example, `cudaMemcpy` gives me wrong results? – Granados Feb 08 '14 at 19:47
  • @LeonelG Your code with the wrong arguments to the `add` function simply gets stuck, and no `cudaMemcpy` is executed afterwards on my system. – Vitality Feb 09 '14 at 08:52