
I am trying to apply a kernel function to a `__device__` variable, which, according to the specs, resides "in global memory".

#include <stdio.h>
#include "sys_data.h"
#include "my_helper.cuh"
#include "helper_cuda.h"
#include <cuda_runtime.h>


double X[10] = {1,-2,3,-4,5,-6,7,-8,9,-10};
double Y[10] = {0};
__device__ double DEV_X[10];


int main(void) {
    checkCudaErrors(cudaMemcpyToSymbol(DEV_X, X,10*sizeof(double)));
    vector_projection<double><<<1,10>>>(DEV_X, 10);
    getLastCudaError("oops");
    checkCudaErrors(cudaMemcpyFromSymbol(Y, DEV_X, 10*sizeof(double)));
    return 0;
}

The kernel function `vector_projection` is defined in `my_helper.cuh` as follows:

template<typename T> __global__ void vector_projection(T *dx, int n) {
    int tid;
    tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        if (dx[tid] < 0)
            dx[tid] = (T) 0;
    }
}

As you can see, I use `cudaMemcpyToSymbol` and `cudaMemcpyFromSymbol` to transfer data to and from the device. However, I'm getting the following error:

CUDA error at ../src/vectorAdd.cu:19 code=4(cudaErrorLaunchFailure) 
  "cudaMemcpyFromSymbol(Y, DEV_X, 10*sizeof(double))" 

Footnote: I can of course avoid using `__device__` variables and go for something like this, which works fine; I just want to see how to do the same thing (if possible) with `__device__` variables.
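Roughly, the alternative I mean uses a plain `cudaMalloc`'d buffer instead of a `__device__` symbol; a sketch, with error checking omitted for brevity:

double *dev_x;                                      // ordinary device pointer
cudaMalloc((void **)&dev_x, 10 * sizeof(double));
cudaMemcpy(dev_x, X, 10 * sizeof(double), cudaMemcpyHostToDevice);
vector_projection<double><<<1,10>>>(dev_x, 10);     // dev_x is a valid device address
cudaMemcpy(Y, dev_x, 10 * sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(dev_x);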

Update: The output of `cuda-memcheck` can be found at http://pastebin.com/AW9vmjFs. The error messages I get are as follows:

========= Invalid __global__ read of size 8
=========     at 0x000000c8 in /home/ubuntu/Test0001/Debug/../src/my_helper.cuh:75:void vector_projection<double>(double*, int)
=========     by thread (9,0,0) in block (0,0,0)
=========     Address 0x000370e8 is out of bounds
Pantelis Sopasakis
  • Your `vector_projection` kernel is failing during execution. Some types of kernel issues will be caught by your `getLastCudaError` call. Others may not show up until the next sync point, which would be the `cudaMemcpyFromSymbol`. The docs indicate these calls can return errors from previous async activity. Try running your code with `cuda-memcheck`. If you do cuda error checking as outlined [here](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) you'll get a more explicit indication that the problem is with the kernel. – Robert Crovella Sep 27 '14 at 15:13
  • Thanks @RobertCrovella. Indeed, it seems there is an issue with my kernel; see http://pastebin.com/AW9vmjFs. Do I need to allocate DEV_X before I call `cudaMemcpyToSymbol`? I can't figure out what the problem might be... – Pantelis Sopasakis Sep 27 '14 at 15:21

1 Answer


The root of the problem is that you are not allowed to take the address of a device variable in ordinary host code:

vector_projection<double><<<1,10>>>(DEV_X, 10);
                                    ^

Although this seems to compile correctly, the actual address passed is garbage.

To take the address of a device variable in host code, we can use `cudaGetSymbolAddress`.
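For reference, the runtime API declares it (in CUDA 5.0 and later) as:

cudaError_t cudaGetSymbolAddress(void **devPtr, const void *symbol);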

Here is a worked example that compiles and runs correctly for me:

$ cat t577.cu
#include <stdio.h>

double X[10] = {1,-2,3,-4,5,-6,7,-8,9,-10};
double Y[10] = {0};
__device__ double DEV_X[10];

template<typename T> __global__ void vector_projection(T *dx, int n) {
    int tid;
    tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        if (dx[tid] < 0)
            dx[tid] = (T) 0;
    }
}



int main(void) {
    cudaMemcpyToSymbol(DEV_X, X,10*sizeof(double));
    double *my_dx;
    cudaGetSymbolAddress((void **)&my_dx, DEV_X);
    vector_projection<double><<<1,10>>>(my_dx, 10);
    cudaMemcpyFromSymbol(Y, DEV_X, 10*sizeof(double));
    for (int i = 0; i < 10; i++)
      printf("%d: %f\n", i, Y[i]);
    return 0;
}
$ nvcc -arch=sm_35 -o t577 t577.cu
$ cuda-memcheck ./t577
========= CUDA-MEMCHECK
0: 1.000000
1: 0.000000
2: 3.000000
3: 0.000000
4: 5.000000
5: 0.000000
6: 7.000000
7: 0.000000
8: 9.000000
9: 0.000000
========= ERROR SUMMARY: 0 errors
$

This is not the only possible solution. It is legal to take the address of a device variable in device code, so you could modify your kernel with a line something like this:

T *dx = DEV_X;

and forgo passing the device variable as a kernel parameter. As suggested in the comments, you could also modify your code to use Unified Memory.
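A sketch of the first alternative, with the kernel reading `DEV_X` directly so that the host launch no longer passes a pointer:

__device__ double DEV_X[10];

template<typename T> __global__ void vector_projection(int n) {
    T *dx = DEV_X;   // legal: taking the symbol's address in device code
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n && dx[tid] < 0)
        dx[tid] = (T) 0;
}

// launched from the host as: vector_projection<double><<<1,10>>>(10);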

Regarding error checking: if you deviate from proper CUDA error checking and are not careful in your deviations, the results may be confusing. Most CUDA API calls can, in addition to errors arising from their own behavior, also return an error resulting from previous asynchronous CUDA activity (usually kernel calls).
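The checking pattern from the question linked in the comments boils down to a macro like this (a sketch; the name `checkCuda` is illustrative, and it requires `<stdio.h>` and `<stdlib.h>`):

#define checkCuda(call) do {                                      \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
        fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                cudaGetErrorString(err), __FILE__, __LINE__);     \
        exit(EXIT_FAILURE);                                       \
    }                                                             \
} while (0)

// After a kernel launch, check both the launch and the execution:
// checkCuda(cudaGetLastError());        // launch/configuration errors
// checkCuda(cudaDeviceSynchronize());   // errors raised while the kernel ran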

Robert Crovella
  • Thanks a lot. Indeed it worked. I discovered that the modifier `__managed__` before `DEV_X` also solves the problem, for the same reason as you explained in your answer. In terms of performance, how does the use of `__device__` variables compare to something like this (variables are declared within the scope of the `main` function): http://pastebin.com/rx9nUnGX ? – Pantelis Sopasakis Sep 27 '14 at 16:13
  • Speaking *purely about the performance of device code*, there should be no significant difference in code performance regardless of how the device pointer was created, whether statically (using `__device__`), dynamically (using `cudaMalloc`), or via UM, whether statically (`__managed__ __device__`) or dynamically (using `cudaMallocManaged`). – Robert Crovella Sep 27 '14 at 16:19
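For reference, the `__managed__` variant mentioned in these comments looks roughly like this (a sketch, assuming CUDA 6+ and a Unified-Memory-capable GPU):

__managed__ double DEV_X[10];   // Unified Memory: directly accessible from host and device

int main(void) {
    for (int i = 0; i < 10; i++) DEV_X[i] = X[i];    // plain host access, no memcpy needed
    vector_projection<double><<<1,10>>>(DEV_X, 10);  // taking its address in host code is legal
    cudaDeviceSynchronize();                         // make the results visible on the host
    return 0;
}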