0

So I see a parent question about how to copy from host to the constant memory on GPU using cudaMemcpyToSymbol.

My question is how to do the reverse, copying from device constant memory to the host using cudaMemcpyFromSymbol.

In the following minimal reproducible example, I either got

  • 1) invalid device symbol error using cudaMemcpyFromSymbol(const_d_a, b, size);, or
  • 2) got segmentation fault if I use cudaMemcpyFromSymbol(&b, const_d_a, size, cudaMemcpyDeviceToHost).

I have consulted with the manual which suggests I code as in 1), and this SO question that suggests I code as in 2). Neither of them work here.

Could anyone kindly help suggesting a workaround with this? I must be understanding something improperly... Thanks!

Here is the code:

// a basic CUDA function to test working with device constant memory
#include <stdio.h>
#include <cuda.h>

const unsigned int N = 10;    // size of vectors

__constant__ float const_d_a[N * sizeof(float)];  

int main()
{
    float * a, * b;  // a and b are vectors. c is the result
    a = (float *)calloc(N, sizeof(float));
    b = (float *)calloc(N, sizeof(float));

    /**************************** Exp 1: sequential ***************************/
    int i;
    int size = N * sizeof(float);
    for (i = 0; i < N; i++){
        a[i] = (float)i / 0.23 + 1;
    }


    // 1. copy a to constant memory
    cudaError_t err = cudaMemcpyToSymbol(const_d_a, a, size);
    if (err != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
        exit(EXIT_FAILURE);
    }

    cudaError_t err2 = cudaMemcpyFromSymbol(const_d_a, b, size);
    if (err2 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err2), __FILE__, __LINE__);
        exit(EXIT_FAILURE);
    }

    double checksum0, checksum1;
    for (i = 0; i < N; i++){
        checksum0 += a[i];
        checksum1 += b[i];
    }

    printf("Checksum for elements in host memory is %f\n.", checksum0);
    printf("Checksum for elements in constant memory is %f\n.", checksum1);

    return 0;
}
yuqli
  • 4,461
  • 8
  • 26
  • 46

1 Answers1

1

In CUDA, the various cudaMemcpy* operations are modeled after the C standard library memcpy routine. In that function, the first pointer is always the destination pointer and the second pointer is always the source pointer. That is true for all cudaMemcpy* functions as well.

Therefore, if you want to do cudaMemcpyToSymbol, the symbol had better be the first (destination) argument passed to the function (the second argument would be a host pointer). If you want to do cudaMemcpyFromSymbol, the symbol needs to be the second argument (the source position), and the host pointer is the first argument. That's not what you have here:

cudaError_t err2 = cudaMemcpyFromSymbol(const_d_a, b, size);
                                          ^        ^
                                          |       This should be the symbol.
                                          |  
                                   This is supposed to be the host destination pointer.

You can discover this with a review of the API documentation.

If we reverse the order of those two arguments in that line of code:

cudaError_t err2 = cudaMemcpyFromSymbol(b, const_d_a, size);

Your code will run with no errors and the final results printed will match.

There is no need to use an ampersand with either of the a or b pointers in these functions. a and b are already pointers. In the example you linked, pi_gpu_h is not a pointer. It is an ordinary variable. To copy something to it using cudaMemcpyFromSymbol, it is necessary to take the address of that ordinary variable, because the function expects a (destination) pointer.

As an aside, this doesn't look right:

__constant__ float const_d_a[N * sizeof(float)];  

This is effectively a static array declaration, and apart from the __constant__ decorator it should be done equivalently to how you would do it in C or C++. It's not necessary to multiply N by sizeof(float) here, if you want storage for N float quantities. Just N by itself will do that:

__constant__ float const_d_a[N];

however leaving that as-is does not create problems for the code you have posted.

Robert Crovella
  • 143,785
  • 11
  • 213
  • 257
  • Thanks for the timely reponse Robert! Very clear, I now understand where I am wrong. For the example I linked, the user suggests we use `cudaMemcpyFromSymbol(&pi_gpu_h, pi_gpu, sizeof(double), cudaMemcpyDeviceToHost)` and I suppose this interface is already dated? As we do not need the final parameter. – yuqli Dec 02 '18 at 04:22
  • Thanks also for pointing out the extra `sizeof(float)` specification I had in array declaration. This likely solves another bug I'm having in another chunk of code. I mistaken this array declaration with a `malloc` function where I need to specify the exact number of bytes. Thanks! – yuqli Dec 02 '18 at 04:24
  • If you study the documentation I linked, you will find that there are default arguments here. A default argument on a function is not a CUDA specific concept, you can learn about it through study of C++. Since there are default arguments, it is acceptable to supply them, or omit them. However the example you linked does contain an error. It has to do with the order of the supplied arguments. I'm not going to try and explain it in the space of comments, however. – Robert Crovella Dec 02 '18 at 04:24
  • I've rectified the answer in the question you linked. The nature of the error is described [here](https://stackoverflow.com/questions/24729746/c-skipping-first-default-parameter-in-call) – Robert Crovella Dec 02 '18 at 04:39
  • Thanks for the tips! I now understand 1) the `cudaMemcpyDeviceToHost` parameter is set as default and 2) the original answer missed an offset parameter and thus the order is problematic. – yuqli Dec 03 '18 at 01:07