
I am quite new to CUDA and I have a question about memory management for an object. I would like one member function to load the data to the device, and the computation to be carried out when another member function is called.

I have read parts of the NVIDIA programming guide and some SO questions, but in those the data copying and the computation happen in a single function, so there is no need for multiple functions.

Some more specifications: the data is read only once. I do not know the data size at compile time, so I need dynamic allocation. My current device has compute capability 2.1 (it will be upgraded to 6.1 soon).

I want to copy the data in a first function and use the data in a different function. For example:

__constant__ int dev_size;
__device__ float* dev_data; //<- not sure about this

/* kernel */
__global__ void computeSomething(float* dev_output)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < dev_size)
    {
        dev_output[idx] = dev_data[idx] * 100; // some computation
    }
}

// function 1
void OBJECT::copyVolumeToGPU(int size, float* data)
{
    cudaMalloc(&dev_data, size * sizeof(float));
    cudaMemcpy(dev_data, data, size * sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpyToSymbol(dev_size, &size, sizeof(int));
}

// function 2
void OBJECT::computeSmthOnDevice(int size)
{
    // allocate output array
    auto host_output = new float[size];
    float* dev_output;
    cudaMalloc(&dev_output, size * sizeof(float));

    int block = 256;
    int grid = (size + block - 1) / block; // integer ceiling of size/block
    computeSomething<<<grid,block>>>(dev_output);

    cudaMemcpy(host_output, dev_output, size * sizeof(float), cudaMemcpyDeviceToHost);

    /* ... do something with output ... */

    delete[] host_output;
    cudaFree(dev_output);
}

Error checking (gpuErrChk) is done as in https://stackoverflow.com/a/14038590/3921660 but is omitted in this example for brevity.
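For reference, the macro from that answer looks roughly like this:

#include <cstdio>
#include <cstdlib>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char* file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}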

Can I copy the data using a __device__ pointer (like __device__ float* dev_data;)?


1 Answer


Generally, your idea is workable, but this:

cudaMalloc(&dev_data, size * sizeof(float));

is not legal. It is not legal to take the address of a __device__ item in host code. So if you know the size at compile time, the easiest approach is to convert this to a static allocation, e.g.

__device__ float dev_data[1000]; 

If you really want this to be a dynamically allocated __device__ pointer, then you will need to use a method such as described here: use cudaMalloc on an ordinary device pointer in host code that acts as a "temporary", then copy that temporary pointer value to the __device__ pointer via cudaMemcpyToSymbol. Thereafter, when you want to copy data to or from that particular allocation with cudaMemcpy, you use the temporary pointer from host code.
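A minimal sketch of that approach, reusing dev_data and dev_size from the question (error checking omitted; this illustrates the described method, it is not code from the linked answer):

// host code
float* temp_ptr = nullptr;                    // ordinary host-side copy of the device address
cudaMalloc(&temp_ptr, size * sizeof(float));  // allocate device memory via the temporary
cudaMemcpyToSymbol(dev_data, &temp_ptr, sizeof(float*)); // store the address in the __device__ symbol
cudaMemcpy(temp_ptr, data, size * sizeof(float), cudaMemcpyHostToDevice); // copy through the temporary
cudaMemcpyToSymbol(dev_size, &size, sizeof(int));

To reach the allocation again from another function, keep temp_ptr around (for example as a class member), since later cudaMemcpy calls must go through it rather than through the symbol.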

Note that for the purposes of "communicating" data from one function to the next, or one kernel to the next, there's no reason you couldn't just use an ordinary pointer dynamically allocated with cudaMalloc, and pass that pointer around to wherever you need it. You can even pass it via a global variable to any host function that needs it, like an ordinary global pointer. For kernels, however, you would still need to pass such a pointer to the kernel as a kernel argument.
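For example, the two member functions from the question could be restructured like this (a sketch; m_dev_data and m_size are assumed member variables of OBJECT that are not in the original code):

__global__ void computeSomething(const float* dev_data, float* dev_output, int size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
    {
        dev_output[idx] = dev_data[idx] * 100; // some computation
    }
}

void OBJECT::copyVolumeToGPU(int size, float* data)
{
    m_size = size;                                 // assumed member variable
    cudaMalloc(&m_dev_data, size * sizeof(float)); // assumed member, an ordinary float*
    cudaMemcpy(m_dev_data, data, size * sizeof(float), cudaMemcpyHostToDevice);
}

void OBJECT::computeSmthOnDevice(int size)
{
    float* dev_output;
    cudaMalloc(&dev_output, size * sizeof(float));

    int block = 256;
    int grid = (size + block - 1) / block; // integer ceiling
    computeSomething<<<grid, block>>>(m_dev_data, dev_output, size);

    // copy results back, use them, then clean up as in the question
    cudaFree(dev_output);
}

This avoids __device__ symbols entirely; the device pointer behaves like any other pointer stored in the object.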
