I am trying to get a better grasp of memory management in CUDA. Something is just now occurring to me as a major gap in my understanding: how do kernels access values that, as I understand it, should be in host memory?

When vectorAdd() is called, it runs on the device, but only the elements are stored in device memory; the lengths of the vectors are stored on the host. How is it that the kernel does not exit with an error from trying to access foo.length, something that should be on the host?

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct{
    float *elements;
    int length;
} vector;

__global__ void vectorAdd(vector foo, vector bar){
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if(idx < foo.length){                      //this is the part that I do not understand
        foo.elements[idx] += bar.elements[idx];
    }
}

int main(void){
    vector foo, bar;
    foo.length = bar.length = 50;
    cudaMalloc(&(foo.elements), sizeof(float)*50);
    cudaMalloc(&(bar.elements), sizeof(float)*50);
    //these vectors are uninitialized, so the addition operates on garbage values
    int blocks_per_grid = 10;
    int threads_per_block = 5;
    vectorAdd<<<blocks_per_grid, threads_per_block>>>(foo, bar);
    cudaDeviceSynchronize();
    cudaFree(foo.elements);
    cudaFree(bar.elements);
    return 0;
}
Rory Grice

1 Answer


In C and C++, the typical mechanism for making arguments available to the body of a function is pass-by-value. The basic idea is that a separate copy of each argument is made, for use by the function.

CUDA claims compliance with C++ (subject to various limitations), and it therefore provides a mechanism for pass-by-value. On a kernel call, the CUDA compiler and runtime make copies of the arguments, for use by the function (kernel). In the case of a kernel call, these copies are placed in a particular area of __constant__ memory, which resides in GPU memory space and is therefore accessible to device code.

So, in your example, the entire structures passed as the arguments for the parameters vector foo, vector bar are copied to GPU device memory (specifically, constant memory) by the CUDA runtime. The compiler structures the device code so that references to these arguments read directly from constant memory.

Since those structures contain both the elements pointer and the scalar quantity length, both items are accessible in CUDA device code, and the compiler will structure references to them (e.g. foo.length) so as to retrieve the needed quantities from constant memory.

So the kernels are not accessing host memory in your example. The pass-by-value mechanism makes the quantities available to device code, in GPU constant memory.

Robert Crovella