
I cannot find the answer anywhere (I may have overlooked it), but it seems that one cannot use `__constant__` memory (along with `cudaMemcpyToSymbol`) together with peer-to-peer access under UVA.

I've tried the simpleP2P NVIDIA sample code, which works fine on the four NV100 GPUs with NVLink that I have, but as soon as I declare the factor 2 used in the kernel as:

__constant__ float M_; // in global space

float M = 2.0;
cudaMemcpyToSymbol(M_, &M, sizeof(float), 0, cudaMemcpyDefault);

the results are basically zero. If I define it with the C preprocessor instead (e.g. `#define M_ 2.0`), it works fine.

So I'm wondering: is that true, or am I doing something wrong? And are there any other kinds of memory that cannot be accessed this way (texture memory, for example)?

salvaS
  • What do you mean by "basically zero"? – talonmies Mar 10 '20 at 12:15
  • Sorry, I meant that the value returned by the kernel for this parameter is zero when executed on a device where the `cudaMemcpyToSymbol` instruction was not executed. Which is the case if the **constant memory** is not allocated in **global memory**. – salvaS Mar 11 '20 at 13:51
  • 1
    You would need to run a `cudaMemcpyToSymbol` operation on each GPU where you intend to use the `__constant__` value. P2P doesn't change this. `__constant__` memory is a per-GPU resource. When you create a global `__constant__` entity, it is (separately) instantiated on each GPU visible to the CUDA runtime. Therefore each separate copy (one per GPU) must be initialized, if you intend to use it on that GPU. Again, P2P doesn't affect this at all. If you attempt to reference a constant symbol, it will use the local (per-GPU) copy. – Robert Crovella Mar 11 '20 at 16:00

1 Answer


The relation between your question of why "the results are basically zero" and P2P access with UVA is not immediately clear to me.

is that true or am I doing something wrong?

It's hard to say, as your question is a bit vague and no complete example is shown.

`__constant__ float M_` declares a variable `M_` in the constant memory of every CUDA-visible device. To set its value on multiple devices, you should do something like:

#include <cstdio>

__constant__ float M_; // <= This declares M_ in the constant memory of every CUDA-visible device

__global__ void showMKernel() {
    printf("****** M_ = %f\n", M_);
}

int main()
{
    float M = 2.0f;

    // Make sure that the return values are properly checked for cudaSuccess ...

    int deviceCount = -1;
    cudaGetDeviceCount(&deviceCount);

    // Set M_ in the constant memory of each device:
    for (int i = 0; i < deviceCount; i++) {
        cudaSetDevice(i);
        cudaMemcpyToSymbol(M_, &M, sizeof(float), 0, cudaMemcpyDefault);
    }

    // Now, run a kernel to show M_:
    for (int i = 0; i < deviceCount; i++) {
        cudaSetDevice(i);
        printf("Device %d :\n", i); // %d, not %g: i is an int
        showMKernel<<<1,1>>>();
        cudaDeviceSynchronize();
    }
}

which prints:

Device 0 :
****** M_ = 2.000000
Device 1 :
****** M_ = 2.000000
// and so on for the other devices

Now, if I replace

// Set M_ on the constant memory of each device:
for (int i = 0; i < deviceCount; i++) {
  cudaSetDevice(i);
  cudaMemcpyToSymbol(M_, &M, sizeof(float), 0, cudaMemcpyDefault);
}

with

cudaMemcpyToSymbol(M_, &M, sizeof(float), 0, cudaMemcpyDefault);

this will only set the value of `M_` on the currently active device, and the program therefore prints

Device 0 :
****** M_ = 2.000000
Device 1 :
****** M_ = 0.000000 // <= I assume this is what you meant by 'the results are basically zero'
// M_ = 0 for the other devices too

are there any other kind of memory that also cannot be accessed this way (texture memory for example)?

Again, I'm not entirely sure what *this way* refers to. Generally, though, I believe you cannot access the constant memory or the texture memory of one device from another device, although I am not 100% certain.

UVA assigns a single virtual address space to the CPU and to the memories of all GPUs, so that memory copies between the host and the global memory of multiple devices can be performed with `cudaMemcpy` using the kind `cudaMemcpyDefault`.

Also, P2P communication between devices allows direct access to, and transfers of, data in the global memories of multiple devices.
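To make that concrete, here is a minimal, untested sketch (it needs at least two P2P-capable GPUs, and error checking is omitted) of enabling peer access and then letting UVA infer the copy direction from the pointers:

```
// Sketch: enable P2P between devices 0 and 1, then copy global memory
// directly between them. Assumes two P2P-capable GPUs; no error checking.
int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, 0, 1); // can device 0 access device 1's memory?
if (canAccess) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);      // second argument (flags) must be 0
}

float *d0, *d1;
cudaSetDevice(0); cudaMalloc(&d0, 1024 * sizeof(float));
cudaSetDevice(1); cudaMalloc(&d1, 1024 * sizeof(float));

// Under UVA, the runtime infers the source and destination devices from
// the pointers themselves, so cudaMemcpyDefault works device-to-device:
cudaMemcpy(d0, d1, 1024 * sizeof(float), cudaMemcpyDefault);
```

Note that this covers *global* memory only; it does not make one device's `__constant__` or texture memory visible to another device.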

Similar to the `__constant__` example above, when you declare a texture like `texture <float> some_texture`, `some_texture` will be instantiated on each visible device; however, when working with multiple devices, you need to explicitly bind your data to `some_texture` on each device.
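A rough, untested sketch of that per-device binding, using the legacy texture-reference API (deprecated in recent CUDA versions; the buffer name and size here are illustrative):

```
texture<float> some_texture; // instantiated once per visible device, like M_ above

const size_t N = 1024;       // illustrative buffer size
int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);

// Bind each device's copy of some_texture to a buffer on that device:
for (int i = 0; i < deviceCount; i++) {
    cudaSetDevice(i);
    float *d_buf;
    cudaMalloc(&d_buf, N * sizeof(float));
    cudaBindTexture(NULL, some_texture, d_buf, N * sizeof(float));
}
```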

If_You_Say_So
  • Well, when I say basically zero, I mean that indeed, when the `cudaMemcpyToSymbol` is not executed on the specified device, the value of the constant parameter is zero. The value is the right one on the device where it was set. But if you look at the memory design, it seems that `__constant__` memory is located in global memory and then cached. The same goes for **texture memory**. That, in my understanding, goes against the concept of P2P memory usage. Again, I did not see anything about this for P2P in the manual or anywhere else. – salvaS Mar 11 '20 at 13:45