Passing a row of pointers to __global__ function

Question

I am trying to pass one row of a two-dimensional array of pointers to a CUDA kernel. See my code below; the array of pointers is noLocal. Because the kernel does an atomicAdd, I expect a nonzero number from the line printf("Holaa %d\n", local[0][0]);, but the value I get is 0. Could you help me pass an array to a CUDA kernel by reference, please?

__global__ void myadd(int *data[8])
{
  unsigned int x = blockIdx.x;
  unsigned int y = threadIdx.x;
  unsigned int z = threadIdx.y;
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  //printf("Ola sou a td %d\n", tid);
  for (int i; i<8; i++)
      atomicAdd(&(*data)[i],10);
}

int main(void)
{
  int local[20][8] = { 0 };
  int *noLocal[20][8];
  for (int d = 0; d< 20;d++) {
      for (int dd = 0; dd< 8; dd++) {
          cudaMalloc(&(noLocal[d][dd]), sizeof(int));
          cudaMemcpy(noLocal[d][dd], &(local[d][dd]), sizeof(int), cudaMemcpyHostToDevice);
          
      }
      myadd<<<20, dim3(10, 20)>>>(noLocal[d]);
  }
  for (int d = 0; d< 20;d++)
      for (int dd = 0; dd < 8; dd++)
          cudaMemcpy(&(local[d][dd]), noLocal[d][dd], sizeof(int), cudaMemcpyDeviceToHost);
  printf("Holaa %d\n", local[0][0]);
  for (int d = 0; d < 20; d++)
      for (int dd = 0; dd < 8; dd++)
          cudaFree(noLocal[d][dd]);
}

score 2 · Answer 1 · answered Jul 28 '21 at 01:41

I believe you received good advice in the other answer. I don't recommend this coding pattern. For general reference material on creating 2D arrays in CUDA, see this answer.
When I compile the code you have shown, I get warnings of the form "i is used before its value is set". This kind of warning should not be ignored. It arises from this statement which doesn't make sense to me:
```
for (int i; i<8; i++)
```
that should be:
```
for (int i = 0; i<8; i++)
```
It's not clear you understand the C++ concepts of pointers and arrays. This:
```
int local[20][8] = { 0 };
```
represents an array of 20x8 = 160 integers. If you want to imagine it as an array of pointers, you could pretend that it includes 20 pointers of the form local[0], local[1]..local[19]. Each of those "pointers" points to an array of 8 integers. But there is no sensible comparison to suggest that it has 160 pointers in it. Furthermore the usage pattern you indicate in your kernel does not suggest that you expect 160 pointers to individual integers. But that is exactly what you are creating here:
```
int *noLocal[20][8];  // this declares a 2D array of 160 *pointers*
for (int d = 0; d < 20; d++) {        // the combination of these loops means
    for (int dd = 0; dd < 8; dd++) {  // you will create 160 *pointers*
        cudaMalloc(&(noLocal[d][dd]), sizeof(int));
```
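To make the difference between the two declarations concrete, here is a minimal host-only sketch (plain C++, no CUDA calls). The helper names `intsInLocal`, `pointersInNoLocal` and `rowStart` are mine, introduced only for illustration; the two arrays have the same shapes as in the question.

```cpp
#include <cstddef>

// Same shapes as in the question, declared at namespace scope for the sketch.
int  local[20][8]   = { 0 };  // 160 ints stored contiguously, no pointers
int *noLocal[20][8] = { };    // 160 separate pointer objects (all null here)

// How many int objects `local` holds directly.
constexpr std::size_t intsInLocal = sizeof(local) / sizeof(int);

// How many pointer objects `noLocal` holds directly.
constexpr std::size_t pointersInNoLocal = sizeof(noLocal) / sizeof(int *);

// local[d] has type int[8]; used as a value it decays to an int* that
// points at the first of the 8 ints in row d. (Hypothetical helper.)
int *rowStart(int d)
{
    return local[d];
}
```

So `local` gives you 20 row "pointers" for free through array decay, while `noLocal` as declared really is 160 independent pointers, each of which needs its own allocation.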
To mimic your host array (local) you want to create 20 pointers each of which is pointing to an allocation of 8 int quantities. The usage in your kernel code here:
```
 &(*data)[i]
```
means that you intend to take a single pointer (*data, which is the same as data[0]) and offset it by i, with i ranging from 0 to 7. It does not mean that you expect to receive 8 individual pointers. Again, this is C++ behavior, not unique or specific to CUDA.
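The same point can be checked on the host. This is a sketch in plain C++, with a hypothetical function `addTenViaFirstPointer` standing in for the kernel and `+=` standing in for atomicAdd. It shows that a parameter written as `int *data[8]` is really an `int **`, and that `(*data)[i]` only ever goes through the first pointer.

```cpp
#include <type_traits>

// The "8" is ignored by the compiler: an array parameter is adjusted to a
// pointer, so this function actually takes an int **. (Hypothetical name.)
void addTenViaFirstPointer(int *data[8])
{
    for (int i = 0; i < 8; i++)
        (*data)[i] += 10;  // *data is data[0]; this touches data[0][0..7] only
}

// Compile-time check that the parameter type really is int **.
static_assert(
    std::is_same<decltype(&addTenViaFirstPointer), void (*)(int **)>::value,
    "an int *data[8] parameter is adjusted to int **data");
```

If `data[0]` points at 8 ints, all 8 are updated and `data[1]` onward are never read. In the question each pointer referred to a single-int allocation, so any i above 0 is out of bounds. Note also that noLocal[d] there is a row of a host array, so `*data` inside the kernel would be reading a host address, which is normally not valid in device code.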

In order to make your code "sensible" there were a variety of changes I had to make. Here's a "fixed" version:

$ cat t1858.cu
#include <cstdio>

__global__ void myadd(int data[8])
{
//  unsigned int x = blockIdx.x;
//  unsigned int y = threadIdx.x;
//  unsigned int z = threadIdx.y;
//  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  //printf("Ola sou a td %d\n", tid);
  for (int i = 0; i<8; i++)
      atomicAdd(data+i,10);
}

int main(void)
{
  int local[20][8] = { 0 };
  int *noLocal[20];
  for (int d = 0; d < 20; d++) {
      cudaMalloc(&(noLocal[d]), 8*sizeof(int));
      cudaMemcpy(noLocal[d], local[d], 8*sizeof(int), cudaMemcpyHostToDevice);

      myadd<<<20, dim3(10, 20)>>>(noLocal[d]);
  }
  for (int d = 0; d < 20; d++)
      cudaMemcpy(local[d], noLocal[d], 8*sizeof(int), cudaMemcpyDeviceToHost);
  printf("Holaa %d\n", local[0][0]);
  for (int d = 0; d < 20; d++)
      cudaFree(noLocal[d]);
}
$ nvcc -o t1858 t1858.cu
$ cuda-memcheck ./t1858
========= CUDA-MEMCHECK
Holaa 40000
========= ERROR SUMMARY: 0 errors
$

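The number 40000 is correct. Each launch runs 20 blocks of 10x20 = 200 threads, i.e. 4000 threads, and every thread atomically adds 10 to each element of the row it was given: 10 x 20 x 200 = 40000. Each row is touched by only one of the 20 launches, so every element ends up at 40000.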

score 1 · Answer 2 · answered Jul 27 '21 at 19:46

You should simply not be doing anything like that. You are wasting time and memory with these excessive allocations. And - your kernel would be pretty slow as well. I am 100% certain this is not what you were asked, nor what you wanted, to do.

Instead, you should:

Allocate a single large buffer on the device to fit the data you need.
Avoid using pointers on the device side, except to that buffer, unless absolutely necessary.
If you somehow have to use a 2D pointer array - add relevant offsets to your buffer's base pointer to get different pointers into it.
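A rough host-side sketch of the offset arithmetic behind these points, in plain C++. A `std::vector` stands in for what would be one cudaMalloc of 20*8 ints on the device, and `rowPointer`, `addTenToRow` and `runAllRows` are hypothetical names used only for illustration. This is not device code; it only shows how row pointers can be derived from one base pointer.

```cpp
#include <cstddef>
#include <vector>

constexpr int ROWS = 20;
constexpr int COLS = 8;

// Pointer to the first element of row d inside one flat ROWS*COLS buffer.
int *rowPointer(int *base, int d)
{
    return base + static_cast<std::size_t>(d) * COLS;
}

// Host stand-in for the kernel body: add 10 to each of the COLS ints in a row.
void addTenToRow(int *row)
{
    for (int i = 0; i < COLS; i++)
        row[i] += 10;
}

// One buffer for everything (assumption: on the device this would be a single
// allocation), with each row reached by offsetting the base pointer.
std::vector<int> runAllRows()
{
    std::vector<int> buffer(ROWS * COLS, 0);
    for (int d = 0; d < ROWS; d++)
        addTenToRow(rowPointer(buffer.data(), d));
    return buffer;
}
```

With this layout there is one allocation and one copy in each direction instead of 20 or 160, and no table of pointers has to exist on the device at all.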

Thanks @einpoklum, this is only for educational purposes; I know I really should use offset pointers. But for now, could you please help me pass that row of pointers? — Juan, Jul 27 '21 at 22:52
