What is the recommended way to copy cuda pitched memory to a 1D array (device to device)?

Question

I consider only device memory within the same device.

Say I have a memory object created with cudaMalloc3D (with 1 as third parameter, so it is as 2D array) which give me a cudaPitchedPtr object (with width and height of xsize and ysize); but I need to use the content as a 1D array as created by cudaMalloc (referenced as T* if each element is of type T, and with xsize*ysize*sizeof(T) bytes).

Is there any recommended way to to this with one function call (something like cudaMemcpy2DToArray, but that would have a T* 1D array as destination) ? If not should I do the copy line by line (which I want to avoid) ?

Robert Crovella · Answer 1 · 2021-08-01T23:46:32.867

cudaMemset2D doesn't create anything. Like memset it sets an already created memory object to particular byte values.

The API that creates a pitched allocation is cudaMallocPitch. (can also use cudaMalloc3D).

If you have an allocation created with cudaMallocPitch, then the correct API to use is cudaMemcpy2D (assuming cudaArray is not involved which seems to be the case here). (for cudaMalloc3D you could use cudaMemcpy3D)

cudaMemcpy2D can copy from a pitched or unpitched allocation to a pitched or unpitched allocation. To copy to an unpitched (flat/linear) allocation, you simply set the pitch value in the destination equal to the width of the copied line (in bytes, for both).

There are numerous questions here on the cuda tag that demonstrate various usages of cudaMallocPitch/cudaMemcpy2D. Here is one that matches your description:

int main(){

  int *a, *b;
  size_t pitch;
  const size_t width = 32;
  const size_t height = 1024;
  cudaMallocPitch(&a, &pitch, width*sizeof(a[0]), height);
  cudaMalloc(&b, width*height*sizeof(b[0]));
  cudaMemcpy2D(b, width*sizeof(b[0]), a, pitch, width*sizeof(a[0]), height, cudaMemcpyDeviceToDevice);
}

If you were starting with a pre-existing cudaPitchedPtr, it could be like this:

int main(){

  int *b;
  cudaPitchedPtr a;
  ...
  size_t pitch = a.pitch
  const size_t width = a.xsize;
  const size_t height = a.ysize;
  cudaMalloc(&b, width*height*sizeof(b[0]));
  cudaMemcpy2D(b, width*sizeof(b[0]), a.ptr, pitch, width*sizeof(a.ptr[0]), height, cudaMemcpyDeviceToDevice);
}

Here is a verification:

$ cat t1861.cu
#include <stdio.h>
const size_t width = 32;
const size_t height = 1024;
__global__ void k1(int *a, size_t pitch){
  char *ca = (char *)a;
  int **ia = (int **)&ca;
  for (int i = 0; i < height; i++){
    for (int j = 0; j < width; j++)
      (*ia)[j] = i*width+j;
    ca += pitch;
  }
}

__global__ void k2(int *b){
  for (int i = 0; i < 5; i++){
    for (int j = 0; j< width; j++) printf("%d ", b[i*width+j]);
    printf("\n");}
}

int main(){

  int *a, *b;
  size_t pitch;
  cudaMallocPitch(&a, &pitch, width*sizeof(a[0]), height);
  cudaMalloc(&b, width*height*sizeof(b[0]));
  k1<<<1,1>>>(a, pitch);
  cudaMemcpy2D(b, width*sizeof(b[0]), a, pitch, width*sizeof(a[0]), height, cudaMemcpyDeviceToDevice);
  k2<<<1,1>>>(b);
  cudaDeviceSynchronize();
  cudaPitchedPtr c = make_cudaPitchedPtr(a, width * sizeof(a[0]), width, height);
  printf("%lu\n", c.xsize);
}
$ nvcc -o t1861 t1861.cu
$ ./t1861
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
32
$

For the description that is in your question at the moment, nothing really changes. You would simply extract the appropriate parameters from your `cudaPitchedPtr` [structure](https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaPitchedPtr.html#structcudaPitchedPtr). — Robert Crovella, Jul 30 '21 at 18:41
Don't you think that destination pitch should be `width*height*sizeof(b[0])` rather than the stride ? Since it's a 1D array. — Soleil, Jul 30 '21 at 22:24
no, I don't think so. pitched copying can be thought of as a line-by-line copy. After each line is copied, both pointers are updated by their respective pitches. Anyway, why not give it a try and see which one works correctly? You won't permanently break anything. You're not playing with dynamite here. — Robert Crovella, Jul 30 '21 at 22:44
Robert, I'm having a day in a cloud of dynamite ! Also, I believe that `a.xsize` unit is bytes, therefore we should have `const size_t width = a.xsize / sizeof(T)` where T is the type of element in the `cudaPitchedPtr`. It's not visible if T is 1 byte, but with T being float for instance, then, it explodes indeed. — Soleil, Jul 30 '21 at 22:59
The reason I have this is because I create the `cudaPitchedPtr` this way: ` cudaPitchedPtr pitched{}; const cudaExtent extent = make_cudaExtent(width * sizeof(T), height, colors); CheckCudaErrors(cudaMalloc3D(&pitched, extent));` — Soleil, Jul 30 '21 at 23:03
`xsize` for a `cudaPitchedPtr` is not bytes. Take a look at the documentation. Or any sample code that uses it. [Here](https://stackoverflow.com/a/23313024/1695960) is an example. The constructor clearly shows the order of parameters used to construct it. The first parameter is the pointer. The second parameter is the pitch. Note how pitch is constructed from the x dimension. The third parameter is the `xsize`. Note how it is constructed from the x dimension. Anyway it's OK if you don't believe my answer is correct. Good luck! You have all the tools you need to confirm. — Robert Crovella, Jul 30 '21 at 23:55
You might also wish to `grep` the CUDA sample codes for `make_cudaPitchedPtr` for additional examples. — Robert Crovella, Jul 30 '21 at 23:58

What is the recommended way to copy cuda pitched memory to a 1D array (device to device)?

1 Answers1