Talonmies has already satisfactorily answered this question. Here, some further explanation that could be useful to the Community.
When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned.
CUDA provides the cudaMallocPitch
function to “pad” 2D matrix rows with extra bytes so to achieve the desired alignment. Please, refer to the “CUDA C Programming Guide”, Sections 3.2.2 and 5.3.2, for more information.
Assuming that we want to allocate a 2D padded array of floating point (single precision) elements, the syntax for cudaMallocPitch
is the following:
cudaMallocPitch(&devPtr, &devPitch, Ncols * sizeof(float), Nrows);
where
devPtr
is an output pointer to float (float *devPtr
).
devPitch
is a size_t
output variable denoting the length, in bytes, of the padded row.
Nrows
and Ncols
are size_t
input variables representing the matrix size.
Recalling that C/C++ and CUDA store 2D matrices by row, cudaMallocPitch
will allocate a memory space of size, in bytes, equal to Nrows * pitch
. However, only the first Ncols * sizeof(float)
bytes of each row will contain the matrix data. Accordingly, cudaMallocPitch
consumes more memory than strictly necessary for the 2D matrix storage, but this is returned in more efficient memory accesses.
CUDA provides also the cudaMemcpy2D
function to copy data from/to host memory space to/from device memory space allocated with cudaMallocPitch
. Under the above hypotheses (single precision 2D matrix), the syntax is the following:
cudaMemcpy2D(devPtr, devPitch, hostPtr, hostPitch, Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice)
where
devPtr
and hostPtr
are input pointers to float (float *devPtr
and float *hostPtr
) pointing to the (source) device and (destination) host memory spaces, respectively;
devPitch
and hostPitch
are size_t
input variables denoting the length, in bytes, of the padded rows for the device and host memory spaces, respectively;
Nrows
and Ncols
are size_t
input variables representing the matrix size.
Note that cudaMemcpy2D
allows also for pitched memory allocation on the host side. If the host memory has no pitch, then hostPtr = Ncols * sizeof(float)
. Furthermore, cudaMemcpy2D
is bidirectional. For the above example, we are copying data from host to device. If we want to copy data from device to host, then the above line changes to
cudaMemcpy2D(hostPtr, hostPitch, devPtr, devPitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost)
The access to elements of a 2D matrix allocated by cudaMallocPitch
can be performed as in the following example:
int tidx = blockIdx.x*blockDim.x + threadIdx.x;
int tidy = blockIdx.y*blockDim.y + threadIdx.y;
if ((tidx < Ncols) && (tidy < Nrows))
{
float *row_a = (float *)((char*)devPtr + tidy * pitch);
row_a[tidx] = row_a[tidx] * tidx * tidy;
}
In such an example, tidx
and tidy
are used as column and row indices, respectively (remember that, in CUDA, x
-threads span the columns and y
-threads span the rows to favor coalescence). The pointer to the first element of a row is calculated by offsetting the initial pointer devPtr
by the row length tidy * pitch
in bytes (char *
is a pointer to bytes and sizeof(char)
is 1
byte), where the length of each row is computed by using the pitch information.
Below, I'm providing a fully worked example to show these concepts.
#include<stdio.h>
#include<cuda.h>
#include<cuda_runtime.h>
#include<device_launch_parameters.h>
#include<conio.h>
#define BLOCKSIZE_x 16
#define BLOCKSIZE_y 16
#define Nrows 3
#define Ncols 5
/*****************/
/* CUDA MEMCHECK */
/*****************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort = true)
{
if (code != cudaSuccess)
{
fprintf(stderr, "GPUassert: %s %s %dn", cudaGetErrorString(code), file, line);
if (abort) { getch(); exit(code); }
}
}
/*******************/
/* iDivUp FUNCTION */
/*******************/
int iDivUp(int hostPtr, int b){ return ((hostPtr % b) != 0) ? (hostPtr / b + 1) : (hostPtr / b); }
/******************/
/* TEST KERNEL 2D */
/******************/
__global__ void test_kernel_2D(float *devPtr, size_t pitch)
{
int tidx = blockIdx.x*blockDim.x + threadIdx.x;
int tidy = blockIdx.y*blockDim.y + threadIdx.y;
if ((tidx < Ncols) && (tidy < Nrows))
{
float *row_a = (float *)((char*)devPtr + tidy * pitch);
row_a[tidx] = row_a[tidx] * tidx * tidy;
}
}
/********/
/* MAIN */
/********/
int main()
{
float hostPtr[Nrows][Ncols];
float *devPtr;
size_t pitch;
for (int i = 0; i < Nrows; i++)
for (int j = 0; j < Ncols; j++) {
hostPtr[i][j] = 1.f;
//printf("row %i column %i value %f \n", i, j, hostPtr[i][j]);
}
// --- 2D pitched allocation and host->device memcopy
gpuErrchk(cudaMallocPitch(&devPtr, &pitch, Ncols * sizeof(float), Nrows));
gpuErrchk(cudaMemcpy2D(devPtr, pitch, hostPtr, Ncols*sizeof(float), Ncols*sizeof(float), Nrows, cudaMemcpyHostToDevice));
dim3 gridSize(iDivUp(Ncols, BLOCKSIZE_x), iDivUp(Nrows, BLOCKSIZE_y));
dim3 blockSize(BLOCKSIZE_y, BLOCKSIZE_x);
test_kernel_2D << <gridSize, blockSize >> >(devPtr, pitch);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaMemcpy2D(hostPtr, Ncols * sizeof(float), devPtr, pitch, Ncols * sizeof(float), Nrows, cudaMemcpyDeviceToHost));
for (int i = 0; i < Nrows; i++)
for (int j = 0; j < Ncols; j++)
printf("row %i column %i value %f \n", i, j, hostPtr[i][j]);
return 0;
}