As @jarod42 has pointed out, for an "automatic", "non-variable-length" C-style array as you have shown:
int values[2][3];
the storage format of such an array is identical to:
int values[2*3];
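If it helps to see that layout equivalence concretely, a quick host-only check (a sketch of my own, not part of the CUDA code below) confirms that the 2D array occupies one contiguous block of 2*3 ints:

#include <cassert>

int main(){
  int values[2][3] = {{0,1,2},{3,4,5}};
  int *flat = &values[0][0];              // pointer to the first of 6 contiguous ints
  assert(sizeof(values) == 2*3*sizeof(int));
  assert(flat[1*3+2] == values[1][2]);    // row*width+col reaches the same element
  return 0;
}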
This means that we could treat that array as a linear, singly-subscripted array (even though it is not) for purposes of transfer from host to device:
#define W 3
#define H 2
int values[H][W];
int *d_values;
cudaMalloc(&d_values, H*W*sizeof(int));
cudaMemcpy(d_values, values, H*W*sizeof(int), cudaMemcpyHostToDevice);
and for purposes of access in device code, using "simulated" 2D access:
__global__ void kernel(int *values, int width, ...){
  int col = threadIdx.x+blockDim.x*blockIdx.x;
  int row = threadIdx.y+blockDim.y*blockIdx.y;
  int my_value = values[row*width+col];
  ...
}

int main(){
  ...
  kernel<<<...>>>(d_values, W, ...);
  ...
}
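For reference, a complete, compilable version of that flattened pattern might look like the following sketch (the bounds check, initialization values, and launch configuration are my own assumptions, not taken from your code):

#include <stdio.h>
#define W 3
#define H 2
#define BSIZE 8

__global__ void kernel(int *values, int width, int height){
  int col = threadIdx.x+blockDim.x*blockIdx.x;
  int row = threadIdx.y+blockDim.y*blockIdx.y;
  if ((row < height)&&(col < width)){
    int my_value = values[row*width+col]; // "simulated" 2D access
    printf("row: %d, col: %d, value: %d\n", row, col, my_value);
  }
}

int main(){
  int values[H][W];
  for (int i = 0; i < H; i++)
    for (int j = 0; j < W; j++)
      values[i][j] = i+j;
  int *d_values;
  cudaMalloc(&d_values, H*W*sizeof(int));
  cudaMemcpy(d_values, values, H*W*sizeof(int), cudaMemcpyHostToDevice);
  dim3 block(BSIZE,BSIZE);
  dim3 grid((W+block.x-1)/block.x, (H+block.y-1)/block.y);
  kernel<<<grid,block>>>(d_values, W, H);
  cudaDeviceSynchronize();
  return 0;
}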
But based on the wording in your question:
Now I know that CUDA accepts 2D arrays in a linear form but how do I pass an already built array?
it seems you may be aware of the above approach, which I would generally refer to as "flattening" a 2D array to treat it in a linear fashion (perhaps with "simulated" 2D access).
In general, handling a 2D array of a width that is not known at compile time, while still allowing doubly-subscripted access in device code, is rather involved, and I would not recommend it, especially for CUDA beginners. But that is not actually the case you have presented:
a predefined 2D array to a kernel.
int values[2][3];
              ^
          the "width"
I take this to mean that the "width" (i.e. the range of the 2nd, i.e. the last, subscript) of the array is known at compile time. In that case, we can leverage the compiler to generate the necessary array indexing for us, making the transfer and usage process only slightly more complicated than the "flattened" case while still allowing doubly-subscripted access in the kernel:
$ cat t1023.cu
#include <stdio.h>
#define W 3
#define H 2
#define BSIZE 8
typedef int arrtype[W];
__global__ void kernel(arrtype *values, int width, int height){
  int col = threadIdx.x+blockDim.x*blockIdx.x;
  int row = threadIdx.y+blockDim.y*blockIdx.y;
  if ((row < height)&&(col < width)){
    int my_val = values[row][col]; // doubly-subscripted access
    printf("row: %d, col: %d, value: %d\n", row, col, my_val);
  }
}

int main(){
  int values[H][W];
  for (int i = 0; i < H; i++)
    for (int j = 0; j < W; j++)
      values[i][j] = i+j;
  arrtype *d_values;
  cudaMalloc(&d_values, H*W*sizeof(int));
  cudaMemcpy(d_values, values, H*W*sizeof(int), cudaMemcpyHostToDevice);
  dim3 block(BSIZE,BSIZE);
  dim3 grid((W+block.x-1)/block.x, (H+block.y-1)/block.y);
  kernel<<<grid,block>>>(d_values, W, H);
  cudaDeviceSynchronize();
  return 0;
}
$ nvcc -o t1023 t1023.cu
$ ./t1023
row: 0, col: 0, value: 0
row: 0, col: 1, value: 1
row: 0, col: 2, value: 2
row: 1, col: 0, value: 1
row: 1, col: 1, value: 2
row: 1, col: 2, value: 3
$
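If the kernel also wrote to the array, the return trip is symmetric: the same byte count applies, and the doubly-subscripted host array receives the data directly (a short continuation of the program above, assuming values and d_values as defined there):

cudaMemcpy(values, d_values, H*W*sizeof(int), cudaMemcpyDeviceToHost); // device -> host
cudaFree(d_values);                                                    // release device storage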
For a fully worked 3D (i.e. triply-subscripted) example, see here.
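The same typedef approach extends to 3D when both trailing dimensions are known at compile time. Here is a minimal sketch of my own (not the linked example; the D/H/W sizes and the kernel3 name are assumptions):

#include <stdio.h>
#define D 2
#define H 2
#define W 3
#define BSIZE 8
typedef int arr3type[H][W];  // one H x W "plane" of the 3D array

__global__ void kernel3(arr3type *values, int width, int height, int depth){
  int col = threadIdx.x+blockDim.x*blockIdx.x;
  int row = threadIdx.y+blockDim.y*blockIdx.y;
  int dep = threadIdx.z+blockDim.z*blockIdx.z;
  if ((dep < depth)&&(row < height)&&(col < width)){
    int my_val = values[dep][row][col]; // triply-subscripted access
    printf("dep: %d, row: %d, col: %d, value: %d\n", dep, row, col, my_val);
  }
}

int main(){
  int values[D][H][W];
  for (int k = 0; k < D; k++)
    for (int i = 0; i < H; i++)
      for (int j = 0; j < W; j++)
        values[k][i][j] = k+i+j;
  arr3type *d_values;
  cudaMalloc(&d_values, D*H*W*sizeof(int));
  cudaMemcpy(d_values, values, D*H*W*sizeof(int), cudaMemcpyHostToDevice);
  dim3 block(BSIZE,BSIZE,BSIZE); // 8*8*8 = 512 threads per block
  dim3 grid((W+block.x-1)/block.x, (H+block.y-1)/block.y, (D+block.z-1)/block.z);
  kernel3<<<grid,block>>>(d_values, W, H, D);
  cudaDeviceSynchronize();
  return 0;
}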