
I am learning CUDA. I am having difficulty indexing dynamically sized arrays that I have transferred to the device.

As a toy problem, I am trying to sum two matrices on interior points only. I have found that with static arrays the result is as expected, but in practice my data will be too big to fit on the stack, so I need dynamically sized arrays.

I think static arrays work as expected because the data is packed: there are no gaps between adjacent elements in memory. The dynamic array, by contrast, is fragmented; each 'row' of the matrix is a separate allocation that can live anywhere on the heap. This shows in the result below, where the elements are not where I'd expect them to be.
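
To illustrate what I mean, here is a minimal sketch of the two layouts (not my actual code, same 6x6 shape as the MWE below):

/* Static: one contiguous block of 36 doubles.
   &A_static[i][j] == &A_static[0][0] + (i * 6 + j)           */
double A_static[6][6];

/* Dynamic: one block of 6 row pointers plus 6 separate row
   allocations that can live anywhere on the heap.            */
double **A_dyn = (double **)calloc(6, sizeof(double *));
for (int i = 0; i < 6; i++)
    A_dyn[i] = (double *)calloc(6, sizeof(double));

/* So cudaMemcpy(d_A, A_dyn, size, ...) copies the row
   *pointers* (reinterpreted as doubles), not the elements.   */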

I have found several examples online that show how to index 2D arrays on the device but they all use static arrays as examples.

I'd appreciate any advice on how to approach this.

My current idea is to flatten the dynamically sized matrices on the host first (i.e. convert each to a `double *` of `Nx * Ny * sizeof(double)` bytes) and then copy that to the device. But if I do that, I'm not sure I can keep the indexing scheme used in the kernel below (which I'd prefer to keep if possible).
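
Something like this sketch is what I have in mind, reusing `Nx`, `Ny`, and `d_A` from the MWE below:

// Flatten on the host: one contiguous block instead of Nx separate rows.
double *A = (double *)calloc(Nx * Ny, sizeof(double));
for (int i = 0; i < Nx; i++)
    for (int j = 0; j < Ny; j++)
        A[i * Ny + j] = 1.0;   // row-major; matches the kernel's
                               // index = col + row * Nx when Nx == Ny

// A single contiguous copy now moves all of the data.
cudaMemcpy(d_A, A, Nx * Ny * sizeof(double), cudaMemcpyHostToDevice);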

Dynamic Array Result

2.000000 2.000000 2.000000 0.000000 0.000000 2.000000 
2.000000 0.000000 0.000000 2.000000 2.000000 2.000000 
0.000000 2.000000 2.000000 2.000000 2.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 

Static Array Result

0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 2.000000 2.000000 2.000000 2.000000 0.000000 
0.000000 2.000000 2.000000 2.000000 2.000000 0.000000 
0.000000 2.000000 2.000000 2.000000 2.000000 0.000000 
0.000000 2.000000 2.000000 2.000000 2.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 

MWE

#include <math.h>   /* for ceil() used in the grid-size computation */
#include <stdio.h>
#include <stdlib.h>

#define THREADS_PER_BLOCK 2

__global__ void matrix_add(double *C, double *A, double *B, int Nx, int Ny);
double **create_matrix(int row, int col);

int main(void)
{
    int Nx = 6;
    int Ny = 6;
    // Create dynamically sized arrays on host
    double **A = create_matrix(Nx, Ny);
    double **B = create_matrix(Nx, Ny);
    double **C = create_matrix(Nx, Ny);
    // Fill arrays
    for (int i = 0; i < Nx; i++)
    {
        for (int j = 0; j < Ny; j++)
        {
            A[i][j] = 1.0;
            B[i][j] = 1.0;
            C[i][j] = 0.0;
        }
    }
    // Next create arrays for device
    double *d_A, *d_B, *d_C;
    // Allocate memory on GPU
    int size = Nx * Ny * sizeof(double);
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);
    // Copy host arrays over to device arrays.
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, C, size, cudaMemcpyHostToDevice);
    dim3 dimBlock(THREADS_PER_BLOCK, THREADS_PER_BLOCK);
    // Round up in floating point so the grid covers all Nx x Ny elements
    // (Nx / dimBlock.x alone is integer division and truncates)
    int blockdimX = (int)ceil((double)Nx / dimBlock.x);
    int blockdimY = (int)ceil((double)Ny / dimBlock.y);
    dim3 dimGrid(blockdimX, blockdimY);
    matrix_add<<<dimGrid, dimBlock>>>(d_C, d_A, d_B, Nx, Ny);
    // Copy back to CPU
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    // inspect result
    for (int i = 0; i < Nx; i++)
    {
        for (int j = 0; j < Ny; j++)
        {
            printf("%f ", C[i][j]);
        }
        printf("\n");
    }

    return EXIT_SUCCESS;
}

__global__ void matrix_add(double *C, double *A, double *B, int Nx, int Ny)
{
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    int index = col + row * Nx;
    if (row != 0 && col != 0 && row < (Ny - 1) && col < (Nx - 1))
    {
        C[index] = A[index] + B[index];
    }
}

double **create_matrix(int row, int col)
{
    double **mat;
    mat = (double **)calloc(row, sizeof(double *));
    for (int i = 0; i < row; i++)
    {
        mat[i] = (double *)calloc(col, sizeof(double));
    }
    return mat;
}
  • `cudaMemcpy` can't flatten your arrays of pointers and copy their contents. You must use a more complicated copy process, with one copy for each row/column to the correct device destination (and vice versa); see the per-row sketch after these comments. – talonmies Oct 25 '22 at 04:05
  • The way you create matrices on the host is not only bad for transferring them to the device, it is also very inefficient in CPU code due to cache locality etc. You always want to work on one big block of memory instead of allocating many small ones. If you really need to be able to interface via `mat[i][j]` instead of `mat[i * ncol + j]`, you could still create an additional array of pointers into that big blob of memory; just keep in mind that it adds a layer of indirection (dereferencing a pointer) which can still harm performance. – paleonix Oct 25 '22 at 12:30
  • If you are adamant about using "unflattened" arrays, you might want to look into unified memory (i.e. use `cudaMallocManaged` instead of `calloc`); see the unified-memory sketch below. It isn't ideal for performance but allows fast prototyping without worrying about copying between host and device. Later, one can arrive at better performance by adding prefetch statements. Another interesting point: the CUDA runtime API has functions for allocating flattened multidimensional arrays with padding, which improves performance for operations more complicated than matrix addition (e.g. matrix product). – paleonix Oct 25 '22 at 12:37
  • @paleonix Oh, sorry I did not realize that it was so bad. I will take a look at re-writing the code so that it is allocated as one long array on the CPU. I'm not "adamant" about it, I will certainly try to incorporate your suggestion if it is the better way to do things. – Nukesub Oct 25 '22 at 13:12
  • Nothing to be sorry about; it's a typical thing to do when starting out with C, because one usually starts with statically sized arrays, where it is less of a problem. If you were using C++ you could add nicer-looking accessor functionality; C++23 will hopefully bring `std::mdspan` for this kind of thing. In C it might still be a good idea to have inlined functions (or macros) like `void write_to(double *mat, int i, int j, double val)` and a corresponding read function (sketched below), so you don't have to change the same thing everywhere in your code when you decide to change the access pattern. – paleonix Oct 25 '22 at 15:28
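
A minimal sketch of the per-row copy described in the first comment, keeping the `double **` host matrices `A`/`C` and the flat device buffers `d_A`/`d_C` exactly as allocated in the MWE:

// Host -> device: copy each Ny-wide row into its offset in the flat buffer.
for (int i = 0; i < Nx; i++)
    cudaMemcpy(d_A + i * Ny, A[i], Ny * sizeof(double),
               cudaMemcpyHostToDevice);

// Device -> host: gather the flat result back into the row allocations.
for (int i = 0; i < Nx; i++)
    cudaMemcpy(C[i], d_C + i * Ny, Ny * sizeof(double),
               cudaMemcpyDeviceToHost);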
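
And a sketch of the unified-memory variant suggested above, assuming the same `matrix_add` kernel and `dimGrid`/`dimBlock` launch configuration as the MWE:

// One flat block visible to both host and device; no explicit cudaMemcpy.
double *A, *B, *C;
cudaMallocManaged((void **)&A, Nx * Ny * sizeof(double));
cudaMallocManaged((void **)&B, Nx * Ny * sizeof(double));
cudaMallocManaged((void **)&C, Nx * Ny * sizeof(double));

for (int i = 0; i < Nx * Ny; i++) { A[i] = 1.0; B[i] = 1.0; C[i] = 0.0; }

matrix_add<<<dimGrid, dimBlock>>>(C, A, B, Nx, Ny);
cudaDeviceSynchronize();   // wait for the kernel before reading C on the host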
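
Finally, a sketch of the inlined accessors mentioned in the last comment (the names `mat_get`/`mat_set` are illustrative, not a standard API):

// Hide the flattened indexing behind one pair of functions, so a later
// change of access pattern (e.g. adding padding) only touches this spot.
static inline double mat_get(const double *mat, int ncol, int i, int j)
{
    return mat[i * ncol + j];
}

static inline void mat_set(double *mat, int ncol, int i, int j, double val)
{
    mat[i * ncol + j] = val;
}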
