-2

I have an array of arrays (with different lengths) in C that I would like to process within the "CUDA kernel".

const int N_ARRAYS = 1000;
int *arrayOfArrays[N_ARRAYS];
int arr1[3] = {1,2,3};
int arr2[2] = {1,4};
int arr3[4] = {1,5,3,6};
//....
int arr1000[5] = {9,9,9,10,10};

arrayOfArrays[0] = arr1;
arrayOfArrays[1] = arr2;
arrayOfArrays[2] = arr3;
//...
arrayOfArrays[1000] = arr1000;

I found this post: CUDA allocating array of arrays, which gave a good idea of how it could work. But honestly, I did not get it to work.

I will summarize the steps again:

  1. You have to allocate the pointers to a host memory,
  2. then allocate device memory for each array
  3. and store its pointer in the host memory.
  4. Then allocate the memory for storing the pointers into the device
  5. and then copy the host memory to the device memory.

Here is what I tried so far, based on the given answer. For illustration purposes I will show it with N_ARRAYS = 3, but in reality it's way higher (>1000).

int main(){
    const int N_ARRAYS = 3;
    int *arrayOfArrays[N_ARRAYS];
    int arr1[1] = {1,2,3};
    int arr2[2] = {1,4};
    int arr3[3] = {1,5,3};

    arrayOfArrays[0] = arr1;
    arrayOfArrays[1] = arr2;
    arrayOfArrays[2] = arr3;

    // 1) You have to allocate the pointers to a host memory, 
    //void *h_array = malloc(sizeof(void*) * N_ARRAYS); // i use arrayOfArrays instead
    for(int i = 0; i < N_ARRAYS; i++){
        //2) then allocate device memory for each array
        cudaMalloc(&arrayOfArrays[i], i * sizeof(void*));
    }

    // 4) Allocate the memmory for storing the pointers into the device to *d_array
    void *d_array = cudaMalloc(sizeof(void*) * N_ARRAYS);

    // 5) Copy arrayOfArrays to d_array of size sizeof(void*) * N_ARRAYS from Host to device
    cudaMemcpy(d_array, arrayOfArrays, sizeof(void*) * N_ARRAYS, cudaMemcpyHostToDevice);

    // Call kernel
    multi_array_kernel<1,1>(N_ARRAYS, d_array);
    cudaThreadSynchronize();

    for(int i = 0; i < N_ARRAYS; i++){
        cudaFree(arrayOfArrays[i]); //host not device memory
        //TODO: check error
    }
    cudaFree(d_array);
    free(arrayOfArrays);
}

and the kernel:

__global__ void multi_array_kernel( int N, void** arrays ){
    int nr;
    int sum = 0;
    for(nr = 0; nr < N; nr++){
        if(arrays[nr+0] == arrays[nr-1+0]) sum +=1; // some sample calc.
    }

}
Tlatwork
  • 1,445
  • 12
  • 35

1 Answer

1
  1. Your array sizes don't make sense:

    int arr1[1] = {1,2,3};
             ^
             array length of 1 integer storage
    

    You can't initialize an array of length 1 with 3 integers. I would think that would throw a compile error.

  2. You're already using arrayOfArrays[] to store the pointers to the host arrays, we'll need another similar variable to store the array of corresponding device pointers. What you have now is just overwriting the previous values there when you use it in cudaMalloc

  3. Your cudaMalloc operation in the for-loop is not set up correctly. This should be allocating space for each array of integers (e.g. arr1 etc.) So we need to write something like this:

    cudaMalloc(&darrayOfArrays[i], arr_len[i] * sizeof(arr[0]));
    
  4. Nowhere do you copy the contents of arr1 to the device (and similarly for arr2, arr3). So you're missing a step. We can easily get that done in your first for-loop, e.g.

    cudaMemcpy(darrayOfArrays[i], arrayOfArrays[i], arr_len[i]*sizeof(int), cudaMemcpyHostToDevice); // copy contents of each array to device
    
  5. This is not how cudaMalloc works:

      void *d_array = cudaMalloc(sizeof(void*) * N_ARRAYS);
    

    You already seem to know how it works based on your other usage.

  6. Your kernel code is going to index out-of-bounds when nr is 0:

    if(arrays[nr+0] == arrays[nr-1+0]) sum +=1;
                              ^
                              out of bounds when nr = 0
    

    Furthermore, I assume the whole point of this exercise is to be able to do doubly-subscripted array indexing (otherwise your kernel code doesn't make much sense, unless you are intending to compare pointers).

  7. This is not kernel launch syntax:

    multi_array_kernel<1,1>(N_ARRAYS, d_array);
    
  8. If you want to handle arrays of different lengths, we'll need to keep track of array lengths somewhere, and use that during allocation/copying.

Here's a fixed example with the above items addressed:

$ cat t103.cu
#include <stdio.h>

// Walks the device array-of-device-pointers and, for each adjacent pair of
// arrays, prints a message when their first elements are equal.
// `arrays` must be a DEVICE pointer to N DEVICE pointers (set up on the host
// via the two-step cudaMalloc/cudaMemcpy scheme shown in main).
// Launched as <<<1,1>>>, so the loop runs serially in a single thread.
__global__ void multi_array_kernel( int N, int **arrays ){
    // start at 1 so arrays[nr-1] is always in bounds
    for(int nr = 1; nr < N; nr++){
        bool firstElementsMatch = (arrays[nr][0] == arrays[nr-1][0]);
        if(firstElementsMatch)
            printf("match at index: %d  to index: %d\n", nr, nr-1); // some sample calc.
    }
}

int main(){
    const int N_ARRAYS = 3;
    int *arrayOfArrays[N_ARRAYS];   // host-side pointers to the HOST arrays
    int *darrayOfArrays[N_ARRAYS];  // host-side copies of the per-array DEVICE pointers
    int arr1[3] = {1,2,3};
    int arr2[2] = {1,4};
    int arr3[3] = {1,5,3};
    int **d_array;                  // device-side array of the device pointers
    int arr_len[N_ARRAYS] = {3, 2, 3}; // lengths differ, so track them explicitly
    arrayOfArrays[0] = arr1;
    arrayOfArrays[1] = arr2;
    arrayOfArrays[2] = arr3;

    // 1) You have to allocate the pointers to a host memory,
    //void *h_array = malloc(sizeof(void*) * N_ARRAYS); // i use arrayOfArrays instead
    for(int i = 0; i < N_ARRAYS; i++){
        //2) then allocate device memory for each array
        cudaMalloc(&(darrayOfArrays[i]), arr_len[i] * sizeof(int));
        // 3) copy that array's contents into its device allocation
        cudaMemcpy(darrayOfArrays[i], arrayOfArrays[i], arr_len[i]*sizeof(int), cudaMemcpyHostToDevice); // copy contents of each array to device
    }

    // 4) Allocate the memory for storing the pointers into the device to *d_array
    cudaMalloc(&d_array, sizeof(int*) * N_ARRAYS);

    // 5) Copy darrayOfArrays (the DEVICE pointers, not arrayOfArrays) to d_array,
    //    sizeof(int*) * N_ARRAYS bytes, host to device
    cudaMemcpy(d_array, darrayOfArrays, sizeof(int*) * N_ARRAYS, cudaMemcpyHostToDevice);

    // Call kernel (single thread is enough for this serial demo)
    multi_array_kernel<<<1,1>>>(N_ARRAYS, d_array);
    cudaDeviceSynchronize();

    for(int i = 0; i < N_ARRAYS; i++){
        cudaFree(darrayOfArrays[i]); // free each per-array DEVICE allocation
        //TODO: check error
    }
    cudaFree(d_array);
    // free(arrayOfArrays);  // not needed: arrayOfArrays is stack storage, never malloc'd
    printf("%s\n", cudaGetErrorString(cudaGetLastError())); // report any deferred CUDA error
}
$ nvcc -o t103 t103.cu
$ ./t103
match at index: 1  to index: 0
match at index: 2  to index: 1
no error
$

Note that due to the "complexity" of the above, the usual advice is just to "flatten" your storage. If you search on the cuda tag in the search bar at the top of this page with [cuda] flatten you'll find plenty of definitions and examples.

Robert Crovella
  • 143,785
  • 11
  • 213
  • 257