
I have a very large array in device memory, and I need to partition it into smaller parts. I wondered whether I could use an array of arrays to access the parts by index.

I tried the following code, but it returns garbage, which I think is due to undefined behavior. It compiles and runs without errors, and I don't know whether this approach is even possible.

#include <stdio.h>
#include <assert.h>
#include <iostream>
#include <cuda_runtime.h>

inline
cudaError_t checkCuda(cudaError_t result) {
#if defined(DEBUG) || defined(_DEBUG)
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
#endif
  return result;
}

__global__ void cudaVectorFill(int **array, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        array[0][i] = 1;
    else if (i < 2 * N)
        array[1][i - N] = 2;
    else if (i < 3 * N)
        array[2][i - 2 * N] = 3;
}

int main() {

    int N = 100000000;

    int **array = new int*[3];
 
    checkCuda( cudaMalloc(&array[0], N * sizeof(int)) );
    checkCuda( cudaMalloc(&array[1], N * sizeof(int)) );
    checkCuda( cudaMalloc(&array[2], N * sizeof(int)) );
 
    cudaVectorFill<<<(3 * N + 1023) / 1024, 1024>>>(array, N);

    checkCuda( cudaPeekAtLastError() );
 
    auto *host_array0 = new int[1];
    auto *host_array1 = new int[1];
    auto *host_array2 = new int[1];
 
    checkCuda( cudaMemcpy(host_array0, array[0], 1 * sizeof(int), cudaMemcpyDeviceToHost) );
    checkCuda( cudaMemcpy(host_array1, array[1], 1 * sizeof(int), cudaMemcpyDeviceToHost) );
    checkCuda( cudaMemcpy(host_array2, array[2], 1 * sizeof(int), cudaMemcpyDeviceToHost) );
 
    std::cout << *host_array0 << std::endl << *host_array1 << std::endl << *host_array2 << std::endl;

    return 0;
}

Output is something like:

707093096
707093104
707093112

The correct output should be:

1
2
3
halfer
  • The memory pointed to by `array` is host memory. You can't pass plain host pointers to CUDA kernels like that. You need to allocate memory for it on the GPU – talonmies Apr 17 '22 at 01:02
  • @talonmies Thanks. Can we do something similar to partition a big array into some smaller ones with indices? – Mojtaba Valizadeh Apr 17 '22 at 01:03
  • 2
    You can do it with pointer arithmetic or pointers. Just get the memory spaces correct – talonmies Apr 17 '22 at 01:04
  • @talonmies Could you please make a simple example for it? I really don't know how to use device pointers to allocate memory and use it by indices. I really appreciate your time. – Mojtaba Valizadeh Apr 17 '22 at 01:16
  • 1
    You could use arrays of a single dimension and calculate multidimensional indices manually. E.g. `[row*100+column]` – Sebastian Apr 17 '22 at 02:22
  • @Sebastian Thanks. I wish I could. However, as I mentioned in the question, my array is very big, and I cannot allocate such a large contiguous block of memory. So I need to partition it into smaller arrays, which is why I wondered if I could use an array of arrays in device memory. – Mojtaba Valizadeh Apr 17 '22 at 02:36
  • 2
    You mentioned both facts, but gave no actual reasoning in the question. It is unusual that you cannot allocate large blocks of device memory. What is your GPU memory size and what is the largest block you can successfully allocate (without any allocations before - as far as possible, if you use the graphics card also for displaying on the screen, there will be some memory already allocated and used). Can you also use `cudaGetMemInfo` and post the two results, please? – Sebastian Apr 17 '22 at 06:08
  • @Sebastian Many thanks for your comment. Actually, I am using Google Colab Pro, which has about 25GB of RAM. I need two arrays of a billion uint64_t values and two similar arrays of uint32_t values. Although 25GB should be enough, it apparently cannot allocate such large contiguous blocks of memory and returns out of memory. That is why, following some advice, I decided to partition it into smaller arrays. – Mojtaba Valizadeh Apr 17 '22 at 22:20
  • The host RAM may be barely enough (with 25GB, though you could transfer and process the memory block-wise), but not the GPU memory. You likely get a K80, P100 or T4, which have either 12GB or 16GB of GPU RAM, or a card that acts like two separate GPUs with 12GB each. So even partitioning will seemingly not be enough. – Sebastian Apr 18 '22 at 04:38
  • 2
    They also have A100 with 40GB, but you probably need to order a pro+ account and be lucky to get an A100 assigned. – Sebastian Apr 18 '22 at 04:46
  • @Sebastian I really appreciate your time for your useful comments. – Mojtaba Valizadeh Apr 18 '22 at 23:35

2 Answers


As noted in the comments, if you pass pointers to a GPU kernel, they have to be accessible to the GPU. That means you must either explicitly allocate a device copy of the host array of device pointers and populate it there, or rely on managed or otherwise GPU-accessible host memory.

One approach that will probably work in this case is:

int N = 100000000;

int **array = new int*[3];
 
checkCuda( cudaMalloc(&array[0], N * sizeof(int)) );
checkCuda( cudaMalloc(&array[1], N * sizeof(int)) );
checkCuda( cudaMalloc(&array[2], N * sizeof(int)) );

int **array_d;
checkCuda( cudaMalloc(&array_d, 3 * sizeof(int*)) );
checkCuda( cudaMemcpy(array_d, array, 3 * sizeof(int*), cudaMemcpyHostToDevice) );
 
cudaVectorFill<<<(3 * N + 1023) / 1024, 1024>>>(array_d, N);

[Standard disclaimer, code written in browser, no guarantees implied or given, use at own risk]

That is, after building `array` in host memory, make a copy of it in GPU memory and pass that GPU copy to your kernel. There might be other problems in your code; I haven't analyzed it beyond the first six lines.

talonmies

FYI, I just found another approach for 2D allocation in device memory. See method 3 in this example for more information. So we can use something like:

int N = 100000000;

int **array;
checkCuda( cudaMallocManaged(&array, 3 * sizeof(int *)) );
checkCuda( cudaMallocManaged(&(array[0]), N * sizeof(int)) );
checkCuda( cudaMallocManaged(&(array[1]), N * sizeof(int)) );
checkCuda( cudaMallocManaged(&(array[2]), N * sizeof(int)) );

cudaVectorFill<<<(3 * N + 1023) / 1024, 1024>>>(array, N);

It also worked fine.