
I am trying to initialize complex objects on my device, within threads and within blocks. It seems to me that I have a problem with cudaDeviceSetLimit: as I understand it, I am not reserving the right amount of heap memory for the allocations performed by each thread. This part of the documentation refers to my problem, but it does not initialize an object. I have also read this post, but I wasn't able to get my code working.

Edit

Contrary to the first answer: doing this inside the kernel is a requirement in my problem configuration, because I want to take advantage of initializing the objects across blocks in parallel.

I have made the following toy example, which works for a low number of blocks (65) but not for 65535 blocks (the maximum number of blocks I can use on my device):

class NNode{

    public:

        int node_id;
};

class cuNetwork{

    public:

        int num_allnodes;
        NNode** all_nodes; 

};

__global__ void mallocTest(int num_allnodes, cuNetwork** arr_gpu_net){

    int bId = blockIdx.x; 
    cuNetwork* gpu_net  = new cuNetwork(); 
    gpu_net->all_nodes = new NNode*[num_allnodes];

    for(int i=0; i<num_allnodes; i++){

            gpu_net->all_nodes[i] = new NNode();
    }

    arr_gpu_net[bId] = gpu_net;

}

int main(int argc, const char **argv){

    int numBlocks = 65; 
    int num_allnodes = 200; 

    cuNetwork** arr_gpu_net = new cuNetwork*[numBlocks];
    cudaMalloc((void **)&arr_gpu_net, sizeof(cuNetwork*) * numBlocks);

    size_t size; 
    //for each block
    size = sizeof(cuNetwork);//new cuNetwork()
    size += sizeof(NNode*) * num_allnodes;//new NNode*[num_allnodes] 
    size += sizeof(NNode) * num_allnodes; //for()... new NNode()
    
    //size = sizeof(cuNetwork) + (sizeof(int) * 2 + sizeof(NNode)) * num_allnodes;
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, numBlocks * size);
    mallocTest<<<numBlocks, 1>>>(num_allnodes, arr_gpu_net);

    cudaDeviceSynchronize();

    return 0;

}

As soon as I add additional properties to the objects, or increase numBlocks to 65535, I get the error:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x555555efff90

Thread 1 "no_fun" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (7750,0,0), thread (0,0,0), device 0, sm 1, warp 3, lane 0]
0x0000555555f000b0 in mallocTest(int, cuNetwork**)<<<(65535,1,1),(1,1,1)>>> ()

My question is: in this example, how should I set cudaDeviceSetLimit in order to reserve the correct amount of memory for the per-thread initialization of cuNetwork?

  • You are making all sorts of assumptions about the behaviour of memory allocation on the device that are likely not true, especially alignment and granularity. The idea that calling new n times to create n objects will require exactly n times the size of the object in bytes is almost never correct – talonmies Nov 20 '21 at 04:12

1 Answer

To answer your question:

Due to memory padding and allocation granularity, each block probably requires more memory than the calculated size. You should always check the return value of new. If it is nullptr, the allocation failed.


However, if the total number of nodes across all networks is known up front, it would be more efficient to just cudaMalloc the memory for all nodes (and all networks) in one go. Then, in the kernel, just update the pointers accordingly.

Something like this:

struct cuNetwork2{
    int num_allnodes;
    NNode* all_nodes;
};

__global__ void kernel(cuNetwork2* d_networks, NNode* d_nodes, int numNodesPerNetwork){
   int index = ...
   d_networks[index].num_allnodes = numNodesPerNetwork;
   d_networks[index].all_nodes = d_nodes + index * numNodesPerNetwork;
}

...

int numBlocks = 65; 
int num_allnodes = 200;

cuNetwork2* d_networks;
NNode* d_nodes;
cudaMalloc(&d_networks, sizeof(cuNetwork2) * numBlocks);
cudaMalloc(&d_nodes, sizeof(NNode) * numBlocks * num_allnodes);

kernel<<<numBlocks, 1>>>(d_networks, d_nodes, num_allnodes);

In this case, you don't need cudaDeviceSetLimit or in-kernel dynamic allocation.
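With this flat layout, "node i of network bId" is just an offset into one contiguous array, so each network can still iterate its own nodes from index 0. A small host-side sketch of the same pointer fix-up the kernel performs (names mirror the answer's code; nothing CUDA-specific is needed to check the arithmetic):

```cpp
#include <cassert>

struct NNode { int node_id; };
struct cuNetwork2 { int num_allnodes; NNode* all_nodes; };

int main() {
    const int numBlocks = 4, numNodesPerNetwork = 3;
    NNode nodes[numBlocks * numNodesPerNetwork];   // one flat node array
    cuNetwork2 networks[numBlocks];

    // Same pointer assignment the kernel does, one network per "block".
    for (int bId = 0; bId < numBlocks; ++bId) {
        networks[bId].num_allnodes = numNodesPerNetwork;
        networks[bId].all_nodes = nodes + bId * numNodesPerNetwork;
    }

    // Each network still iterates its own nodes starting from index 0,
    // and those entries alias distinct slices of the flat array.
    for (int bId = 0; bId < numBlocks; ++bId)
        for (int i = 0; i < numNodesPerNetwork; ++i)
            assert(&networks[bId].all_nodes[i]
                   == &nodes[bId * numNodesPerNetwork + i]);
    return 0;
}
```
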

Here is a simple performance comparison between the two approaches.

#include <iostream>
#include <chrono>
#include <cassert>

class NNode{

public:

    int node_id;
};

class cuNetwork{

public:

    int num_allnodes;
    NNode** all_nodes; 

};

__global__ void mallocTest(int num_allnodes, cuNetwork** arr_gpu_net){

    int bId = blockIdx.x; 
    cuNetwork* gpu_net  = new cuNetwork(); 
    gpu_net->all_nodes = new NNode*[num_allnodes];

    for(int i=0; i<num_allnodes; i++){

            gpu_net->all_nodes[i] = new NNode();
    }

    arr_gpu_net[bId] = gpu_net;

}


__global__ void mallocTestMultiThreadBlock(int num_allnodes, cuNetwork** arr_gpu_net){

    int bId = blockIdx.x; 
    if(threadIdx.x == 0){
        cuNetwork* gpu_net  = new cuNetwork(); 
        assert(gpu_net != nullptr);

        gpu_net->all_nodes = new NNode*[num_allnodes];
        assert(gpu_net->all_nodes != nullptr);

        arr_gpu_net[bId] = gpu_net;
    }
    __syncthreads();

    for(int i=threadIdx.x; i<num_allnodes; i += blockDim.x){
        arr_gpu_net[bId]->all_nodes[i] = new NNode();
    }


}


struct cuNetwork2{
    int num_allnodes;
    NNode* all_nodes;
};

__global__ 
void pointerassignmentkernel(cuNetwork2* d_networks, NNode* d_nodes, int numNodesPerNetwork){
    int bId = blockIdx.x; 
    d_networks[bId].num_allnodes = numNodesPerNetwork;
    d_networks[bId].all_nodes = d_nodes + bId * numNodesPerNetwork;
}

__global__ 
void nodeinitkernel(NNode* d_nodes, int totalNumNodes){
    const int id = threadIdx.x + blockIdx.x * blockDim.x;
    if(id < totalNumNodes){
        new (&d_nodes[id]) NNode();
    }
}

int main(int argc, const char **argv){

    int numBlocks = 64; 
    int num_allnodes = 200; 

    cuNetwork** arr_gpu_net = new cuNetwork*[numBlocks];
    cudaMalloc((void **)&arr_gpu_net, sizeof(cuNetwork*) * numBlocks);

    size_t size; 
    //for each block
    size = sizeof(cuNetwork);//new cuNetwork()
    size += sizeof(NNode*) * num_allnodes;//new NNode*[num_allnodes] 
    size += sizeof(NNode) * num_allnodes; //for()... new NNode()

    //size = sizeof(cuNetwork) + (sizeof(int) * 2 + sizeof(NNode)) * num_allnodes;
    
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1ull * 1024ull * 1024ull * 1024ull); // I set this to 1GB which did not cause errors for the given problem size

    std::chrono::time_point<std::chrono::system_clock> timeA = std::chrono::system_clock::now();
    mallocTest<<<numBlocks, 1>>>(num_allnodes, arr_gpu_net);
    //mallocTestMultiThreadBlock<<<numBlocks, num_allnodes>>>(num_allnodes, arr_gpu_net);

    cudaError_t status = cudaDeviceSynchronize();
    assert(status == cudaSuccess);
    std::chrono::time_point<std::chrono::system_clock> timeB = std::chrono::system_clock::now();

    std::cerr << "mallocTest kernel: " << std::chrono::duration<double>(timeB - timeA).count() << "s\n";


    


    std::chrono::time_point<std::chrono::system_clock> timeC = std::chrono::system_clock::now();
    cuNetwork2* d_networks;
    NNode* d_nodes;
    cudaMalloc(&d_networks, sizeof(cuNetwork2) * numBlocks);
    cudaMalloc(&d_nodes, sizeof(NNode) * numBlocks * num_allnodes);
    std::chrono::time_point<std::chrono::system_clock> timeD = std::chrono::system_clock::now();
    
    std::cerr << "host cudaMalloc: " << std::chrono::duration<double>(timeD - timeC).count() << "s\n";
    
    pointerassignmentkernel<<<numBlocks, 1>>>(d_networks, d_nodes, num_allnodes);
    
    status = cudaDeviceSynchronize();
    assert(status == cudaSuccess);
    std::chrono::time_point<std::chrono::system_clock> timeE = std::chrono::system_clock::now();
    
    std::cerr << "pointerassignmentkernel: " << std::chrono::duration<double>(timeE - timeD).count() << "s\n";   

    nodeinitkernel<<<(numBlocks * num_allnodes + 128 - 1) / 128, 128>>>(d_nodes, numBlocks * num_allnodes);
    status = cudaDeviceSynchronize();
    assert(status == cudaSuccess);
    std::chrono::time_point<std::chrono::system_clock> timeF = std::chrono::system_clock::now();
    
    std::cerr << "nodeinitkernel: " << std::chrono::duration<double>(timeF - timeE).count() << "s\n"; 
    
    cudaDeviceReset();

    return 0;

}
Compiled with: nvcc -arch=sm_61 -O3 kernelallocation.cu -o kernelallocation
mallocTest kernel: 0.0183772s
host cudaMalloc: 5.02e-06s
pointerassignmentkernel: 1.2739e-05s
nodeinitkernel: 1.213e-05s
  • Perhaps I wasn't clear enough, but this is not the answer I am looking for. I am aware that I could do this in a simpler way, but I want, and must, initialize the networks in parallel on the device. I don't want to do the initialization serially outside the kernel, so I can't accept this as an answer. I will edit my question – Joachim Nov 18 '21 at 18:20
  • Additionally, in your answer, the pointer array to the nodes is the same across all blocks, which is not what I want, because each thread must have its own array of node pointers. – Joachim Nov 18 '21 at 18:39
  • The kernel argument is the same, yes, but `d_networks[bId].all_nodes` will contain a different address per thread (per thread block with 1 thread each) – Abator Abetor Nov 18 '21 at 19:53
  • In-kernel allocation is horribly slow. I added a benchmark showing that the "serial" approach on the host can be orders of magnitude faster. – Abator Abetor Nov 18 '21 at 19:54
  • The reason the host allocation in your example is much faster is that you allocate the memory for the nodes but do not initialize the node objects, so the object initialization is missing compared to the first kernel call. Correct me if you see this another way. Nevertheless, this `1ull * 1024ull * 1024ull * 1024ull` is a really good point I wasn't aware of – Joachim Nov 18 '21 at 20:34
  • Added a kernel which calls the constructor of each node. It does not change the overall picture. I do not understand your argument about pointers pointing to the same node: there are 12800 unique nodes. Nodes 0 - 199 belong to network 0, nodes 200 - 399 belong to network 1, and so on. My code does not use 2d arrays per network as in your code, but a 1d array. – Abator Abetor Nov 18 '21 at 20:59
  • Ok, forget my last argument. I misunderstood the "d_nodes + bId * numNodesPerNetwork". Now I see... This makes sense. Let me study it in a deeper way but I think I have some good inputs here. Thank you very much – Joachim Nov 18 '21 at 21:30
  • Ok so I tried your trick, which indeed works well with respect to initialising all nodes and putting the objects in the right place. The problem is, I cannot iterate over arrays within objects starting from position 0. For example, I cannot iterate through `all_nodes` of a cuNetwork object starting from 0. It is a problem because it's something I do a lot later in the code. Do you perhaps have a trick for this one too? – Joachim Nov 27 '21 at 11:44
  • I suggest asking a new question, showing the code which you are having trouble with. – Abator Abetor Nov 27 '21 at 11:51
  • Ok, indeed, I think it will be much more clear. I will first do a bit of research myself – Joachim Nov 27 '21 at 11:53