I am trying to initialize complex objects on the device, inside a kernel, across threads and blocks. I seem to have a problem with cudaDeviceSetLimit. As far as I understand the problem, I am not setting the device heap size correctly for the allocations each thread makes. This part of the documentation relates to my problem, but it does not initialize an object. I have also read this post but wasn't able to get my code working.
Edit
Regarding the first answer: allocating inside the kernel is a requirement in my setup, because I want to initialize the objects in parallel across blocks.
I have made the following toy example, which works for a small number of blocks (65) but not for 65535 blocks (the maximum number of blocks I could use on my device):
class NNode{
public:
    int node_id;
};
class cuNetwork{
public:
    int num_allnodes;
    NNode** all_nodes;
};
__global__ void mallocTest(int num_allnodes, cuNetwork** arr_gpu_net){
    int bId = blockIdx.x;
    cuNetwork* gpu_net = new cuNetwork();
    gpu_net->all_nodes = new NNode*[num_allnodes];
    for(int i=0; i<num_allnodes; i++){
        gpu_net->all_nodes[i] = new NNode();
    }
    arr_gpu_net[bId] = gpu_net;
}
int main(int argc, const char **argv){
    int numBlocks = 65;
    int num_allnodes = 200;
    cuNetwork** arr_gpu_net = new cuNetwork*[numBlocks];
    cudaMalloc((void **)&arr_gpu_net, sizeof(cuNetwork*) * numBlocks);
    size_t size;
    //for each block
    size = sizeof(cuNetwork);//new cuNetwork()
    size += sizeof(NNode*) * num_allnodes;//new NNode*[num_allnodes]
    size += sizeof(NNode) * num_allnodes; //for()... new NNode()
    //size = sizeof(cuNetwork) + (sizeof(int) * 2 + sizeof(NNode)) * num_allnodes;
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, numBlocks * size);
    mallocTest<<<numBlocks, 1>>>(num_allnodes, arr_gpu_net);
    cudaDeviceSynchronize();
    return 0;
}
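To make the size estimate in main concrete, here is a host-only C++ sketch of the same arithmetic, plus a padded variant. The padded variant is an assumption on my part: as I read the programming guide, in-kernel malloc returns pointers aligned to 16 bytes, so each of the 2 + num_allnodes allocations a block makes may occupy more than its payload, and any per-allocation bookkeeping overhead is not documented. The helper names (roundUp, payloadBytesPerBlock, paddedBytesPerBlock) and the granule and overhead parameters are hypothetical, not part of CUDA.

```cpp
#include <cstddef>

// Same layout as the classes in the toy example.
struct NNode { int node_id; };
struct cuNetwork { int num_allnodes; NNode** all_nodes; };

// Round n up to the next multiple of align (align must be nonzero).
constexpr std::size_t roundUp(std::size_t n, std::size_t align) {
    return (n + align - 1) / align * align;
}

// The estimate used in main: payload bytes only, per block.
constexpr std::size_t payloadBytesPerBlock(std::size_t numNodes) {
    return sizeof(cuNetwork)              // new cuNetwork()
         + sizeof(NNode*) * numNodes      // new NNode*[numNodes]
         + sizeof(NNode) * numNodes;      // numNodes times new NNode()
}

// Padded estimate (ASSUMPTION): every allocation is rounded up to `granule`
// bytes and costs `overhead` extra bytes. Both values are guesses to tune,
// not documented properties of the device heap.
constexpr std::size_t paddedBytesPerBlock(std::size_t numNodes,
                                          std::size_t granule,
                                          std::size_t overhead) {
    return roundUp(sizeof(cuNetwork), granule) + overhead
         + roundUp(sizeof(NNode*) * numNodes, granule) + overhead
         + numNodes * (roundUp(sizeof(NNode), granule) + overhead);
}
```

Assuming 64-bit pointers and a 4-byte int, payloadBytesPerBlock(200) is 2416 bytes, which comes to just under 151 MiB for 65535 blocks, while paddedBytesPerBlock(200, 16, 0) is 4816 bytes because each 4-byte NNode is rounded up to 16 bytes. If the padded figure is closer to what the device heap really consumes, numBlocks * size in main would be too small.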
As soon as I start adding additional properties to the objects, or if I increase numBlocks to 65535, I get the error:
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x555555efff90
Thread 1 "no_fun" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (7750,0,0), thread (0,0,0), device 0, sm 1, warp 3, lane 0]
0x0000555555f000b0 in mallocTest(int, cuNetwork**)<<<(65535,1,1),(1,1,1)>>> ()
My question is: in this example, how should I call cudaDeviceSetLimit so that the device heap has enough memory for each thread to initialize its cuNetwork?