If you want to stick with what you seem to be after, namely one CUDA memory allocation per vertex, your approach is correct, but it is also inefficient and time-consuming.
It is inefficient because every CUDA allocation has an alignment requirement. This post (plus the CUDA documentation itself) states that any CUDA malloc will consume at least 256 bytes of global memory. As a result, no matter how small the outdegree of a vertex is, storing pointers with your approach will consume at least 256 bytes per vertex, and you will run out of memory very quickly as the graph grows. For example, suppose every vertex in your graph has an outdegree of 4. The size required for every vertex is 4*8=32 bytes assuming 64-bit addressing, yet every vertex will consume 256 bytes, 8 times more than needed. Note that the alignment requirement might be even larger. Your suggested approach therefore makes poor use of the available global memory.
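To make the arithmetic above concrete, here is a minimal host-side sketch that compares the two allocation strategies. The function names are hypothetical, and the 256-byte granularity is simply the minimum quoted above; the real value on a given device may be larger.

```cpp
#include <cassert>
#include <cstddef>

// Bytes reserved for one allocation when every request is rounded up to the
// allocation granularity. 256 is the minimum quoted in the text (assumption:
// the actual granularity on your device may be larger).
std::size_t reservedBytes(std::size_t requestedBytes, std::size_t granularity = 256) {
    assert(granularity > 0);
    return ((requestedBytes + granularity - 1) / granularity) * granularity;
}

// Total footprint when each vertex gets its own allocation.
std::size_t perVertexTotal(std::size_t vertexCount, std::size_t outdegree,
                           std::size_t elemBytes, std::size_t granularity = 256) {
    return vertexCount * reservedBytes(outdegree * elemBytes, granularity);
}

// Total footprint when all vertices share a single allocation.
std::size_t singleBufferTotal(std::size_t vertexCount, std::size_t outdegree,
                              std::size_t elemBytes, std::size_t granularity = 256) {
    return reservedBytes(vertexCount * outdegree * elemBytes, granularity);
}
```

With one million vertices, an outdegree of 4 and 8-byte elements, `perVertexTotal` gives 256,000,000 bytes while `singleBufferTotal` gives 32,000,000 bytes, which is the factor of 8 mentioned above.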
Your approach is also time-consuming. Memory allocations and deallocations, whether in host or device code, are expensive operations. You are allocating one memory region per vertex, and you also have to copy temp to the device once per vertex. Expect this to take a lot longer than allocating a single memory region once.
If you want to fill your d_ptr with pointers to the per-vertex data on the device, then instead of allocating one buffer per vertex you can sum the outdegree of all vertices once on the host side and allocate a single device buffer of that size.
// Allocate one device buffer for all vertices.
// Assumes outdegree is a host array with Count entries.
unsigned nOutEdges = 0;
for( int i = 0; i < Count; i++ )
    nOutEdges += outdegree[ i ];
unsigned* d_out_nbrs;
cudaMalloc( (void**)&d_out_nbrs, sizeof(unsigned) * nOutEdges );
// Collect pointers into the device buffer inside a host buffer.
// Vertex i's slice starts where vertex i-1's slice ends.
unsigned** host_array = (unsigned**) malloc( sizeof(unsigned*) * Count );
host_array[ 0 ] = d_out_nbrs;
for( int i = 1; i < Count; i++ )
    host_array[ i ] = host_array[ i - 1 ] + outdegree[ i - 1 ];
// Allocate a device buffer and copy the collected host buffer into it.
unsigned** d_ptr;
cudaMalloc( (void**)&d_ptr, sizeof(unsigned*) * Count );
cudaMemcpy( d_ptr, host_array, sizeof(unsigned*) * Count, cudaMemcpyHostToDevice );
// The host-side pointer table is no longer needed after the copy.
free( host_array );
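The pointer arithmetic above is an exclusive prefix sum over the outdegrees. The following host-only sketch isolates that step so it can be checked without a GPU. The names `sliceOffsets` and `slicePointers` are hypothetical, and `base` stands in for the device pointer (it is only offset here, never dereferenced).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: offsets[i] is the index in the shared buffer where
// vertex i's neighbor list starts; offsets.back() is the total edge count.
std::vector<std::size_t> sliceOffsets(const std::vector<unsigned>& outdegree) {
    std::vector<std::size_t> offsets(outdegree.size() + 1, 0);
    for (std::size_t i = 0; i < outdegree.size(); ++i)
        offsets[i + 1] = offsets[i] + outdegree[i];
    return offsets;
}

// Per-vertex pointers into one shared buffer, using the same arithmetic as
// host_array above. `base` may be a device pointer: it is only offset here.
std::vector<unsigned*> slicePointers(unsigned* base,
                                     const std::vector<unsigned>& outdegree) {
    const std::vector<std::size_t> offsets = sliceOffsets(outdegree);
    assert(offsets.size() == outdegree.size() + 1);
    std::vector<unsigned*> ptrs(outdegree.size());
    for (std::size_t i = 0; i < outdegree.size(); ++i)
        ptrs[i] = base + offsets[i];
    return ptrs;
}
```

One possible variation, if you do not strictly need a pointer per vertex, is to copy the offsets themselves to the device and index the shared buffer with them; this also keeps each vertex's outdegree available as `offsets[i + 1] - offsets[i]`.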