Allocating a graph using adjacency list in cuda?

Question

I want to allocate a graph using adjacency list i.e array of 'V' pointers each pointing to an array which will have the adjacent vertex (which would be unequal) so

unsigned **d_ptr; 
cudaMalloc(&d_ptr, sizeof(unsigned *)*Count);
  for(int i=0;i<Count;i++)
  {   
     cudaMalloc(&temp, sizeof(unsigned)*outdegree(i));  
  }

I can copy the temp pointer to d_ptr[i] but is there a better way to do this?

For complex data structures with lots of pointers between data elements (e.g. linked lists), UVA (Unified Virtual Addressing) offers code simplification at the potential expense of performance (the latter really depends on how you access the data from the kernel). Other than that, just deep-copy everything from host to device, replicating the structure of what points where. — void_ptr, Feb 23 '15 at 20:55

score 2 · Answer 1 · edited May 23 '17 at 12:28

If you want to stick to what you desire, which seems to be one CUDA memory allocation per vertex, your approach is correct, but is also inefficient and time-consuming.

It is inefficient because every CUDA allocation has an alignment requirement. This post (plus the CUDA documentation itself) tells that any CUDA malloc will consume at least 256 bytes of global memory. As a result, no matter how small the outdegree of your vertex is; saving pointers using your approach will consume 256 bytes per vertex. This will result in running out of memory very quickly as the graph size increases. For example, consider that in your graph every vertex has outdegree equal to 4. While the size required for every vertex is 4*8=32 assuming 64-bit addressing, every vertex will consume 256 bytes, 8 times more than what needed. Note that alignment requirement might be even more. Therefore, your suggested approach is poorly utilizing available global memory.

Your approach is also time-consuming. Memory allocations and deallocations, in the host or the device code, are time-consuming operations. You are allocating one memory region per vertex. You also have to copy temp to the device one time per vertex. So expect that it will take a lot more compared to when you allocate a memory region once.

If you want to fill your d_ptr with the pointers to the vertices on the device, instead of allocating one buffer per vertex, you can count the total number of outdegree for all the vertices once at the host side, and allocate one device buffer using it.

// Allocate one device buffer for all vertices.
unsigned nOutEdges = 0;
for( int i=0; i < Count; i++ )
    nOutEdges += outdegree(i); // outdegree[ i ]??
unsigned* d_out_nbrs;
cudaMalloc( (void**)&d_out_nbrs, sizeof(unsigned) * nOutEdges );

// Collect pointers to the device buffer inside a host buffer.
unsigned** host_array = (unsigned**) malloc( sizeof(unsigned*) * Count );
host_array[ 0 ] = d_out_nbrs;
for( int i=1; i < Count; i++ )
    host_array[ i ] = host_array[ i - 1 ] + outdegree[ i - 1 ];

// Allocate a device buffer and copy collected host buffer into the device buffer.
unsigned **d_ptr; 
cudaMalloc( &d_ptr, sizeof(unsigned *) * Count );
cudaMemcpy( d_ptr, host_array , sizeof(unsigned*) * Count, cudaMemcpyHostToDevice );

I was actually also using the CSR representation as your answer suggests but I wanted to use this representation for dynamic graphs and compare the performance of using CSR vs pure adjacency list for dynamic graphs. — hitish, Mar 04 '15 at 20:06

jschultz410 · Answer 2 · 2015-03-02T05:18:23.307

If the outdegree() of your vertexes is small, won't vary too much and you are OK with some wasted space (note that you are already wasting at least Count pointers' worth of space by double allocating), then you may prefer to do a single allocation like so:

// compute max outdegree; here I'm assuming you don't already know it
int max_outdegree = -1;

for (i = 0; i < Count; ++i)
  if (outdegree(i) > max_outdegree)
    max_outdegree = outdegree(i);

// alloc d_ptr in one flat array
unsigned *d_ptr;

cudaMalloc(&d_ptr, sizeof(unsigned)*Count*max_outdegree);

...

And then index into d_ptr like so:

unsigned ith_adj_vert_of_v = d_ptr[v * max_outdegree + i];  // use and check for -1 to indicate no such vertex

Also, do you have so many vertexes that you need 32 bits to count / ID them all? If you have less than 2^16, you can use uint16_t instead and likely cut your space usage roughly in half.

Allocating a graph using adjacency list in cuda?

2 Answers2