
I have the following algorithm that runs parallel BFS using CUDA 7.5. The function takes in an array of edges and an array of vertices. An edge is defined as

typedef struct Edge
{
    int first;
    int second;
} Edge;

The vertices array is initialized to -1 for all vertices except the starting vertex, which is 0. Thus we have something like

    vertex:  0  1  2  3  4  5 ..... 1023
    value:  -1 -1 -1 -1  0 -1 ..... -1

In this case the starting vertex is 4. Assuming a sparse graph, the edge list holds data in the following manner (the values live in the Edge structs, but I've shown the integer representation):

    edge:    1  2  3  4  5 ....... 2048
    first:   3 17 12  1  3 ....... 2010
    second:  9 34 20  9 17 ....... 196
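
For concreteness, the host-side setup looks roughly like this (a minimal sketch; `h_vertices`, `h_edges`, `NUM_VERTICES`, `NUM_EDGES`, and `START_VERTEX` are illustrative names, and the actual generation code is in the repo linked at the end of this question):

    #define NUM_VERTICES 1024
    #define NUM_EDGES    2048
    #define START_VERTEX 4

    int  h_vertices[NUM_VERTICES];
    Edge h_edges[NUM_EDGES];

    // every vertex starts unvisited (-1) except the start vertex, which is at depth 0
    for (int i = 0; i < NUM_VERTICES; i++)
        h_vertices[i] = -1;
    h_vertices[START_VERTEX] = 0;

    // h_edges[i].first / h_edges[i].second hold vertex indices, each in [0, 1023]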

The BFS is supposed to run in parallel with 2048 threads, one per edge: any thread whose edge has an endpoint at the current depth (initially the start vertex at depth 0) writes the depth to the other endpoint's entry in the vertices array and sets a modified flag to 1. This is the BFS code.

__global__ void bfs(Edge* edges, int* vertices, int current_depth, int* modified){

    // one thread per edge
    int e = blockDim.x * blockIdx.x + threadIdx.x;

    // debug bounds checks: vertices has 1024 entries, so valid indices are 0-1023
    int vfirst = edges[e].first;
    if (vfirst > 1023) {printf("oops %d:%d\n", e, vfirst); return;}
    int dfirst = vertices[vfirst];

    int vsecond = edges[e].second;
    if (vsecond > 1023) {printf("oops %d:%d\n", e, vsecond); return;}
    int dsecond = vertices[vsecond];

    // if one endpoint is on the current frontier and the other is unvisited,
    // mark the unvisited endpoint and record that something changed
    if((dfirst == current_depth) && (dsecond == -1)){
        vertices[vsecond] = current_depth;
        printf("e:%d  depth:%d\n", e, current_depth);
        __syncthreads();
        *modified = 1;
        printf("%d\n", *modified);
    }else if((dsecond == current_depth) && (dfirst == -1)){
        vertices[vfirst] = current_depth;
        printf("e:%d depth:%d\n", e, current_depth);
        __syncthreads();
        *modified = 1;
        printf("%d\n", *modified);
    }
}

This BFS kernel is called repeatedly by the main code, which increments current_depth every time it is called. This is the relevant section of the calling code.

    begin = clock();

    do{

        h_modified = 0;
        //printf("Entered while loop\n");
        err = cudaMemcpy(d_modified, &h_modified, sizeof(int), cudaMemcpyHostToDevice);
        if (err != cudaSuccess)
        {
            fprintf(stderr, "Failed to copy h_done to device(error code %s)!\n", cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }

        printf("CUDA kernel launching with %d blocks of %d threads\n", edgeBlocks, threadsPerBlock);

        bfs<<<edgeBlocks, threadsPerBlock>>>(d_edges, d_vertices, current_depth, d_modified);
        cudaThreadSynchronize();

        err = cudaGetLastError();
        if (err != cudaSuccess)
        {
            fprintf(stderr, "Failed to launch bfs kernel (error code %s)!\n", cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
        //printf("Second kernel launch finished\n");

        err = cudaMemcpy(&h_modified, d_modified, sizeof(int), cudaMemcpyDeviceToHost);
        printf("%d\n", h_modified);
        if (err != cudaSuccess)
        {
            fprintf(stderr, "Failed to copy d_done to host(error code %s)!\n", cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }

        printf("BFS run for level %d\n", current_depth);
        current_depth++;


    }while(h_modified != 0);

    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("Time taken: %f\n", time_spent);

The memcpys and mallocs have all been checked and are correct. My problem is that in the first run of bfs the modified variable never gets set to 1, so the kernel never gets launched a second time. I've thoroughly checked my logic but just can't seem to put my finger on the problem. This is my first complete project in CUDA, so any help would be appreciated. For those who want a complete verifiable example, please use this link:

https://github.com/soumasish/ParallelBreadthFirstSearch

Moffet
  • Run your code with `cuda-memcheck`. You will see that `cuda-memcheck` identifies invalid `__global__` reads emanating from your `bfs` kernel. Then recompile your code with the `-lineinfo` switch added, and follow the directions [here](http://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218#27278218) to identify the specific line of source code in your `bfs` kernel that is generating the invalid global read. Continue to fix your program until `cuda-memcheck` reports no errors. – Robert Crovella Dec 16 '15 at 18:31
  • Add this line of code: `if (vfirst > 1023) {printf("oops!\n"); return;}` immediately after this line of code in your `bfs` kernel: `int vfirst = edges[e].first;` If we ever see "oops" printout, then it means you have an invalid indexing going on, because the size of the `vertices` array is only 1024, right? You can put a similar debug index check after this line of code: `int vsecond = edges[e].second;` to test if `vsecond` will be out-of-range (it is, in some cases.) – Robert Crovella Dec 16 '15 at 18:42
  • You mean I should add this code to the bfs function? But will it print? My understanding is that a CUDA kernel has no I/O capability, so will it run printf? But I'll try right away. – Moffet Dec 16 '15 at 18:50
  • Yes, CUDA provides a special capability to handle `printf` (only) from the kernel. This is covered in the [documentation](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#formatted-output). A principal use of this capability is for quick debug like this. Once you have this identified, you will need to begin to address the question of why `vfirst` and `vsecond` are ending up out-of-range (i.e. higher than 1023) for some threads. – Robert Crovella Dec 16 '15 at 18:52
  • Well, this actually worked. But now I'm even more confused: my data generation code will not write an index greater than 1024, so is the line `int e = blockDim.x * blockIdx.x + threadIdx.x;` doing something wrong? This is the data generation code: https://github.com/soumasish/GenerateDataSet – Moffet Dec 16 '15 at 18:54
  • 1024 is an illegal index for an array of size 1024. That is in fact what is happening here. Replace the index check code with something like this: `if (vfirst > 1023) {printf("oops %d:%d\n", e, vfirst); return;}` Do something similar for `vsecond`. You will see that thread 517 has an (illegal) index of 1024 on `vfirst`, thread 1904 has an index of 1024 on `vsecond`, and thread 2009 has an index of 1024 on `vfirst`. You may wish to review how arrays in C work. – Robert Crovella Dec 16 '15 at 18:59
  • I've fixed the data errors; there are no more out-of-range indices now, but it's still not modifying the modified variable. Also, I'm using a specific number, 85, which is in the data set, and trying to log that number for the relevant thread, but that's not happening either, which I think is why the bfs function is never getting called again. I've modified the BFS code here with the debugging log and also on GitHub. Really appreciate the help. – Moffet Dec 16 '15 at 19:14
  • The only conclusion at this point is that neither of your `if` conditions is being satisfied, since if either one of them were satisfied, then `modified` would be set to 1. That certainly seems to me like another problem in your data setup. Note that in your algorithm description you say: "The vertices array is initialized to -1 for all vertices except the starting vertex which is 0." but in your pictorial example you indicate **just the opposite**: `0 0 0-1 0 0.....0` Which is it supposed to be? All -1 with a single zero for the starting vertex, or all 0 with a single -1? – Robert Crovella Dec 16 '15 at 19:27
  • Ah, I'll amend that now in the question, but it is actually initialized to -1 for all and zero for the starting vertex. I've even copied my edges array back from the device and printed it out to check whether the data set is correct, and it seems correct at this point. – Moffet Dec 16 '15 at 19:38
  • This line of code in main.cu is broken; the ampersand does not belong there: `err = cudaMemcpy(&h_vertices, d_vertices, VERTEX_BYTES, cudaMemcpyDeviceToHost);` Change it to this: `err = cudaMemcpy(h_vertices, d_vertices, VERTEX_BYTES, cudaMemcpyDeviceToHost);` Also, your initial `current_depth` is set to 1, but that cannot possibly work for your initial `vertices` array, which consists of either 0 or -1, correct? – Robert Crovella Dec 16 '15 at 19:51
  • Really? This code *again*? How many accounts do you have anyway? – talonmies Dec 16 '15 at 19:55
  • When I make the above 2 changes to your code (remove `h_vertices` ampersand and initialize `current_depth` to 0 instead of 1) then your code runs for two iterations instead of 1. Hopefully by now you have a sense of how to debug this. Setting `current_depth` initially to 1 prevented either `if` condition in the first pass of your `bfs` kernel from being satisfied, for your initial data set. Your initial data set has a zero vertex at position 85 in the `vertices` array, and therefore the thread that has 85 for `vsecond` (thread 5) initially must trigger to get the ball rolling. – Robert Crovella Dec 16 '15 at 20:05
  • Thanks a ton, it finally worked. – Moffet Dec 16 '15 at 20:16

1 Answer


My problem is that in the first run of BFS the modified variable never gets modified to 1 and thus the function never gets called a second time.

  1. The first problem encountered was out-of-bounds accesses, as indicated by cuda-memcheck. These were traced to the fact that the edges array incorrectly included indices of 1024, which is not valid since the vertices array has size 1024 (valid indices 0-1023). The fix was to modify the edges array so that no values greater than 1023 were present.

  2. The other main problem discovered was that the current_depth variable was initially set to 1 (in main.cu). This prevented either if condition in the bfs kernel from being satisfied. The if conditions:

        if((dfirst == current_depth) &&...
    

    depended on a value (dfirst) retrieved from the vertices array matching the current_depth value. But since the initial population of the vertices array is all 0 or -1 (as stated in the problem description), there was no possibility of satisfying either if condition in the first kernel launch. As a result, the modified variable never got changed to 1, and so no additional kernel launches occurred. The fix is to set current_depth initially to zero in main.cu.

  3. Additionally, the following line of code was observed in main.cu:

    err = cudaMemcpy(&h_vertices, d_vertices, VERTEX_BYTES, cudaMemcpyDeviceToHost);
    

    Since h_vertices is already a pointer, it is not correct to take its address here; the ampersand is out of place. Using that code as-is would be a recipe for stack corruption. The fix is to remove the ampersand. A sketch combining this fix with the current_depth fix from item 2 follows this list.
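
Putting items 2 and 3 together, a minimal sketch of the corrected host-side sequence (variable names follow the question's code; `VERTEX_BYTES` and `h_vertices` come from the linked repository, and error checking is elided for brevity):

    int current_depth = 0;   // item 2: start at 0 so the depth-0 start vertex seeds the frontier
    int h_modified;

    do{
        h_modified = 0;
        cudaMemcpy(d_modified, &h_modified, sizeof(int), cudaMemcpyHostToDevice);

        bfs<<<edgeBlocks, threadsPerBlock>>>(d_edges, d_vertices, current_depth, d_modified);
        cudaDeviceSynchronize();   // non-deprecated equivalent of cudaThreadSynchronize()

        cudaMemcpy(&h_modified, d_modified, sizeof(int), cudaMemcpyDeviceToHost);
        current_depth++;
    }while(h_modified != 0);

    // item 3: h_vertices is already a pointer, so no ampersand when copying back
    err = cudaMemcpy(h_vertices, d_vertices, VERTEX_BYTES, cudaMemcpyDeviceToHost);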

Robert Crovella