
This is the sequential piece of code I am trying to parallelize in CUDA:

/*
    Sequential (Single Thread) APSP on CPU.
*/
void floyd_sequential(int *mat, const size_t N)
{
    for(int k = 0; k < N; k ++)
        for(int i = 0; i < N; i ++)
            for(int j = 0; j < N; j ++)
            {
                int i0 = i*N + j;
                int i1 = i*N + k;
                int i2 = k*N + j;
                if(mat[i1] != -1 && mat[i2] != -1)
                    mat[i0] = (mat[i0] != -1 && mat[i0] < mat[i1] + mat[i2]) ?
                      mat[i0] : (mat[i1] + mat[i2]);
            }
}

This is my CUDA implementation:

// ParallelComputing.cpp : Defines the entry point for the console application.
//

#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>


#define DIMENSION 10;
__global__ void gpu_Floyd(int *result, int N)
{
    int j,k;
    int Row = blockIdx.y * blockDim.y + threadIdx.y;

    for(k = 0; k < N; k++)
    {
        for(j = 0; j < N; j++)
        {
            int i0 = Row * N + j;  
            int i1 = Row * N + k;
            int i2 = k * N + j;
            if(result[i0] != -1 && result[i2] != -1)
                    result[i0] = (result[i0] != -1 && result[i0] < result[i1] + result[i2]) ?
                      result[i0] : (result[i1] + result[i2]);
            __syncthreads();
        }
    }
}

void GenMatrix(int *mat, const size_t N)
{
    for(int i = 0; i < N*N; i ++)
        mat[i] = rand()%32 - 1;

}

bool CmpArray(const int *l, const int *r, const size_t eleNum)
{
    for(int i = 0; i < eleNum; i ++)
        if(l[i] != r[i])
        {
            printf("ERROR: l[%d] = %d, r[%d] = %d\n", i, l[i], i, r[i]);
            return false;
        }
    return true;
}

int main(int argc, char **argv)
{

// generate a random matrix.
size_t N = 10;
int *mat = (int*)malloc(sizeof(int)*N*N);
GenMatrix(mat, N);

// compute the reference result.
int *ref = (int*)malloc(sizeof(int)*N*N);
memcpy(ref, mat, sizeof(int)*N*N);
Floyd_sequential(ref, N);


//CUDA Portion
int Grid_Dim_x = 1, Grid_Dim_y = 1;
int noThreads_x, noThreads_y;
int *result = (int*)malloc(sizeof(int)*N*N);
memcpy(result, mat, sizeof(int)*N*N);
int *d_result;

// compute your results

cudaMalloc((void **)&d_result, N*N);

cudaMemcpy(result, N * N, cudaMemcpyHostToDevice);
gpu_Floyd<<<1024, 256>>>(d_result, N);
cudaMemcpy(result, d_result, cudaMemcpyDeviceToHost);

// compare your result with reference result
if(CmpArray(result, ref, N*N))
    printf("The matrix matches.\n");
else
    printf("The matrix do not match.\n");

free(ref);
free(result);
cudaFree(d_result);
}

However, my output always shows the matrices do not match.

I understand that in CUDA the usual approach is to map each element of the matrix to a thread. However, I am trying to explore the possibility of mapping each row of the matrix to a thread instead.

  • Dare one ask what Floyd is? You say your results are incorrect, but you haven't explained in what way. Your code is also incomplete, so someone else couldn't run it and reproduce your results. It would be good if you could fix these things; otherwise I fail to see how someone would be able to provide you with a useful answer. – talonmies Nov 08 '13 at 14:51
  • Probably you should provide a full reproducer; in fact SO expects this. It's possible that your usage of `__syncthreads()` is flawed. Your code here certainly has the potential for race conditions, because the code is updating the same matrix that it is using for input. Oftentimes new CUDA programmers mistakenly believe that `__syncthreads()` will act as a barrier for *all* threads. It does not. It only acts as a barrier for the threads in a block. You haven't shown the rest of your code, so I've no idea of your launch config, but this is one possibility. Voting to close. – Robert Crovella Nov 08 '13 at 16:18
  • Reduce `Row * N` and other duplicate calculations. This improves readability a lot, and maybe performance in case it is not optimized by the compiler. I have no intention of looking at this further since I lose track by the second repetition. Good old advice is always to execute your algorithm on paper: simulate with a few threads and blocks and check what will happen. – djmj Nov 09 '13 at 02:36
  • I don't think you understand the basic CUDA concepts; I guess you need a fresh start. [Here](http://docs.nvidia.com/cuda/cuda-c-programming-guide/) is the link to the CUDA programming guide. I hope you have got the sample code with the SDK. I would suggest you read the basics properly before you start programming. The above program doesn't even compile. cudaMemcpy takes 4 parameters: device pointer, host pointer, size in bytes (not N * N; it should be N * N * sizeof(int)), and the direction of the copy. – Sagar Masuti Nov 09 '13 at 12:48
  • [This code](https://github.com/OlegKonings/CUDA_Floyd_Warshall_) may be of interest – Robert Crovella May 03 '14 at 13:55

2 Answers


As has already been mentioned, your provided GPU code does not compile, so I'm curious how you got to the observation that your output matrices do not match. Here are some of the problems with your code:

  • cudaMalloc, just like malloc, allocates bytes, so this is not correct:

    cudaMalloc((void **)&d_result, N*N);
    

instead you want this:

    cudaMalloc((void **)&d_result, N*N*sizeof(int));
  • likewise cudaMemcpy, just like memcpy, operates on bytes; furthermore, cudaMemcpy requires 4 parameters, so this is not correct:

    cudaMemcpy(result, N * N, cudaMemcpyHostToDevice);
    

instead you probably want this:

    cudaMemcpy(d_result, result, N * N*sizeof(int), cudaMemcpyHostToDevice);

and your other cudaMemcpy line needs to be fixed similarly.
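i.e. something along these lines for the copy back after the kernel (a sketch using the pointers from your code):

    cudaMemcpy(result, d_result, N * N * sizeof(int), cudaMemcpyDeviceToHost);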

  • I'd also advise doing proper CUDA error checking on every API call and kernel launch.
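A minimal sketch of one common error-checking pattern (the names gpuErrchk and gpuAssert are just a convention, not part of the CUDA API):

    // --- Wrap every CUDA runtime call; print the error string and bail out on failure
    //     (requires <stdio.h> and <stdlib.h>, which your program already includes)
    #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
    inline void gpuAssert(cudaError_t code, const char *file, int line)
    {
        if (code != cudaSuccess)
        {
            fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
            exit(code);
        }
    }

    // usage:
    gpuErrchk(cudaMalloc((void **)&d_result, N * N * sizeof(int)));
    gpuErrchk(cudaMemcpy(d_result, result, N * N * sizeof(int), cudaMemcpyHostToDevice));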
  • Your kernel is written as if it's expecting a 2-dimensional thread array, or at least a 1-dimensional one in y, whereas you are launching a 1-dimensional grid in x:

    gpu_Floyd<<<1024, 256>>>(d_result, N);
    

therefore all your kernel built-in variables in y will always be 1 or 0, and this line of code:

        int Row = blockIdx.y * blockDim.y + threadIdx.y;

will evaluate to zero for all threads in your 1-D grid in x.
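One way to make the y indexing meaningful (a sketch that keeps the one-thread-per-row kernel from your question; the race-condition issue below still applies) would be:

    dim3 dimBlock(1, 256);                               // threads laid out along y
    dim3 dimGrid(1, (N + dimBlock.y - 1) / dimBlock.y);  // enough blocks in y to cover all N rows
    gpu_Floyd<<<dimGrid, dimBlock>>>(d_result, N);

The kernel would then also need an `if (Row < N)` guard, since the launch rounds the number of threads in y up to a multiple of the block size.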

  • Your gpu kernel is putting the results in the same matrix as the input data. For sequential code this may or may not matter, but for code that is intended to run in parallel, it can often lead to race conditions, because the order of operations (i.e. order of thread execution) is largely undefined.
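One common way to avoid the synchronization and ordering problems (a sketch, not the only possible fix, and it maps one thread per matrix element rather than one per row) is to move the k loop to the host and launch one kernel per value of k. Successive launches on the same stream execute in order, so each launch sees the completed results of the previous k iteration:

    __global__ void gpu_Floyd_step(int *mat, int k, int N)
    {
        int Row = blockIdx.y * blockDim.y + threadIdx.y;
        int Col = blockIdx.x * blockDim.x + threadIdx.x;
        if (Row >= N || Col >= N) return;

        int i0 = Row * N + Col;   // element being updated
        int i1 = Row * N + k;     // element (Row, k)
        int i2 = k * N + Col;     // element (k, Col)
        if (mat[i1] != -1 && mat[i2] != -1)
            mat[i0] = (mat[i0] != -1 && mat[i0] < mat[i1] + mat[i2]) ?
                       mat[i0] : (mat[i1] + mat[i2]);
    }

    // host side: one launch per k
    dim3 threads(16, 16);
    dim3 blocks((N + threads.x - 1) / threads.x, (N + threads.y - 1) / threads.y);
    for (int k = 0; k < N; k++)
        gpu_Floyd_step<<<blocks, threads>>>(d_result, k, N);

The answer below uses the same one-launch-per-k structure.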
Robert Crovella

Below you will find a canonical, simple implementation of the Floyd-Warshall algorithm in CUDA.

The CUDA code is accompanied by a sequential implementation, and both are based on the simplifying assumption that the edge weights are non-negative. The full minimum-distance paths are also reconstructed in both cases. Despite the simplifying assumption, it should be possible to grasp the relevant parallelization idea, namely that a two-dimensional thread grid is exploited: each thread along x is assigned to a matrix column, while each block along y is assigned to a matrix row. In this way, the threadIdx.x == 0 thread of each block loads the element (row, k) into shared memory, so that, collectively, the blocks load the k-th column of the matrix.

// --- Assumption: graph with positive edges

#include <stdio.h>
#include <string>
#include <map>
#include <iostream>
#include <fstream>

#include "Utilities.cuh"

#define BLOCKSIZE   256

using namespace std;

map<string, int> nameToNum;                 // --- names of vertices
map<string, map<string, int>> weightMap;    // --- weights of edges 

/************************/
/* READ GRAPH FROM FILE */
/************************/
int *readGraphFromFile(int &N, char *fileName) {

    string vertex1, vertex2;                
    ifstream graphFile;
    int currentWeight;          
    N = 0;                                              // --- Init the number of found vertices
    graphFile.open(fileName);                           // --- Open the graph file

    graphFile >> vertex1;                               // --- Read first vertex
    while(vertex1 != "--END--") {                       // --- Loop until the "--END--" marker is found
        graphFile >> vertex2;                           // --- Read second vertex
        graphFile >> currentWeight;                     // --- Read weight between first and second vertex
        if (nameToNum.count(vertex1) == 0) {            // --- If vertex has not yet been added ...
            nameToNum[vertex1] = N;                     //     assign a progressive number to the vertex
            weightMap[vertex1][vertex1] = 0;            //     assign a zero weight to the "self-edge"
            N++;                                        // --- Update the found number of vertices
        }
        if (nameToNum.count(vertex2) == 0) {
            nameToNum[vertex2] = N;
            weightMap[vertex2][vertex2] = 0;
            N++;
        }
        weightMap[vertex1][vertex2] = currentWeight;    // --- Update weight between vertices 1 and 2
        graphFile >> vertex1;
    }    
    graphFile.close();                                  // --- Close the graph file

    // --- Construct the array
    int *weightMatrix = (int*) malloc(N * N * sizeof(int));
    // --- Loop over all the vertex couples in the weights matrix
    for (int ii = 0; ii < N; ii++)                      
        for (int jj = 0; jj < N; jj++)
            weightMatrix[ii * N + jj] = INT_MAX / 2;    // --- Init the weights matrix elements to infinity
    map<string, int>::iterator i, j;
    // --- Loop over all the vertex couples in the map
    //     (*i).first and (*j).first are the vertex names (the keys of the map), while (*i).second and (*j).second are their corresponding indices
    for (i = nameToNum.begin(); i != nameToNum.end(); ++i)
        for (j = nameToNum.begin(); j != nameToNum.end(); ++j) {
            // --- If there is a connection between vertices (*i).first and (*j).first, then update the weight matrix
            if (weightMap[(*i).first].count((*j).first) != 0) 
                weightMatrix[N * (*i).second + (*j).second] = weightMap[(*i).first][(*j).first];
        }
    return weightMatrix;
}

/************************************/
/* PRINT MINIMUM DISTANCES FUNCTION */
/************************************/
void printMinimumDistances(int N, int *a) {

    map<string, int>::iterator i;

    // --- Prints all the node labels at the first row
    for (i = nameToNum.begin(); i != nameToNum.end(); ++i) printf("\t%s", i->first.c_str());

    printf("\n");

    i = nameToNum.begin();

    // --- Loop over the rows
    for (int p = 0; p < N; p++) {

        printf("%s\t", i -> first.c_str());

        // --- Loop over the columns
        for (int q = 0; q < N; q++) {
            int dd =  a[p * N + q];
            if (dd != INT_MAX / 2) printf("%d\t", dd);
            else printf("--\t");
        }

        printf("\n");

        i++;
    }
}

void printPathRecursive(int row, int col, int *minimumDistances, int *path, int N) {
    map<string, int>::iterator i = nameToNum.begin();
    map<string, int>::iterator j = nameToNum.begin();
    if (row == col) {advance(i, row); printf("%s\t", i -> first.c_str()); }
    else {
        if (path[row * N + col] == INT_MAX / 2) printf("%d %d %d No path exists\t\n", minimumDistances[row * N + col], row, col);
        else {
            printPathRecursive(row, path[row * N + col], minimumDistances, path, N);
            advance(j, col);
            printf("%s\t", j -> first.c_str());
        }
    }
}

void printPath(int N, int *minimumDistances, int *path) {

    map<string, int>::iterator i;
    map<string, int>::iterator j;

    // --- Loop over the rows
    i = nameToNum.begin();
    for (int p = 0; p < N; p++) {

        // --- Loop over the columns
        j = nameToNum.begin();
        for (int q = 0; q < N; q++) {
            printf("From %s to %s\t", i -> first.c_str(), j -> first.c_str());
            printPathRecursive(p, q, minimumDistances, path, N);
            printf("\n");
            j++;
        }

        i++;
    }
}

/**********************/
/* FLOYD-WARSHALL CPU */
/**********************/
void h_FloydWarshall(int *h_graphMinimumDistances, int *h_graphPath, const int N) {
    for (int k = 0; k < N; k++)
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++) {
                if (h_graphMinimumDistances[row * N + col] > (h_graphMinimumDistances[row * N + k] + h_graphMinimumDistances[k * N + col])) {
                    h_graphMinimumDistances[row * N + col] = (h_graphMinimumDistances[row * N + k] + h_graphMinimumDistances[k * N + col]);
                    h_graphPath[row * N + col] = h_graphPath[k * N + col];
                }
            }
}

/*************************/
/* FLOYD-WARSHALL KERNEL */
/*************************/
__global__ void d_FloydWarshall(int k, int *d_graphMinimumDistances, int *d_graphPath, int N) {

    int col = blockIdx.x * blockDim.x + threadIdx.x;    // --- Each thread along x is assigned to a matrix column
    int row = blockIdx.y;                               // --- Each block along y is assigned to a matrix row

    if (col >= N) return;

    int arrayIndex = N * row + col;             

    // --- The threadIdx.x == 0 thread of each block loads element (row, k) into shared memory;
    //     collectively, the blocks load the entire k-th column
    __shared__ int d_graphMinimumDistances_row_k;
    if(threadIdx.x == 0) d_graphMinimumDistances_row_k = d_graphMinimumDistances[N * row + k];
    __syncthreads();

    if (d_graphMinimumDistances_row_k == INT_MAX / 2)   // --- If element (row, k) = infinity, no update is needed
        return;

    int d_graphMinimumDistances_k_col = d_graphMinimumDistances[k * N + col];
    if (d_graphMinimumDistances_k_col == INT_MAX / 2)   // --- If element (k, col) = infinity, no update is needed
        return;

    int candidateBetterDistance = d_graphMinimumDistances_row_k + d_graphMinimumDistances_k_col;
    if (candidateBetterDistance < d_graphMinimumDistances[arrayIndex]) {
        d_graphMinimumDistances[arrayIndex] = candidateBetterDistance;
        d_graphPath[arrayIndex] = d_graphPath[k * N + col];
    }
}

/********/
/* MAIN */
/********/
int main() {

    int N = 0;                  // --- Number of vertices

    // --- Read graph array from file
    int *h_graphArray = readGraphFromFile(N, "graph2.txt");     
    printf("\n******************\n");
    printf("* Original graph *\n");
    printf("******************\n");
    printMinimumDistances(N, h_graphArray);

    // --- Floyd-Warshall on CPU
    int *h_graphMinimumDistances = (int *) malloc(N * N * sizeof(int));
    int *h_graphPath             = (int *) malloc(N * N * sizeof(int));
    memcpy(h_graphMinimumDistances, h_graphArray, N * N * sizeof(int));
    for (int k = 0; k < N; k++) 
        for (int l = 0; l < N; l++) 
            if (h_graphArray[k * N + l] == INT_MAX / 2) h_graphPath[k * N + l] = INT_MAX / 2;
            else h_graphPath[k * N + l] = k;

    h_FloydWarshall(h_graphMinimumDistances, h_graphPath, N);
    printf("\n*************************\n");
    printf("* CPU result: distances *\n");
    printf("*************************\n");
    printMinimumDistances(N, h_graphMinimumDistances);
    printf("\n********************\n");
    printf("* CPU result: path *\n");
    printf("********************\n");
    printPath(N, h_graphMinimumDistances, h_graphPath);

    // --- Graph array device allocation and host-device memory transfer
    int *d_graphMinimumDistances;   gpuErrchk(cudaMalloc(&d_graphMinimumDistances, N * N * sizeof(int)));
    gpuErrchk(cudaMemcpy(d_graphMinimumDistances, h_graphArray, N * N * sizeof(int), cudaMemcpyHostToDevice));
    int *d_graphPath;               gpuErrchk(cudaMalloc(&d_graphPath, N * N * sizeof(int)));
    for (int k = 0; k < N; k++) 
        for (int l = 0; l < N; l++) 
            if (h_graphArray[k * N + l] == INT_MAX / 2) h_graphPath[k * N + l] = INT_MAX / 2;
            else h_graphPath[k * N + l] = k;
    gpuErrchk(cudaMemcpy(d_graphPath, h_graphPath, N * N * sizeof(int), cudaMemcpyHostToDevice));

    // --- Iterations
    for (int k = 0; k < N; k++) {
        d_FloydWarshall <<<dim3(iDivUp(N, BLOCKSIZE), N), BLOCKSIZE>>>(k, d_graphMinimumDistances, d_graphPath, N);
#ifdef DEBUG
        gpuErrchk(cudaPeekAtLastError());
        gpuErrchk(cudaDeviceSynchronize());
#endif
    }

    // --- Copy results back to the host
    gpuErrchk(cudaMemcpy(h_graphMinimumDistances, d_graphMinimumDistances, N * N * sizeof(int), cudaMemcpyDeviceToHost));
    gpuErrchk(cudaMemcpy(h_graphPath, d_graphPath, N * N * sizeof(int), cudaMemcpyDeviceToHost));
    printf("\n**************\n");
    printf("* GPU result *\n");
    printf("**************\n");
    printMinimumDistances(N, h_graphMinimumDistances);
    printf("\n********************\n");
    printf("* GPU result: path *\n");
    printf("********************\n");
    printPath(N, h_graphMinimumDistances, h_graphPath);

}
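For reference, readGraphFromFile above expects a whitespace-separated edge list (first vertex name, second vertex name, integer weight) terminated by the literal token --END--; a hypothetical graph2.txt might look like this:

    A B 4
    A C 1
    C B 2
    B D 5
    --END--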
Vitality