Sorting algorithm with Cuda. Inside or outside kernels?

Question

I have a matrix of size 50000x100 and I need to sort each row using Cuda in C++. My architecture is a K80 NVidia card.

Since the number of columns is small, I am currently running the sorting algorithm inside a kernel. I am using a modified bubble algorithm that runs on all lines of the matrix.

I am wondering if there is an more efficient way to proceed. I tried to use thrust::sort inside my kernel but it is much slower. I also tried a merge sort algorithm but the recursive part of the algorithm didn't work inside my kernel.

==edit==

here is my kernel:

__global__ void computeQuantilesKernel(float *matIn, int nRows, int nCols, int nQuantiles, float *outsideValues, float *quantilesAve, int param2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float values[100];//big enough for 100 columns
    int keys[100];
    int nQuant[100];//big enough for 100 quantiles (percentiles)
    float thisQuantile[100];
    int quant;

    if (idx >= nRows) return;

    //read matIn from global memory
    for (int i = 0; i < nCols; i++)
    {
        values[i] = matIn[idx * nCols + i + param2 * nCols * nRows];
        keys[i] = i;
    }

    //bubble Sort:
    int i, j;
    int temp;
    float tempVal;

    for (i = 0; i < nCols - 1; i++)
    {
        for (j = 0; j < nCols - i - 1; j++)
        {
            if (values[j + 1] < values[j])      // ascending order simply changes to <
            {
                tempVal = values[j];             // swap elements
                temp = keys[j];             // swap elements
                values[j] = values[j + 1];
                keys[j] = keys[j + 1];
                values[j + 1] = tempVal;
                keys[j + 1] = temp;
            }
        }
    }
    //end of bubble sort

    //reset nQuant and thisQuantile
    for (int iQuant = 0; iQuant < nQuantiles; iQuant++)
    {
        nQuant[iQuant] = 0;
        thisQuantile[iQuant] = 0;
    }

    //Compute sum of outsideValues for each quantile
    for (int i = 0; i < nCols; i++)
    {
        quant = (int)(((float)i + 0.5) / ((float)nCols / (float)nQuantiles));//quantile like Matlab
        nQuant[quant]++;
        thisQuantile[quant] += outsideValues[idx * nCols + keys[i]];
    }

    //Divide by the size of each quantile to get averages
    for (int iQuant = 0; iQuant < nQuantiles; iQuant++)
    {
        quantilesAve[idx + nRows * iQuant + param2 * nQuantiles * nRows] = thisQuantile[iQuant] / (float)nQuant[iQuant];
    }
}

please post a [mcve] as a baseline that solutions can be compared against — m.s., Mar 06 '17 at 08:21
try a [vectorized sort](http://stackoverflow.com/questions/28150098/how-to-use-thrust-to-sort-the-rows-of-a-matrix). I think your question could arguably be considered a duplicate of that one. — Robert Crovella, Mar 06 '17 at 15:46
@Robert, thank you very much for the link. I am sorry I missed it before posting. — Philippe Huber, Mar 06 '17 at 19:20

CygnusX1 · Accepted Answer · 2017-03-07T22:35:06.727

Your code as it stands uses a single thread to handle each of your rows separately. As a result you are starving for quick scratch memory (registers, L1 cache, shared memory). You are allocating at least 1600 bytes per each thread - that is a lot! You want to stay at around 128 bytes per thread (32 registers of 32 bits each). Secondly, you are using local arrays addressable at run-time -- those arrays will be spilled into local memory, trash your L1 cache and end up in global memory again (1600B x 32 threads gives 51KB, which is already at or above the limits of shmem/L1).

For that reason I would suggest handling a single row per block of 64 or 128 threads instead, and keep the row you sort in shared memory. Bubble sort is actually very easy to implement in parallel:

__shared__ float values[nCols];
... load the data ...
__syncthreads();
for (int i = 0; i < nCols/2; i++)
{
    int j = threadIdx.x;
    if (j % 2 == 0 && j<nCols-1)
        if (values[j+1] < values[j])
            swap(values[j+1], values[j]);
    __syncthreads();
    if (j % 2 == 1 && j<nCols-1)
        if (values[j+1] < values[j])
            swap(values[j+1], values[j]);
    __syncthreads();
}

Notice how your inner for j = ... loop is replaced by threadIdx, but the core idea of the algorithm stays the same. In each iteration I perform bubble swap first only on even pairs and then only on odd pairs to avoid parallel conflicts.

I assume that nCols is lower than the dimension of your block, which for 100 elements is easily achievable.

There are many ways that the above code can be improved further, for example

Cut the thread count in half and assume j=threadIdx.x*2 for the first half of the loop, and j=threadIdx.x*2+1 for the second half. This way no thread stays idle.
Use only 32 threads, each handling two values of j sequentially. This way your problem will fit a single warp, allowing you to drop __syncthreads() altogether. With 32 threads, you might be able to use warp shuffle intrinsics.
Experiment with #pragma unroll, although the amount of produce code may be unfeasible. Profiling will help.

Also consider experimenting with hardcoded merge sort instead of bubble sort. If my memory serves me right, when I implemented a warp-sized bubble sort and merge-sort with all loops unrolled, merge sort performed almost twice as fast as bubble sort. Note, it was several years ago, on the first generation of CUDA-capable cards.

This is a fantastic answer. I will try to implement it tomorrow and see if I understand everything you wrote. — Philippe Huber, Mar 31 '17 at 20:03
After trying your suggestions, I rewrote my kernel to handle a single row per blocks of 32 threads in which I sequentially run 4 columns. I changed my mind with the hard coded sorting algorithm and went for CUB's blockRadixSort. Altogether, I could lower the computing time by 60%. Once again, @CygnusX1, thank you very much. You made me learn a few things that will be very helpful in the future. — Philippe Huber, Apr 05 '17 at 12:32

Sorting algorithm with Cuda. Inside or outside kernels?

1 Answers1