Sorting multiple arrays using CUDA/Thrust

Question

I have a large array that I need to sort on the GPU. The array itself is a concatenation of multiple smaller subarrays that satisfy the condition that given i < j, the elements of the subarray i are smaller than the elements of the subarray j. An example of such array would be {5 3 4 2 1 6 9 8 7 10 11}, where the elements of the first subarray of 5 elements are smaller than the elements of the second subarray of 6 elements. The array I need is {1, 2, 3, 4, 5, 6, 7, 10, 11}. I know the position where each subarray starts in the large array.

I know I can simply use thrust::sort on the whole array, but I was wondering if it's possible to launch multiple concurrent sorts, one for each subarray. I'm hoping to get a performance improvement by doing that. My assumption is that it would be faster to sort multiple smaller arrays than one large array with all the elements.

I'd appreciate if someone could give me a way to do that or correct my assumption in case it's wrong.

The way to do multiple concurrent sorts (a "vectorized" sort) in thrust is via the marking of the sub arrays, and providing a custom functor that is an ordinary thrust sort functor that also orders the sub arrays by their key. You can use `thrust::sort_by_key` to help with this but it's not mandatory. However I think its unlikely to give a signficant speed-up over a pure sort, although you can try it. If the sizes of the sub arrays are within certain ranges, I think you're likely to get much better results (performance-wise) with block radix sort in cub, one block per sub-array. — Robert Crovella, Jun 06 '20 at 14:26
A misleading comment, sorry. What I had in mind was back-to-back `stable_sort_by_key` as indicated [here](https://stackoverflow.com/questions/28150098/how-to-use-thrust-to-sort-the-rows-of-a-matrix/28254765#28254765). The advantage of using the sort_by_key method is that it may allow thrust to use the radix-sort fast path. The sort functor method will not, I don't think. Furthermore, the sort_by_key method won't really take advantage of the ordering of the sub-arrays. It works correctly even if the sub-arrays are not ordered, if your intent is to do a parallel (i.e. vectorized) sort. — Robert Crovella, Jun 06 '20 at 15:44
In fact, now that I think about it, neither thrust method I suggested takes advantage of the ordering of the sub-arrays, arguably. They are simply vectorized sorts. — Robert Crovella, Jun 06 '20 at 15:47

Robert Crovella · Accepted Answer · 2020-06-06T16:53:47.030

A way to do multiple concurrent sorts (a "vectorized" sort) in thrust is via the marking of the sub arrays, and providing a custom functor that is an ordinary thrust sort functor that also orders the sub arrays by their key.

Another possible method is to use back-to-back thrust::stable_sort_by_key as described here.

As you have pointed out, another method in your case is just to do an ordinary sort, since that is ultimately your objective.

However I think its unlikely that any of the thrust sort methods will give a signficant speed-up over a pure sort, although you can try it. Thrust has a fast-path radix sort which it will use in certain situations, which the pure sort method could probably use in your case. (In other cases, e.g. when you provide a custom functor, thrust will often use a slower merge-sort method.)

If the sizes of the sub arrays are within certain ranges, I think you're likely to get much better results (performance-wise) with block radix sort in cub, one block per sub-array.

Here is an example that uses specific sizes (since you've given no indication of size ranges and other details), comparing a thrust "pure sort" to a thrust segmented sort with functor, to the cub block sort method. For this particular case, the cub sort is fastest:

$ cat t1.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/scan.h>
#include <thrust/equal.h>
#include <cstdlib>
#include <iostream>


#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

const int num_blocks = 2048;
const int items_per = 4;
const int nTPB = 512;
const int block_size = items_per*nTPB; // must be a whole-number multiple of nTPB;
typedef float mt;

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

struct my_sort_functor
{
        template <typename T, typename T2>
        __host__ __device__
        bool operator()(T t1, T2 t2){
                if (thrust::get<1>(t1) < thrust::get<1>(t2)) return true;
                if (thrust::get<1>(t1) > thrust::get<1>(t2)) return false;
                if (thrust::get<0>(t1) > thrust::get<0>(t2)) return false;
                return true;}
};

// from: https://nvlabs.github.io/cub/example_block_radix_sort_8cu-example.html#_a0
#define CUB_STDERR
#include <stdio.h>
#include <iostream>
#include <algorithm>
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_radix_sort.cuh>
using namespace cub;
//---------------------------------------------------------------------
// Globals, constants and typedefs
//---------------------------------------------------------------------
bool g_verbose = false;
bool g_uniform_keys;
//---------------------------------------------------------------------
// Kernels
//---------------------------------------------------------------------
template <
    typename    Key,
    int         BLOCK_THREADS,
    int         ITEMS_PER_THREAD>
__launch_bounds__ (BLOCK_THREADS)
__global__ void BlockSortKernel(
    Key         *d_in,          // Tile of input
    Key         *d_out)         // Tile of output
{
    enum { TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD };
    // Specialize BlockLoad type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
    typedef BlockLoad<Key, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoadT;
    // Specialize BlockRadixSort type for our thread block
    typedef BlockRadixSort<Key, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
    // Shared memory
    __shared__ union TempStorage
    {
        typename BlockLoadT::TempStorage        load;
        typename BlockRadixSortT::TempStorage   sort;
    } temp_storage;
    // Per-thread tile items
    Key items[ITEMS_PER_THREAD];
    // Our current block's offset
    int block_offset = blockIdx.x * TILE_SIZE;
    // Load items into a blocked arrangement
    BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);
    // Barrier for smem reuse
    __syncthreads();
    // Sort keys
    BlockRadixSortT(temp_storage.sort).SortBlockedToStriped(items);
    // Store output in striped fashion
    StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_out + block_offset, items);
}

int main(){
        const int ds = num_blocks*block_size;
        thrust::host_vector<mt>      data(ds);
        thrust::host_vector<int>     keys(ds);
        for (int i = block_size; i < ds; i+=block_size) keys[i] = 1; // mark beginning of blocks
        thrust::device_vector<int> d_keys = keys;
        for (int i = 0; i < ds; i++) data[i] = (rand()%block_size) + (i/block_size)*block_size;  // populate data
        thrust::device_vector<mt>  d_data = data;
        thrust::inclusive_scan(d_keys.begin(), d_keys.end(), d_keys.begin());  // fill out keys array  000111222...
        thrust::device_vector<mt> d1 = d_data;  // make a copy of unsorted data
        cudaDeviceSynchronize();
        unsigned long long os = dtime_usec(0);
        thrust::sort(d1.begin(), d1.end());  // ordinary sort
        cudaDeviceSynchronize();
        os = dtime_usec(os);
        thrust::device_vector<mt> d2 = d_data;  // make a copy of unsorted data
        cudaDeviceSynchronize();
        unsigned long long ss = dtime_usec(0);
        thrust::sort(thrust::make_zip_iterator(thrust::make_tuple(d2.begin(), d_keys.begin())), thrust::make_zip_iterator(thrust::make_tuple(d2.end(), d_keys.end())), my_sort_functor());
        cudaDeviceSynchronize();
        ss = dtime_usec(ss);
        if (!thrust::equal(d1.begin(), d1.end(), d2.begin())) {std::cout << "oops1" << std::endl; return 0;}
        std::cout << "ordinary thrust sort: " << os/(float)USECPSEC << "s " << "segmented sort: " << ss/(float)USECPSEC << "s" << std::endl;
        thrust::device_vector<mt> d3(ds);
        cudaDeviceSynchronize();
        unsigned long long cs = dtime_usec(0);
        BlockSortKernel<mt, nTPB, items_per><<<num_blocks, nTPB>>>(thrust::raw_pointer_cast(d_data.data()),  thrust::raw_pointer_cast(d3.data()));
        cudaDeviceSynchronize();
        cs = dtime_usec(cs);
        if (!thrust::equal(d1.begin(), d1.end(), d3.begin())) {std::cout << "oops2" << std::endl; return 0;}
        std::cout << "cub sort: " << cs/(float)USECPSEC << "s" << std::endl;
}
$ nvcc -o t1 t1.cu
$ ./t1
ordinary thrust sort: 0.001652s segmented sort: 0.00263s
cub sort: 0.000265s
$

(CUDA 10.2.89, Tesla V100, Ubuntu 18.04)

I have no doubt that your sizes and array dimensions don't correspond to mine. The purpose here is to illustrate some possible methods, not a black-box solution that works for your particular case. You probably should do benchmark comparisons of your own. I also acknowledge that the block radix sort method for cub expects equal-sized sub-arrays, which you may not have. It may not be a suitable method for you, or you may wish to explore some kind of padding arrangement. There's no need to ask this question of me; I won't be able to answer it based on the information in your question.

I don't claim correctness for this code or any other code that I post. Anyone using any code I post does so at their own risk. I merely claim that I have attempted to address the questions in the original posting, and provide some explanation thereof. I am not claiming my code is defect-free, or that it is suitable for any particular purpose. Use it (or not) at your own risk.

Sorting multiple arrays using CUDA/Thrust

1 Answers1