
I would like to know if thrust::sort() can be used inside a CUDA kernel thread.

__global__
void mykernel(float* array, int arrayLength)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    // array is a vector in device global memory
    // is it possible to call thrust::sort() inside the thread?
    thrust::sort(array, array+arrayLength);
    // do something else with the array
}

If so, does the sort launch other kernels to parallelize the work?

Boraxis
  • Possible duplicate of [CUDA: how to use thrust::sort_by_key directly on the GPU?](http://stackoverflow.com/questions/15609126/cuda-how-to-use-thrustsort-by-key-directly-on-the-gpu) or [Thrust inside user written kernels](http://stackoverflow.com/questions/5510715/thrust-inside-user-written-kernels) – aland May 01 '14 at 08:05
  • My question is different; I'm launching a kernel with multiple threads, and inside each thread I need to sort a vector that is accessible in global/shared memory. – Boraxis May 01 '14 at 20:16

2 Answers


Yes, thrust::sort can be combined with the thrust::seq execution policy to sort numbers sequentially within a single CUDA thread (or sequentially within a single CPU thread):

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

__global__
void mykernel(float* array, int arrayLength)
{
  int threadID = blockIdx.x * blockDim.x + threadIdx.x;

  // each thread sorts array
  // XXX note this causes a data race
  thrust::sort(thrust::seq, array, array + arrayLength);
}

Note that your example causes a data race because each CUDA thread attempts to sort the same data in parallel. A correct, race-free program would partition array according to thread index, for example as sketched below.
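
Here is a minimal sketch of one possible partitioning, assuming each thread owns a contiguous segment of segmentSize elements (the kernel name and the segmentSize parameter are illustrative, not from the original answer):

__global__
void mykernel_partitioned(float* array, int arrayLength, int segmentSize)
{
  int threadID = blockIdx.x * blockDim.x + threadIdx.x;

  // hypothetical layout: thread i owns elements [i*segmentSize, (i+1)*segmentSize)
  int begin = threadID * segmentSize;
  int end   = min(begin + segmentSize, arrayLength);

  if (begin < arrayLength)
  {
    // each thread sorts only its own segment, so there is no data race
    thrust::sort(thrust::seq, array + begin, array + end);
  }
}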

The thrust::seq execution policy, which is required for this feature, is only available in Thrust v1.8 or later; a compile-time version check is sketched below.
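
If it helps, a minimal compile-time guard based on Thrust's THRUST_VERSION macro (defined in thrust/version.h) could look like this:

#include <thrust/version.h>

// THRUST_VERSION = major * 100000 + minor * 100 + subminor, so v1.8.0 is 100800
#if THRUST_VERSION < 100800
#error "thrust::seq requires Thrust v1.8 or later"
#endif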

Jared Hoberock
  • That is correct, I would have to partition the array according to the thread index. Will this cause thread divergence? I would typically have to sort a portion of the array of 1,000 doubles (floats). – Boraxis May 09 '14 at 20:35
  • Are you sure that the minimum Thrust version is 1.7? I'm trying to compile an adaptation of your code on both CUDA 5.5 (which comes with Thrust v1.7) and 6.0 (which comes with Thrust v1.701), but with no success. Those versions of Thrust seem to lack the `/thrust/detail/seq.h` file required by `/thrust/execution_policy.h`. Conversely, I have installed Thrust v1.8, which does have the `/thrust/detail/seq.h` file, and I'm able to compile my adaptation successfully. Would you be so kind as to clarify this point for me? – Vitality Jun 04 '14 at 20:56

@aland already referred you to an earlier answer about calling Thrust's parallel algorithms on the GPU. In that case the asker was simply trying to sort data that was already on the GPU; Thrust called from the CPU can handle GPU-resident data by wrapping the raw device pointers, as sketched below.
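
For reference, a minimal sketch of that host-side approach, assuming d_array already points to device memory (the function name sort_on_device is illustrative):

#include <thrust/sort.h>
#include <thrust/device_ptr.h>

// host-side function: wrap the raw device pointer so Thrust's parallel sort
// runs on data that is already resident on the GPU
void sort_on_device(float* d_array, int n)
{
  thrust::device_ptr<float> begin = thrust::device_pointer_cast(d_array);
  thrust::sort(begin, begin + n);
}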

Assuming your question is different and you really do want to call a parallel sort in the middle of your kernel (as opposed to breaking the kernel into multiple smaller kernels and calling sort in between), then you should consider CUB, which provides a variety of primitives suitable for your purposes; one possible shape of a block-level sort is sketched below.
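
For illustration only, a hedged sketch of what a block-cooperative sort with cub::BlockRadixSort might look like; the kernel name, template parameters, and launch configuration are assumptions, not taken from the original answer:

#include <cub/cub.cuh>

template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void block_sort_kernel(float* d_in, float* d_out)
{
  // block-wide radix sort over BLOCK_THREADS * ITEMS_PER_THREAD keys
  typedef cub::BlockRadixSort<float, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSort;
  __shared__ typename BlockRadixSort::TempStorage temp_storage;

  // each thread loads its items from the segment owned by this block
  float items[ITEMS_PER_THREAD];
  int block_offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;
  cub::LoadDirectBlocked(threadIdx.x, d_in + block_offset, items);

  // threads in the block cooperate to sort all of the block's items
  BlockRadixSort(temp_storage).Sort(items);

  cub::StoreDirectBlocked(threadIdx.x, d_out + block_offset, items);
}

// example launch (illustrative): 128 threads x 8 items = 1024 keys per block
// block_sort_kernel<128, 8><<<numBlocks, 128>>>(d_in, d_out);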

Update: Also see @Jared's answer, in which he explains that you can call Thrust's sequential algorithms from within a kernel as of Thrust 1.8.

Tom