I have been searching around a bit and I think the following is a candidate answer to your question.
Quoting N. Bell and J. Hoberock, "Thrust: a productivity-oriented library for CUDA", in GPU Computing Gems Jade Edition:
Thrust statically selects a highly-optimized Radix Sort algorithm for sorting primitive types (`char`, `int`, `float` and `double`) with the standard `less` and `greater` comparison operators. For all other types (e.g., user-defined data types) and comparison operators, Thrust uses a general Merge Sort algorithm. Because sorting primitives with Radix Sort is considerably faster than Merge Sort, this static optimization has significant value.
Now, Merge Sort requires O(N) memory space; see Space requirements of a merge-sort.
Furthermore, Radix Sort also requires O(N) memory space; see Comparison of Bucket Sort and RADIX Sort.
Which of the two consumes more memory is not fixed in general: it depends on the input sequence to be sorted as well as on algorithm tuning parameters; see the comments to one of the answers to Why quicksort is more popular than radix-sort?.
In contrast, Quick Sort requires only O(log N) memory space when performed in-place (for the recursion stack); otherwise it needs O(N). For a CUDA implementation of the Quick Sort algorithm, you may have a look at How Tesla K20 Speeds Quicksort.
For other in-place sorting algorithms (the in-place strategy is worth exploring, since it saves memory compared to the non-in-place counterparts), have a look at Bitonic Sort; see Fast in-place sorting with CUDA based on bitonic sort.