
In CUDA/Thrust: trying to `sort_by_key` 2.8 GB of data on a GPU with 6 GB of RAM throws `bad_alloc`. I have read that `sort_by_key` itself consumed most of the memory for the test case considered there.

Is there an alternative that does exactly what `sort_by_key` does, even if it is a little slower, but that can sort bigger datasets?

  • I do not have a deep knowledge of sorting algorithms on the GPU. I remember that [Chapter 46 of GPU Gems 2](http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter46.html) discusses different sorting algorithms. Perhaps it will give you some useful information on alternative, less performant but memory-saving approaches. – Vitality Jun 30 '14 at 16:49
  • As you've discovered, thrust lets you use about 40% of available GPU memory for sort data. I looked at some other approaches like [CUB](http://nvlabs.github.io/cub/structcub_1_1_device_radix_sort.html), but I don't think you'll be able to exceed 50% with any other approach that I'm aware of. You could sort by chunks on the device, then transfer each chunk to the host and do a `thrust::merge_by_key` there, which might be less costly than doing a full sort on the host, but the host<->device transfer times would probably make that approach uninteresting as well (a sketch of this chunk-and-merge idea appears after these comments). – Robert Crovella Jun 30 '14 at 20:56
  • Do you have any a priori information that you could exploit in your sorting procedure? – Vitality Jun 30 '14 at 21:31
  • Thanks a lot, Jack and Robert! :) – Walter white Jun 30 '14 at 21:40
  • Jack, sorry, I am not sure I understand your question. I just try to sort my vector with `thrust::sort_by_key(Keys.begin(), Keys.end(), tempVector.begin());`. It crashed when I tried to sort 1.8 GB on a 6.0 GB Titan node. `Keys` and `tempVector` have the same size. I can try the approach you posted, or CUB from Robert, and check whether I can sort a bit more data. – Walter white Jun 30 '14 at 21:50
  • What I'm trying to say is that library sorting algorithms must be general, that is, they must work for arbitrary arrays. By contrast, if you have any information on the vector you have to sort (for example, that it is partially sorted), you may be able to trade generality for efficiency: you could set up an algorithm that works only for a certain class of arrays but saves some temporary memory allocation. – Vitality Jul 01 '14 at 05:25
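
Here is a minimal sketch of the chunk-and-merge idea from Robert Crovella's comment, assuming `int` keys and values; the function name `chunked_sort_by_key` and the chunk-size parameter are illustrative, and the chunk size must be chosen so that one chunk's key and value device vectors fit in GPU memory:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/merge.h>
#include <thrust/copy.h>
#include <algorithm>
#include <vector>

// Sort keys/values that do not fit in device memory: sort fixed-size
// chunks on the GPU, then merge the sorted chunks on the host with
// thrust::merge_by_key.
void chunked_sort_by_key(std::vector<int>& keys, std::vector<int>& vals,
                         size_t chunk /* elements per device pass */)
{
    const size_t n = keys.size();

    // Pass 1: sort each chunk on the device.
    for (size_t off = 0; off < n; off += chunk) {
        const size_t len = std::min(chunk, n - off);
        thrust::device_vector<int> dk(keys.begin() + off, keys.begin() + off + len);
        thrust::device_vector<int> dv(vals.begin() + off, vals.begin() + off + len);
        thrust::sort_by_key(dk.begin(), dk.end(), dv.begin());
        thrust::copy(dk.begin(), dk.end(), keys.begin() + off);
        thrust::copy(dv.begin(), dv.end(), vals.begin() + off);
    }

    // Pass 2: merge the sorted chunks pairwise on the host.
    std::vector<int> mk(n), mv(n);
    for (size_t w = chunk; w < n; w *= 2) {
        for (size_t off = 0; off < n; off += 2 * w) {
            const size_t mid = std::min(off + w, n);
            const size_t end = std::min(off + 2 * w, n);
            thrust::merge_by_key(keys.begin() + off, keys.begin() + mid,
                                 keys.begin() + mid, keys.begin() + end,
                                 vals.begin() + off, vals.begin() + mid,
                                 mk.begin() + off,  mv.begin() + off);
        }
        keys.swap(mk);
        vals.swap(mv);
    }
}
```

The GPU never holds more than one chunk at a time; the host-side `thrust::merge_by_key` passes (dispatched to the host backend because the iterators are `std::vector` iterators) then combine the sorted runs. As Robert notes, the transfer times may make this unattractive in practice.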

1 Answer


I have been searching around a bit and I think the following is a candidate answer to your question.

Quoting N. Bell and J. Hoberock, "Thrust: a productivity-oriented library for CUDA", in GPU Computing Gems Jade Edition:

Thrust statically selects a highly-optimized Radix Sort algorithm for sorting primitive types (char, int, float and double) with the standard less and greater comparison operators. For all other types (e.g., user-defined data types) and comparison operators, Thrust uses a general Merge Sort algorithm. Because sorting primitives with Radix Sort is considerably faster than Merge Sort, this static optimization has significant value.
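
As a small illustration of that static selection (a sketch; `my_less` and `two_sort_paths` are illustrative names, and the exact dispatch rule may vary across Thrust versions, but this was the documented behavior at the time):

```cpp
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sort.h>

// A user-defined comparator, even one equivalent to operator<, is not
// recognized by Thrust's static dispatch, so it takes the general
// Merge Sort path instead of the Radix Sort one.
struct my_less
{
    __host__ __device__ bool operator()(int a, int b) const { return a < b; }
};

void two_sort_paths(thrust::device_vector<int>& keys,
                    thrust::device_vector<int>& vals)
{
    // Primitive keys + thrust::less: statically dispatched to Radix Sort.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin(),
                        thrust::less<int>());

    // Same ordering, user-defined comparator: dispatched to Merge Sort.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin(), my_less());
}
```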

Now, Merge Sort requires O(N) memory space, see Space requirements of a merge-sort.

Furthermore, Radix Sort still requires O(N) memory space, see Comparison of Bucket Sort and RADIX Sort.
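
The O(N) auxiliary space is easy to see with CUB's `cub::DeviceRadixSort` (the library mentioned in the comments above): besides the separate output buffers, it requires a temporary allocation whose size is queried by a first call with a null temp pointer. A sketch, assuming `unsigned` keys and `int` values:

```cpp
#include <cub/cub.cuh>

// Query, allocate, then perform a key/value radix sort with CUB.
// The temp buffer plus the d_*_out output buffers make the O(N)
// extra memory of radix sort explicit.
void cub_sort_pairs(const unsigned* d_keys_in, unsigned* d_keys_out,
                    const int* d_vals_in, int* d_vals_out, int n)
{
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;

    // First call: d_temp == nullptr, so only temp_bytes is computed.
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_vals_in, d_vals_out, n);
    cudaMalloc(&d_temp, temp_bytes);

    // Second call: the actual sort.
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_vals_in, d_vals_out, n);
    cudaFree(d_temp);
}
```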

Which of the two consumes more memory is not fixed in general; it depends on the input sequence to be sorted as well as on algorithm tuning parameters, see the comments to one of the answers to Why quicksort is more popular than radix-sort?.

By contrast, Quick Sort requires O(log N) memory space if performed in-place, and O(N) otherwise. For a CUDA implementation of the Quick Sort algorithm, you may have a look at How Tesla K20 Speeds Quicksort.

For other in-place sorting algorithms (the in-place strategy is worth exploring, since it saves memory compared to its non-in-place counterpart), have a look at Bitonic Sort, see Fast in-place sorting with CUDA based on bitonic sort.
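
For reference, a minimal in-place bitonic sort sketch (assumptions: the element count `n` is a power of two, a multiple of the block size, and small enough for one thread per element in a single grid; `float` keys and the 256-thread block are arbitrary choices):

```cuda
// One compare-exchange stage of bitonic sort, entirely in place.
__global__ void bitonic_step(float* d, unsigned j, unsigned k)
{
    unsigned i   = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned ixj = i ^ j;                  // partner element for this stage
    if (ixj > i) {
        bool ascending = ((i & k) == 0);   // direction of this subsequence
        if ((ascending  && d[i] > d[ixj]) ||
            (!ascending && d[i] < d[ixj])) {
            float t = d[i]; d[i] = d[ixj]; d[ixj] = t;  // swap in place
        }
    }
}

void bitonic_sort(float* d_data, unsigned n)   // n must be a power of two
{
    const unsigned threads = 256;
    dim3 block(threads), grid(n / threads);
    for (unsigned k = 2; k <= n; k <<= 1)          // bitonic sequence size
        for (unsigned j = k >> 1; j > 0; j >>= 1)  // compare-exchange stride
            bitonic_step<<<grid, block>>>(d_data, j, k);
    cudaDeviceSynchronize();
}
```

The only extra storage is a handful of registers per thread, which is what makes bitonic sort attractive when memory, not speed, is the constraint.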

– Vitality