
I'm looking for a fast GPU implementation of a sort algorithm for large arrays (hundreds of millions of elements). I've already tried the CUDPP one and got between 450M and 500M pairs per second (4-byte keys + 4-byte values). That didn't look bad, but it's still in the ballpark of what a CPU can do.

Then I stumbled upon this: https://code.google.com/p/back40computing/wiki/RadixSorting claiming 700M keys+values/sec on a GTX 480. I said: wow! I'm running a Tesla K10, so far more powerful hardware; I have to try this! I got the code, compiled it for compute capability 3.0, and tried it... I get more or less the same numbers as the CUDPP code.

Digging deeper, it looks like CUDPP uses the radix sort from Thrust, and the b40c algorithm has been incorporated into Thrust, so all in all I may well be running the same code. I have been playing with a number of parameters (block size, grid size, etc.) in the b40c code but only managed to make things worse.

So here is the question: has anybody tested either the CUDPP radix sort or the b40c radix sort on a different (more powerful) GPU? Anywhere near 700M keys+values/sec? Any magic button to push? The Nsight profiler reports a miserable 25% GPU utilization (with shared memory access as the bottleneck)...
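For reference, a minimal sketch of the kind of benchmark I'm running through Thrust (the timing harness, sizes, and variable names here are my own, not CUDPP's or b40c's; `thrust::sort_by_key` dispatches to a radix sort for unsigned integer keys):

```cpp
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const int n = 1 << 26;  // ~67M pairs; scale up toward hundreds of millions as memory allows

    // 4-byte keys + 4-byte values, matching the layout described above
    thrust::host_vector<unsigned int> h_keys(n);
    for (int i = 0; i < n; ++i) h_keys[i] = rand();

    thrust::device_vector<unsigned int> keys = h_keys;
    thrust::device_vector<unsigned int> vals(n);
    thrust::sequence(vals.begin(), vals.end());  // payload = original indices

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.1f Mpairs/s\n", n / ms / 1e3f);  // n pairs / (ms/1000 s), in millions
    return 0;
}
```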

user3030851
  • Looks like CUB has updated code and performance figures for newer GPUs. http://nvlabs.github.io/cub/index.html Will test and come back... – user3030851 Mar 17 '15 at 22:21
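For anyone testing the CUB path mentioned in the comment above, the entry point is `cub::DeviceRadixSort::SortPairs`, which uses a two-call pattern: the first call with a null temp pointer only queries the scratch size (the wrapper function and names below are illustrative):

```cpp
#include <cub/device/device_radix_sort.cuh>
#include <cuda_runtime.h>

// Sorts n (key, value) pairs of 4-byte unsigned ints from the *_in
// buffers into the *_out buffers using CUB's device-wide radix sort.
void sort_pairs(const unsigned int* d_keys_in, unsigned int* d_keys_out,
                const unsigned int* d_vals_in, unsigned int* d_vals_out,
                int n) {
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;

    // First call: d_temp == nullptr, so CUB only writes the required
    // temporary storage size into temp_bytes.
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_vals_in, d_vals_out, n);
    cudaMalloc(&d_temp, temp_bytes);

    // Second call: performs the actual sort.
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_vals_in, d_vals_out, n);
    cudaFree(d_temp);
}
```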

0 Answers