Slow sorting using Thrust, CUDA

Question

I am a newbie to CUDA. I simply tried to sort an array using Thrust.

clock_t start_time = clock(); 

thrust::host_vector<int> h_vec(10);
thrust::generate(h_vec.begin(), h_vec.end(), rand);
thrust::device_vector<int> d_vec = h_vec;

thrust::sort(d_vec.begin(), d_vec.end());
//thrust::sort(h_vec.begin(), h_vec.end());

clock_t stop_time = clock(); 
printf("%f\n", (double)(stop_time - start_time) / CLOCKS_PER_SEC);

Sorting d_vec takes 7.4 s, while sorting h_vec takes 0.4 s.

I assumed the sort runs in parallel on device memory, so shouldn't it be faster?

Most likely you are measuring the [context creation time](http://stackoverflow.com/q/10415204/5085250). Additionally you cannot expect that *small* vectors are sorted faster on highly parallel architectures. Try with vector sizes >> 10000. — havogt, Jul 05 '16 at 09:05
Yes, you are right, these timings were from the first execution. I also tried 50k points and measured 0.12 s on both host and device. The difference becomes large when the size is close to 100000. Can I assume that sorting `h_vec` runs on the CPU? — Syed, Jul 05 '16 at 09:54
Yes, sorting on `h_vec` is done on the host. Perhaps you should read the [thrust quick start guide](https://github.com/thrust/thrust/wiki/Quick-Start-Guide), which discusses dispatch of thrust algorithms. — Robert Crovella, Jul 05 '16 at 13:40

score 3 · Accepted Answer · edited May 23 '17 at 11:58

The main problem is probably context creation time: the first CUDA call initializes the CUDA context, which takes some time (see the context creation link in the comments on the question). You should therefore start measuring time only after the first CUDA call.
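A minimal sketch of that idea, assuming `cudaFree(0)` as the first CUDA call (a common idiom for forcing context creation, though any runtime call would do). The helper name `time_copy_and_sort` is made up for this illustration, and `cudaDeviceSynchronize()` is called before the timer stops so that all queued device work is included:

```cuda
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: wall-clock seconds spent copying n random ints
// to the device and sorting them there.
double time_copy_and_sort(std::size_t n) {
    thrust::host_vector<int> h_vec(n);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    auto start = std::chrono::steady_clock::now();
    thrust::device_vector<int> d_vec = h_vec;   // host-to-device copy
    thrust::sort(d_vec.begin(), d_vec.end());   // sort on the device
    cudaDeviceSynchronize();                    // wait for the GPU to finish
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    // Assumption: cudaFree(0) serves only as a cheap first runtime call
    // that triggers context creation, so its cost is measured separately.
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("context creation: %f s\n",
                std::chrono::duration<double>(t1 - t0).count());

    // The context already exists here, so this measures only copy + sort.
    std::printf("copy + sort of 10 ints: %f s\n", time_copy_and_sort(10));
    return 0;
}
```

On a typical setup the first line should account for most of the time seen in the question, but the exact numbers depend on the GPU and driver.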

In general, GPU code is only faster than CPU code if the degree of parallelism is high enough. A vector of size 10, as in the example code, is far too small to achieve any speed-up. With a vector size >> 10000 you can expect to fully utilize a modern GPU.
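To see where the device starts to win on a given machine, one option is to time both sorts over a range of sizes. The following is only a sketch: the helper names and the list of sizes are arbitrary choices for illustration, the context is created before any timing, and the host-to-device copy is kept outside the timed region.

```cuda
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: seconds elapsed since `start`.
double seconds_since(Clock::time_point start) {
    return std::chrono::duration<double>(Clock::now() - start).count();
}

// Hypothetical helper: n random ints in host memory.
thrust::host_vector<int> random_ints(std::size_t n) {
    thrust::host_vector<int> v(n);
    thrust::generate(v.begin(), v.end(), rand);
    return v;
}

// Sorts a copy of `input` on the host (CPU) and returns the seconds taken.
double time_host_sort(const thrust::host_vector<int>& input) {
    thrust::host_vector<int> h_vec = input;     // keep the input unsorted
    auto start = Clock::now();
    thrust::sort(h_vec.begin(), h_vec.end());   // dispatched to the host
    return seconds_since(start);
}

// Sorts a copy of `input` on the device; the copy itself is not timed.
double time_device_sort(const thrust::host_vector<int>& input) {
    thrust::device_vector<int> d_vec = input;   // copy before the timer starts
    cudaDeviceSynchronize();
    auto start = Clock::now();
    thrust::sort(d_vec.begin(), d_vec.end());   // dispatched to the device
    cudaDeviceSynchronize();                    // wait for the sort to finish
    return seconds_since(start);
}

int main() {
    cudaFree(0);  // create the CUDA context before any measurement

    const std::size_t sizes[] = {10, 1000, 100000, 10000000};
    for (std::size_t n : sizes) {
        thrust::host_vector<int> input = random_ints(n);
        std::printf("n = %zu  host: %f s  device: %f s\n",
                    n, time_host_sort(input), time_device_sort(input));
    }
    return 0;
}
```

The crossover point depends on the hardware, so the sizes in the loop are worth adjusting to the machine at hand.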

You should also consider measuring only the sort itself, without the copy `d_vec = h_vec`, since you will often keep working with the device vector in the next step. The copy can then be treated as a one-time setup cost. (If sorting is the only operation on the device, however, it is of course reasonable to include the memory copy in the measurement.)
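One way to separate the two costs is to bracket the copy and the sort with CUDA events. This is a sketch under stated assumptions: the struct `SortTiming` and the function `time_copy_and_sort_separately` are invented for this example, the events are recorded on the default stream, and the "copy" interval also includes the device allocation performed by the `device_vector` constructor.

```cuda
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>

#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical result type for this illustration.
struct SortTiming {
    float copy_ms;  // device allocation + host-to-device copy
    float sort_ms;  // thrust::sort on the device only
    bool sorted;    // sanity check on the result
};

SortTiming time_copy_and_sort_separately(std::size_t n) {
    thrust::host_vector<int> h_vec(n);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    cudaEvent_t begin, copied, done;
    cudaEventCreate(&begin);
    cudaEventCreate(&copied);
    cudaEventCreate(&done);

    cudaEventRecord(begin);
    thrust::device_vector<int> d_vec = h_vec;   // one-time setup cost
    cudaEventRecord(copied);
    thrust::sort(d_vec.begin(), d_vec.end());   // the part of interest
    cudaEventRecord(done);
    cudaEventSynchronize(done);                 // wait until `done` is reached

    SortTiming t;
    cudaEventElapsedTime(&t.copy_ms, begin, copied);
    cudaEventElapsedTime(&t.sort_ms, copied, done);
    t.sorted = thrust::is_sorted(d_vec.begin(), d_vec.end());

    cudaEventDestroy(begin);
    cudaEventDestroy(copied);
    cudaEventDestroy(done);
    return t;
}

int main() {
    cudaFree(0);  // create the CUDA context before any measurement

    SortTiming t = time_copy_and_sort_separately(1 << 20);
    std::printf("copy: %f ms  sort: %f ms  sorted: %s\n",
                t.copy_ms, t.sort_ms, t.sorted ? "yes" : "no");
    return 0;
}
```

Reporting the two numbers separately makes it possible to decide later whether the copy belongs in the comparison for a given workload.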
