
I was looking to use Thrust in my CUDA application, so I ran the following simple test to see the performance of thrust::sort:

#include <iostream>
#include <cstdlib>   // rand
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

int main()
{
    int min = 1;
    int max = 1024*1024;
    int n = 1024*1024;

    thrust::host_vector<int> h_input(n);
    thrust::host_vector<int> h_keysSorted(n);

    //fill host input with random data
    for(int i=0; i<n; i++){
        h_input[i] = min + (rand() % (int)(max - min + 1));
    }

    thrust::device_vector<int> d_input(n);

    // time the host->device copy plus the sort
    float elapsedTime;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start,0);
    d_input = h_input;
    thrust::sort(d_input.begin(), d_input.end());

    cudaEventRecord(stop,0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    thrust::copy(d_input.begin(), d_input.end(), h_keysSorted.begin());
    std::cout<<"Elapsed time: "<<elapsedTime<<std::endl;
}

Aside from the excessively long compile time and CUDA context creation, the above code took just over 200 ms to sort 1,048,576 integers on my GTX 770M. This is horrible. For example, the paper reports timings of just under 2 ms for sorting arrays of the same size, and I have found CPU timings of less than 200 ms.

I assume that I am doing something obviously wrong, but I can't see what it is. Does anyone know why Thrust is taking so long? What am I doing wrong?

James
  • Windows or Linux? If on Windows, did you compile and run a debug project or a release project? – Robert Crovella Feb 12 '17 at 04:54
  • @RobertCrovella I am running on Ubuntu and compiling from the command line – James Feb 12 '17 at 04:55
  • what is your compile command? – Robert Crovella Feb 12 '17 at 04:56
  • Thank you, I found the error from your comment. It appears that including the -g -G command-line options results in the poor performance. Without them the code runs in about 7 ms. Still slower than the paper, but much more reasonable. I didn't think about the -g -G options! Thanks. – James Feb 12 '17 at 05:01
  • Yes. Compiling with `-G` generally results in much slower code execution. Also, you are timing the host->device copy of the data here: `d_input= h_input;` Most of the published timings I am aware of do not include that in the timing. You can realize numbers comparable to published numbers by moving the `cudaEventRecord(start,0);` statement after that line. Of course, your GPU is not as fast as the one used for the published numbers you reference, so I still don't think you'll see ~2ms. – Robert Crovella Feb 12 '17 at 05:03
  • You are correct; when I move the d_input = h_input outside the timed region I now get under 6 ms. Factoring in my slower GPU probably makes up most of the difference with the paper. If you want to post the answer about not using the -G option, I will accept it for others to see. Otherwise I can just delete this question. Thanks again. – James Feb 12 '17 at 05:06
  • On a Quadro K5000 GPU (GK104 - faster than your GTX 770m -- approximately double the memory bandwidth) I get around 2.15 ms for your code running on linux with CUDA 7.5, with the `cudaEventRecord` statement moved as I suggested. – Robert Crovella Feb 12 '17 at 05:14

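Putting the suggestions from the comments together, here is a minimal sketch of the corrected measurement, assuming a plain release build (e.g. nvcc -O3 sort_test.cu, without -g -G; the file name is just an example). The start event is recorded only after the host-to-device copy, so only the sort itself is timed; variable names follow the original post.

#include <iostream>
#include <cstdlib>   // rand
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main()
{
    const int min = 1;
    const int max = 1024*1024;
    const int n = 1024*1024;

    // fill host input with random data
    thrust::host_vector<int> h_input(n);
    for(int i=0; i<n; i++){
        h_input[i] = min + (rand() % (max - min + 1));
    }

    // host->device copy happens before timing starts
    thrust::device_vector<int> d_input = h_input;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);   // start timing after the copy
    thrust::sort(d_input.begin(), d_input.end());
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float elapsedTime = 0.0f;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    std::cout << "sort time (ms): " << elapsedTime << std::endl;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

With this arrangement only the device-side sort is inside the timed region, which matches how the published figures referenced in the question were apparently measured.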