
I am interested in porting code I wrote mostly with the Thrust GPU library to multicore CPUs. Thankfully, the Thrust website says that Thrust code can be used with threading environments such as OpenMP / Intel TBB.

I wrote the simple code below, which sorts a large array, to check the speedup on a machine that supports up to 16 OpenMP threads.

The timings obtained on this machine for sorting a random array of size 16 million are

STL : 1.47 s
Thrust (16 threads) : 1.21 s

There seems to be barely any speedup. I would like to know how to get a substantial speedup for sorting arrays using OpenMP, like I do with GPUs.

The code is below (the file sort.cu). Compilation was performed as follows:

nvcc -O2 -o sort sort.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_BACKEND_OMP -lgomp

The NVCC version is 5.5. The Thrust library version is v1.7.0.

#include <iostream>
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <thrust/sort.h>

int main(int argc, char *argv[])
{
  int N = 16000000;
  double* myarr = new double[N];

  for (int i = 0; i < N; ++i)
    {
      myarr[i] = (1.0*rand())/RAND_MAX;
    }
  std::cout << "-------------\n";

  clock_t start, stop;
  start = clock();
  std::sort(myarr, myarr+N);
  stop = clock();

  std::cout << "Time taken for sorting the array with STL is "
            << (stop-start)/(double)CLOCKS_PER_SEC << std::endl;

  //--------------------------------------------

  // regenerate the same random sequence (rand() unseeded behaves as srand(1))
  srand(1);
  for (int i = 0; i < N; ++i)
    {
      myarr[i] = (1.0*rand())/RAND_MAX;
    }

  start = clock();
  thrust::sort(myarr, myarr+N);
  stop = clock();

  std::cout << "------------------\n";

  std::cout << "Time taken for sorting the array with Thrust is "
            << (stop-start)/(double)CLOCKS_PER_SEC << std::endl;
  return 0;
}
smilingbuddha
  • Don't use clock(); use `omp_get_wtime()`. Sorting is something I have been meaning to look into, but since it's `n log(n)` my guess is that the operation is memory-bandwidth bound and so it can't benefit much from multiple fast cores. The situation on the GPU (or Xeon Phi) is different because the ratio between "core" speed and memory speed is much lower. – Z boson Oct 18 '14 at 09:40
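
As a minimal sketch of the timing approach suggested in the comment above (assuming compilation with -fopenmp; this example is not from the original post):

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <omp.h>

int main()
{
  const int N = 16000000;
  double *a = new double[N];
  for (int i = 0; i < N; ++i)
    a[i] = (1.0*rand())/RAND_MAX;

  // omp_get_wtime() returns wall-clock time, so unlike clock()
  // it is not inflated by the number of threads doing work
  double t0 = omp_get_wtime();
  std::sort(a, a+N);
  double t1 = omp_get_wtime();
  std::cout << "sort took " << (t1 - t0) << " s\n";

  delete[] a;
  return 0;
}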

1 Answer


The device backend refers to the behavior of operations performed on a thrust::device_vector (or a similar device reference, such as a thrust::device_ptr). Thrust interprets the raw array/pointer you are passing it as a host pointer and performs host-based operations on it, which are not affected by the device backend setting.
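
One alternative worth noting: Thrust 1.7 introduces execution policies, and thrust::omp::par can dispatch an algorithm to the OMP system directly, even on a raw host pointer. A minimal sketch (assuming compilation with OpenMP enabled; this variant is not part of the answer's code below):

#include <cstdlib>
#include <thrust/sort.h>
#include <thrust/system/omp/execution_policy.h>

int main()
{
  const int N = 16000000;
  double *myarr = new double[N];
  for (int i = 0; i < N; ++i)
    myarr[i] = (1.0*rand())/RAND_MAX;

  // the thrust::omp::par policy sends the sort to the OpenMP
  // backend explicitly, even though myarr is a raw host pointer
  thrust::sort(thrust::omp::par, myarr, myarr + N);

  delete[] myarr;
  return 0;
}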

There are a variety of ways to fix this issue. If you read the device backend documentation, you will find general examples and OMP-specific examples. You could even specify a different host backend, which should also give the desired behavior (OMP usage) with your unmodified code, I think.
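
For the host-backend route, the selection happens at compile time via the THRUST_HOST_SYSTEM macro; a hedged example invocation (untested here, but following the documented pattern for Thrust 1.7):

$ nvcc -O2 -o sort sort.cu -Xcompiler -fopenmp -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_OMP -lgomp

With this setting, algorithms invoked on host pointers (as in the question's code) would dispatch to the OMP host system without source changes.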

Once you fix this, you may get one more surprise: thrust appears to sort the array quickly, but reports a very long execution time. I believe this is because (on Linux, anyway) clock() measures CPU time summed across all threads rather than wall-clock time, so its reading grows with the number of OMP threads in use.
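
A small demonstration of that clock() behavior (illustrative only, not part of the answer; compile with OpenMP enabled):

#include <cstdio>
#include <ctime>
#include <omp.h>

int main()
{
  double w0 = omp_get_wtime();
  clock_t c0 = clock();
  #pragma omp parallel
  {
    // busy-wait roughly one second of wall time in every thread
    double start = omp_get_wtime();
    while (omp_get_wtime() - start < 1.0) { }
  }
  clock_t c1 = clock();
  double w1 = omp_get_wtime();
  // clock() accumulates CPU time across all threads, so with N
  // threads it reads roughly N times the elapsed wall time
  printf("wall: %.2f s   clock(): %.2f s\n",
         w1 - w0, (c1 - c0)/(double)CLOCKS_PER_SEC);
  return 0;
}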

The following code/sample run has those issues addressed, and seems to give me a ~3x speedup for 4 threads.

$ cat t592.cu
#include <iostream>
#include <algorithm>
#include <cstdlib>
#include <sys/time.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

int main(int argc, char *argv[])
{
  int N = 16000000;
  double* myarr = new double[N];

  for (int i = 0; i < N; ++i)
    {
      myarr[i] = (1.0*rand())/RAND_MAX;
    }
  std::cout << "-------------\n";

  timeval t1, t2;
  gettimeofday(&t1, NULL);
  std::sort(myarr, myarr+N);
  gettimeofday(&t2, NULL);
  float et = (((t2.tv_sec*1000000)+t2.tv_usec)-((t1.tv_sec*1000000)+t1.tv_usec))/float(1000000);

  std::cout << "Time taken for sorting the array with STL  is " << et << std::endl;

  //--------------------------------------------

  // regenerate the same random sequence (rand() unseeded behaves as srand(1))
  srand(1);
  for (int i = 0; i < N; ++i)
    {
      myarr[i] = (1.0*rand())/RAND_MAX;
    }
  // wrap the raw host pointer so thrust dispatches to the device (OMP) backend
  thrust::device_ptr<double> darr = thrust::device_pointer_cast<double>(myarr);
  gettimeofday(&t1, NULL);
  thrust::sort(darr, darr+N);
  gettimeofday(&t2, NULL);
  et = (((t2.tv_sec*1000000)+t2.tv_usec)-((t1.tv_sec*1000000)+t1.tv_usec))/float(1000000);

  std::cout << "------------------\n";

  std::cout << "Time taken for sorting the array with Thrust  is " << et << std::endl;
  return 0;
}

$ nvcc -O2 -o t592 t592.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_BACKEND_OMP -lgomp
$ OMP_NUM_THREADS=4 ./t592
-------------
Time taken for sorting the array with STL  is 1.31956
------------------
Time taken for sorting the array with Thrust  is 0.468176
$

Your mileage may vary. In particular, you may not see any improvement as you go above 4 threads: a number of factors can prevent an OMP code from scaling beyond a certain thread count. Sorting generally tends to be a memory-bandwidth-bound algorithm, so you will probably observe a speedup until you have saturated the memory subsystem, and no further increase from additional cores after that. Depending on your system, you could already be in this situation, in which case you may not see any improvement from OMP-style multithreading.
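
To find where scaling flattens out on a particular machine, one simple check (an assumed invocation, using the same binary as above) is to sweep the thread count:

$ for t in 1 2 4 8 16; do echo "threads: $t"; OMP_NUM_THREADS=$t ./t592; done

Once the reported Thrust time stops dropping as the thread count doubles, the memory subsystem is the likely bottleneck.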

Robert Crovella
  • Isn't sorting an `n log(n)` operation, so it should be memory-bandwidth bound and not benefit much from multiple fast cores (on a single-socket system)? But I guess your result shows that I'm wrong... – Z boson Oct 18 '14 at 09:42
  • Actually, memory bandwidth is the most likely reason that the performance increases up to about 4 cores but then levels out. A modern Intel Xeon cannot saturate the memory bus from requests originating from a single core. In order to use the full memory bandwidth available on a single Xeon socket, it's necessary to issue requests from multiple cores, but approximately 4 cores should be enough to saturate the bus attached to a single socket. My processor in this case was an Intel Ivy Bridge Xeon. – Robert Crovella Oct 18 '14 at 12:40
  • I agree multiple threads are necessary to maximize the bandwidth even on a single-socket system; I figured that out at [measuring-memory-bandwidth-from-the-dot-product-of-two-arrays](http://stackoverflow.com/questions/25179738/measuring-memory-bandwidth-from-the-dot-product-of-two-arrays). But in my experience multiple threads increase the bandwidth by less than a factor of two, and you get a factor of three, which is more than I would expect. – Z boson Oct 18 '14 at 12:50
  • How many real cores does your Ivy Bridge Xeon have? What exactly is the processor? – Z boson Oct 18 '14 at 13:05
  • It's a Xeon E5-2697 v2, 12 cores. And on my system, the single-thread comparison between thrust and STL is about 1.32 s for STL vs. 0.88 s for thrust (OP gets a different ratio). So really I'm only getting about 2x from the multi-threaded contribution, which is probably in line with the available memory bandwidth increase. – Robert Crovella Oct 18 '14 at 14:21
  • Oh, so it makes sense after all. Of course, to compare apples to apples we should compare single-threaded thrust::sort (not std::sort) against the multi-threaded version. One last question (I think): is my instinct correct that the scaling is better on the GPU/Xeon Phi, where the "cores" are slower relative to memory? – Z boson Oct 18 '14 at 22:50
  • The scaling is better on accelerators because there is more memory bandwidth, in my opinion. If you prefer to phrase that another way, go ahead. But ultimately it is the memory bandwidth that determines peak perf for a memory bound algorithm. I've given you all the code needed to compare thrust::sort single thread instead of std::sort. Just change the OMP_NUM_THREADS environment variable, which I've already given an example of in my answer. – Robert Crovella Oct 18 '14 at 23:17
  • The example gives no real difference for me, regardless of what I set `OMP_NUM_THREADS` to. – user2023370 Mar 05 '16 at 17:59