
I am trying to use thrust to reduce an array of 1M elements to a single value. My code is as follows:

#include<chrono>
#include<iostream>

#include<thrust/host_vector.h>
#include<thrust/device_vector.h>
#include<thrust/reduce.h>


int main()
{
    int N,M;
    N = 1000;
    M = 1000;
    thrust::device_vector<float> D(N*M,5.0);
    int sum;
    
    auto start = std::chrono::high_resolution_clock::now();
    sum = thrust::reduce(D.begin(),D.end(),(float)0,thrust::plus<float>());
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end-start);

    std::cout<<duration.count()<<" ";
    std::cout<<sum;
}

The issue is, thrust::reduce alone takes about 4 ms to run on my RTX 3070 laptop GPU. This is considerably slower than code I can write based on reduction #4 in this CUDA reference by Mark Harris, which takes about 150 microseconds. Am I doing something wrong here?

EDIT 1: Changed high_resolution_clock to steady_clock. thrust::reduce now takes 2 ms to run. Updated code is as follows:

#include<chrono>
#include<iostream>

#include<thrust/host_vector.h>
#include<thrust/device_vector.h>
#include<thrust/reduce.h>


int main()
{
    int N,M;
    N = 1000;
    M = 1000;
    thrust::device_vector<float> D(N*M,5.0);
    int sum;
    
    auto start = std::chrono::steady_clock::now();
    
    sum = thrust::reduce(D.begin(),D.end(),(float)0,thrust::plus<float>());
    auto end = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration<double,std::ratio<1,1000>>(end-start);

    std::cout<<duration.count()<<" ";
    std::cout<<sum;
}

Additional information:
I am running CUDA C++ on Ubuntu in WSL2.
CUDA version: 11.4
I am using the nvcc compiler to compile:

nvcc -o reduction reduction.cu

To run:

./reduction
  • Side note: Careful with `high_resolution_clock`. If you aren't careful you might find its behaviour unfortunate. For example, it can be implemented as an alias of `system_clock`, and you can get results where time goes backwards. Worse, a Windows desktop is not built to give high-precision timing. Sometimes a task can be finished in microseconds and still report a full timer tick of close to 16 ms, because that's how often the clock is updated. – user4581301 Mar 18 '22 at 06:33
  • In fact, [here's what one of the people most responsible for the creation of `high_resolution_clock` has to say](https://stackoverflow.com/a/37440647/4581301) about using it. – user4581301 Mar 18 '22 at 06:42
  • While thrust is simple, it is unfortunately often not very performant. Try CUB instead; it should be able to saturate (the memory of) the GPU. Note that CUB is now a dependency of thrust in the latest version AFAIK. – Jérôme Richard Mar 18 '22 at 08:54
  • Maybe you should also include your measurement code for the reference solution. – paleonix Mar 18 '22 at 10:13
  • I get ~500us running your code. So yes there is something wrong with your test case. Perhaps you are building a debug project for the thrust case. When I build a debug version of your code I get about a 5ms report. – Robert Crovella Mar 18 '22 at 14:10
  • @paleonix There's no separate code I'm using for measurement, I just used chrono to find and display the duration in the code itself – thePhantom Mar 18 '22 at 16:47
  • @user4581301 I removed high_resolution_clock and replaced it with steady_clock. It runs in 2ms now. – thePhantom Mar 18 '22 at 16:50
  • @RobertCrovella I'm not sure what you mean. I'm just using the nvcc compiler to compile and run the program, without enabling debug mode. Can you please elaborate? – thePhantom Mar 18 '22 at 16:53
  • What I mean is, if you are compiling with the device debug switch (`-G`), then performance of CUDA code may be impacted. For performance-related questions, I usually suggest that people provide, in addition to what you have provided so far, the operating environment in use (e.g. Linux or Windows, as well as CUDA versions) and the exact compile command being used. – Robert Crovella Mar 18 '22 at 17:06
  • When you edit the question to include the extra details and build command feel free to update the code to your current version. – user4581301 Mar 18 '22 at 17:28

1 Answer


Am I doing something wrong here?

I would not say you are doing anything wrong here. However, that might be a matter of opinion. Let's unpack it a bit, using a profiler. I'm not using the exact same setup as you (I'm using a different GPU, a Tesla V100, on Linux, with CUDA 11.4). In my case the measurement reported by the code is ~0.5 ms, not 2 ms.

  • The profiler tells me that the thrust::reduce is accomplished under the hood via a call to cub::DeviceReduceKernel followed by cub::DeviceReduceSingleTileKernel. This two-kernel approach should make sense to you if you have studied Mark Harris' reduction material. The profiler tells me that, together, these two calls account for ~40 us of the ~500 us overall time. This is the time that would be most comparable to the measurement you made of your implementation of Mark Harris' reduction code, assuming you are timing the kernels only. If we multiply by 4 to account for the overall perf ratio between the two GPUs, it is pretty close to your 150 us measurement of that.
  • The profiler tells me that the big contributors to the ~500 us reported time in my case are a call to cudaMalloc (~200 us) and a call to cudaFree (~200 us). This isn't surprising, because if you study the cub::DeviceReduce methodology that is evidently being used by thrust, it requires an initial call to size and allocate temporary storage. Since thrust provides a self-contained call for thrust::reduce, it has to perform that call, as well as do a cudaMalloc and cudaFree operation for the indicated temporary storage, on every invocation.

So is there anything that can be done?

The thrust designers were aware of this situation. To get a (closer to) apples-to-apples comparison between just measuring the kernel duration(s) of a CUDA C++ implementation and using thrust to do the same thing, you could use a profiler to compare measurements, or else take control of the temporary allocations yourself.
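For the profiler route, something like the following should show the per-kernel and per-API-call timings. This assumes Nsight Systems is installed (it ships with a standard CUDA 11.4 toolkit install) and that CUDA tracing works in your WSL2 setup; nvprof does not support an Ampere-class GPU like the RTX 3070:

nsys profile --stats=true ./reduction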

One way to do this would be to switch from thrust to cub.
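For illustration, here is a minimal sketch of that, using the CUB headers that ship with the CUDA 11.x toolkit. The temporary-storage size query and the cudaMalloc are done once, up front, so the portion you time is just the two reduction kernels:

#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <cstdio>

int main()
{
    int num_items = 1000*1000;
    thrust::device_vector<float> D(num_items, 5.0f);
    thrust::device_vector<float> result(1);

    float* d_in  = thrust::raw_pointer_cast(D.data());
    float* d_out = thrust::raw_pointer_cast(result.data());

    // First call with a null temp pointer only queries the required size.
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);   // done once, up front

    // This is the part to time: just the reduction kernels.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
    cudaDeviceSynchronize();                           // the kernels launch asynchronously

    float sum = result[0];                             // device-to-host copy of the result
    std::printf("%f\n", sum);

    cudaFree(d_temp_storage);
}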

The thrust way to do it is to use a thrust custom allocator.
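A rough sketch of that approach follows. The allocator type and the pool size here are my own choices for illustration, not anything thrust provides; the allocator simply hands thrust a scratch buffer that was allocated once up front, and it assumes a single outstanding temporary allocation per call (which is the case for reduce):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include <cstddef>
#include <cassert>
#include <cstdio>

// Hypothetical allocator name; the minimal interface thrust needs is a
// char value_type, allocate(num_bytes) and deallocate(ptr, n).
struct preallocated_temp_allocator
{
    typedef char value_type;

    char*  pool       = nullptr;   // device scratch buffer
    size_t pool_bytes = 0;

    preallocated_temp_allocator(size_t bytes) : pool_bytes(bytes)
    {
        cudaMalloc(&pool, bytes);  // one-time allocation, outside the timed region
    }
    ~preallocated_temp_allocator() { cudaFree(pool); }

    char* allocate(std::ptrdiff_t num_bytes)
    {
        // Sketch only: no fallback if the request exceeds the pool, and no
        // handling of multiple outstanding allocations.
        assert(static_cast<size_t>(num_bytes) <= pool_bytes);
        return pool;
    }
    void deallocate(char*, size_t) {}  // pool is freed in the destructor
};

int main()
{
    thrust::device_vector<float> D(1000*1000, 5.0f);

    preallocated_temp_allocator alloc(16 << 20);  // 16 MB is an arbitrary, generous guess

    // Passing the allocator via the execution policy makes thrust use it for
    // its internal temporary storage instead of calling cudaMalloc/cudaFree.
    float sum = thrust::reduce(thrust::cuda::par(alloc),
                               D.begin(), D.end(), 0.0f, thrust::plus<float>());

    std::printf("%f\n", sum);
}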

There may be a few other detail differences in methodology that are impacting your measurement. For example, the thrust call intrinsically copies the reduction result back to host memory. You may or may not be timing that step in your other approach, which you haven't shown. But according to my profiler measurement, that only accounts for a few microseconds.

– Robert Crovella