I am trying to use Thrust to reduce an array of 1M elements to a single value. My code is as follows:
#include <chrono>
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

int main()
{
    int N, M;
    N = 1000;
    M = 1000;
    // N*M = 1,000,000 floats on the device, each initialized to 5.0
    thrust::device_vector<float> D(N * M, 5.0f);
    float sum;  // same type as the elements being reduced
    // Time only the reduction call
    auto start = std::chrono::high_resolution_clock::now();
    sum = thrust::reduce(D.begin(), D.end(), 0.0f, thrust::plus<float>());
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << duration.count() << " ";  // elapsed time in microseconds
    std::cout << sum;
}
The issue is that thrust::reduce alone takes about 4 ms to run on my RTX 3070 laptop GPU. This is considerably slower than code I can write based on reduction #4 in this CUDA reference by Mark Harris, which takes about 150 microseconds. Am I doing something wrong here?
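For readers without the reference at hand, here is a minimal sketch of what such a kernel can look like, assuming the reference is Harris's "Optimizing Parallel Reduction in CUDA" slides, where reduction #4 is the "first add during global load" variant. This is not the code that was timed above. The kernel name `reduce4`, the block size of 256, the bounds checks, and the final summation on the host are assumptions made for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Sketch of a "first add during global load" block reduction.
// Each block of blockDim.x threads reduces up to 2*blockDim.x inputs
// to one partial sum. Assumes blockDim.x is a power of two.
__global__ void reduce4(const float* in, float* out, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // The first add happens while loading from global memory (bounds-checked).
    float v = 0.0f;
    if (i < n) v = in[i];
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    // Tree reduction in shared memory with sequential addressing.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

int main()
{
    const unsigned int n = 1000 * 1000;
    const unsigned int threads = 256;  // assumed block size
    const unsigned int blocks = (n + 2 * threads - 1) / (2 * threads);

    std::vector<float> h_in(n, 5.0f), h_partial(blocks);
    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // One launch produces one partial sum per block.
    reduce4<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    // Finish on the host: add up the per-block partial sums.
    float sum = 0.0f;
    for (float p : h_partial) sum += p;
    std::printf("%f\n", sum);

    cudaFree(d_in);
    cudaFree(d_partial);
}
```

Note that a kernel like this only produces per-block partial sums, so any timing comparison with thrust::reduce should state whether the final combination step and the copy back to the host are included.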
EDIT 1: I changed high_resolution_clock to steady_clock. thrust::reduce now takes 2 ms to run. The updated code is as follows:
#include <chrono>
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

int main()
{
    int N, M;
    N = 1000;
    M = 1000;
    // N*M = 1,000,000 floats on the device, each initialized to 5.0
    thrust::device_vector<float> D(N * M, 5.0f);
    float sum;  // same type as the elements being reduced
    // Time only the reduction call, using a monotonic clock
    auto start = std::chrono::steady_clock::now();
    sum = thrust::reduce(D.begin(), D.end(), 0.0f, thrust::plus<float>());
    auto end = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration<double, std::milli>(end - start);
    std::cout << duration.count() << " ";  // elapsed time in milliseconds
    std::cout << sum;
}
Additional information:
I am running CUDA C++ on Ubuntu in WSL2.
CUDA version: 11.4
I am using the nvcc compiler to compile:
nvcc -o reduction reduction.cu
To run:
./reduction