As an exercise to learn about `std::async`, I wrote a small program that calculates the sum of a large `std::vector` of integers, distributed over many threads. My code is as follows:
```cpp
#include <iostream>
#include <vector>
#include <future>
#include <chrono>

typedef unsigned long long int myint;

// Calculate sum of part of the elements in a vector
myint partialSum(const std::vector<myint>& v, int start, int end)
{
    myint sum(0);
    for(int i=start; i<=end; ++i)
    {
        sum += v[i];
    }
    return sum;
}

int main()
{
    const int nThreads = 100;
    const int sizePerThread = 100000;
    const int vectorSize = nThreads * sizePerThread;

    std::vector<myint> v(vectorSize);
    std::vector<std::future<myint>> partial(nThreads);
    myint tot = 0;

    // Fill vector
    for(int i=0; i<vectorSize; ++i)
    {
        v[i] = i+1;
    }

    std::chrono::steady_clock::time_point startTime = std::chrono::steady_clock::now();

    // Start threads
    for( int t=0; t < nThreads; ++t)
    {
        partial[t] = std::async( std::launch::async, partialSum, v, t*sizePerThread, (t+1)*sizePerThread -1);
    }

    // Sum total
    for( int t=0; t < nThreads; ++t)
    {
        myint ps = partial[t].get();
        std::cout << t << ":\t" << ps << std::endl;
        tot += ps;
    }
    std::cout << "Sum:\t" << tot << std::endl;

    std::chrono::steady_clock::time_point endTime = std::chrono::steady_clock::now();
    std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count() <<std::endl;
}
```
My question concerns the calls to `partialSum`, and especially how the large vector is passed. The function is called as follows:

```cpp
partial[t] = std::async( std::launch::async, partialSum, v, t*sizePerThread, (t+1)*sizePerThread -1);
```
with the following definition:

```cpp
myint partialSum(const std::vector<myint>& v, int start, int end)
```
With this approach, the calculation is relatively slow. If I pass `std::ref(v)` to `std::async` instead, the program is a lot quicker and more efficient. That much still makes sense to me.
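For reference, a minimal sketch of the fast variant, wrapped in a hypothetical helper `parallelSumByRef` (not part of the original program). It uses `std::cref`, the `const` counterpart of `std::ref`; because `partialSum` takes a `const` reference, either should work. `std::async` then stores a `std::reference_wrapper` instead of a copy of the vector, so the caller must keep `v` alive until every `get()` has returned. The sketch assumes `nThreads > 0` and that `v.size()` is divisible by `nThreads`, as in the question.

```cpp
#include <functional>  // std::cref
#include <future>
#include <vector>

typedef unsigned long long int myint;

// Same partialSum as in the question: takes the vector by const reference.
myint partialSum(const std::vector<myint>& v, int start, int end)
{
    myint sum(0);
    for (int i = start; i <= end; ++i)
        sum += v[i];
    return sum;
}

// Hypothetical helper: sums v in nThreads equal chunks.
// std::cref(v) makes each task refer to the caller's vector instead of a copy.
myint parallelSumByRef(const std::vector<myint>& v, int nThreads)
{
    // Assumes v.size() is divisible by nThreads, as in the question.
    const int sizePerThread = static_cast<int>(v.size()) / nThreads;
    std::vector<std::future<myint>> partial(nThreads);
    for (int t = 0; t < nThreads; ++t)
        partial[t] = std::async(std::launch::async, partialSum, std::cref(v),
                                t * sizePerThread, (t + 1) * sizePerThread - 1);

    myint tot = 0;
    for (auto& f : partial)
        tot += f.get();  // v must stay alive until every task has finished
    return tot;
}
```

For example, with `v` holding 1 to 1000 (filled like the question's loop), `parallelSumByRef(v, 10)` should return 500500.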
However, if I keep passing `v` rather than `std::ref(v)`, but change the function to

```cpp
myint partialSum(std::vector<myint> v, int start, int end)
```

the program also runs a lot quicker (and uses less memory). I don't understand why the const-reference implementation is slower. How does the compiler manage this without any references in place?
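One way to see what happens to the argument is a minimal sketch with a hypothetical instrumented type `Tracked` standing in for the vector, plus hypothetical helpers `copiesWhenPassingLvalue` and `copiesWhenPassingCref`. My understanding of the standard is that `std::async` decay-copies each argument once in the calling thread, whichever parameter type the function has; with a by-value parameter that stored copy is then moved (not copied again) into the parameter, while a `const&` parameter simply binds to the stored copy. The number of moves differs between standard library implementations, so only copies are counted exactly.

```cpp
#include <atomic>
#include <functional>  // std::cref
#include <future>

// Hypothetical instrumented type standing in for std::vector<myint>:
// it records how often it is copy- or move-constructed.
struct Tracked
{
    static std::atomic<int> copies;
    static std::atomic<int> moves;

    Tracked() = default;
    Tracked(const Tracked&) { ++copies; }
    Tracked(Tracked&&) noexcept { ++moves; }
};
std::atomic<int> Tracked::copies(0);
std::atomic<int> Tracked::moves(0);

// The two parameter styles from the question.
int byConstRef(const Tracked&) { return 1; }
int byValue(Tracked) { return 2; }

// Passes a plain lvalue (like `v` in the question) through std::async
// and returns how many copies were made on the way to f.
template <typename F>
int copiesWhenPassingLvalue(F f)
{
    Tracked obj;
    Tracked::copies = 0;
    Tracked::moves = 0;
    std::async(std::launch::async, f, obj).get();
    return Tracked::copies;
}

// Same, but wraps the argument with std::cref, as in the std::ref fix.
int copiesWhenPassingCref()
{
    Tracked obj;
    Tracked::copies = 0;
    Tracked::moves = 0;
    std::async(std::launch::async, byConstRef, std::cref(obj)).get();
    return Tracked::copies;
}
```

If that holds, both versions in the question make one full copy of `v` per task; what the sketch does not show is when each stored copy is released, which may be where the memory and timing difference comes from.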
Typical run times for me:

- `const std::vector<myint>&` parameter, passing `v`: about 6.2 seconds
- `std::vector<myint>` (by-value) parameter, passing `v`: about 3.0 seconds
- `const std::vector<myint>&` parameter, passing `std::ref(v)`: about 0.2 seconds
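To put "uses less memory" in perspective, a small compile-time sketch of the sizes involved, using hypothetical helpers `bytesPerCopy` and `bytesIfAllCopiesAlive`. It assumes `myint` is 8 bytes (typical on x86-64 and MSYS2 toolchains; the standard only requires `unsigned long long` to be at least 64 bits wide) and that, without `std::ref`, each task gets one full copy of `v`. Whether all copies are alive at the same moment is not established here, so the second figure is only an upper bound.

```cpp
typedef unsigned long long int myint;

// Bytes in one full copy of the question's vector:
// nThreads * sizePerThread elements of sizeof(myint) bytes each.
constexpr unsigned long long bytesPerCopy(unsigned long long nThreads,
                                          unsigned long long sizePerThread)
{
    return nThreads * sizePerThread * sizeof(myint);
}

// Upper bound if each of the nThreads tasks held its own copy simultaneously.
constexpr unsigned long long bytesIfAllCopiesAlive(unsigned long long nThreads,
                                                   unsigned long long sizePerThread)
{
    return nThreads * bytesPerCopy(nThreads, sizePerThread);
}

// With the question's values and an 8-byte myint this is
// 80,000,000 bytes (80 MB) per copy and 8,000,000,000 bytes (8 GB) in total.
static_assert(sizeof(myint) != 8 ||
              bytesIfAllCopiesAlive(100, 100000) == 8000000000ULL,
              "size arithmetic");
```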
I am compiling with `g++ -Wall -pedantic` (adding `-O3` demonstrates the same effect when passing just `v`), using:
```
g++ --version
g++ (Rev1, Built by MSYS2 project) 6.3.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```