Why is passing by const ref slower when using std::async

Question

As an exercise to learn about std::async I wrote a small program that calculates the sum of a large vector<int>, distributed about a lot of threads.

My code below is as follows

#include <iostream>
#include <vector>
#include <future>
#include <chrono>

typedef unsigned long long int myint;

// Calculate sum of part of the elements in a vector
myint partialSum(const std::vector<myint>& v, int start, int end)
{
    myint sum(0);
    for(int i=start; i<=end; ++i)
    {
        sum += v[i];
    }
    return sum;
}

int main()
{
    const int nThreads = 100;
    const int sizePerThread = 100000;
    const int vectorSize = nThreads * sizePerThread;
    
    std::vector<myint> v(vectorSize);   
    std::vector<std::future<myint>> partial(nThreads);
    myint tot = 0;
    
    // Fill vector  
    for(int i=0; i<vectorSize; ++i)
    {
        v[i] = i+1;
    }
    std::chrono::steady_clock::time_point startTime = std::chrono::steady_clock::now();
    
    // Start threads
    for( int t=0; t < nThreads; ++t)
    {
        partial[t] = std::async( std::launch::async, partialSum, v, t*sizePerThread, (t+1)*sizePerThread -1);               
    }
    
    // Sum total
    for( int t=0; t < nThreads; ++t)
    {
        myint ps = partial[t].get();
        std::cout << t << ":\t" << ps << std::endl;
        tot += ps;
    }
    std::cout << "Sum:\t" << tot << std::endl;
    
    std::chrono::steady_clock::time_point endTime = std::chrono::steady_clock::now();
    std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count() <<std::endl;
}

My question is concerned about the calls to the function partialSum, and then especially how the large vector is passed. The function is called as follows:

        partial[t] = std::async( std::launch::async, partialSum, v, t*sizePerThread, (t+1)*sizePerThread -1);

with the definition as follows

myint partialSum(const std::vector<myint>& v, int start, int end)

With this approach, the calculation is relatively slow. If I use std::ref(v) in the std::async function call, my function is a lot quicker and more efficient. This still makes sense to me.

However, if I still call by v, instead of std::ref(v), but replace the function with

myint partialSum(std::vector<myint> v, int start, int end)

the program also runs a lot quicker (and uses less memory). I don't understand why the const ref implementation is slower. How does the compiler fix this without any references in place?

With the const ref implementation this program typically takes 6.2 seconds to run, without 3.0. (Note that with const ref, and std::ref it runs in 0.2 seconds for me)

I am compiling with g++ -Wall -pedantic using (adding the -O3 when passing just v demonstrates the same effect)

g++ --version

g++ (Rev1, Built by MSYS2 project) 6.3.0 Copyright (C) 2016 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

You should not care too much about the performance of unoptimized code. First use the flag `-O3`, then decide if the performance is good enough and if it isn't use a profiler to figure out what needs improving. — nwp, Nov 22 '17 at 14:18
Without `std::ref` your argument is copied, regardless of what your function expects. — François Andrieux, Nov 22 '17 at 14:18
Possible duplicate of [Why does std::async copy its const & arguments?](https://stackoverflow.com/questions/14851163/why-does-stdasync-copy-its-const-arguments) — François Andrieux, Nov 22 '17 at 14:19
I've retracted my duplicate vote because I realize that the question isn't an exact duplicate. It doesn't answer why by-value appears faster than by-const-reference. — François Andrieux, Nov 22 '17 at 14:21
@nwp If I add the `-O3` flag, I obtain more or less the same differences. — Bernhard, Nov 22 '17 at 14:24
Can we get some hard numbers on "slow" and "fast"? Perhaps some statistical significance? — AndyG, Nov 22 '17 at 14:28
There are no measurements in the code. What and how do you measure? — Maxim Egorushkin, Nov 22 '17 at 16:07
@AndyG I now added the timing details. I already got a useful answer, but perhaps you find this interesting. — Bernhard, Nov 24 '17 at 06:47

score 6 · Accepted Answer · edited Jun 20 '20 at 09:12

The short story

given a copy and move-constructible type T

V f(T);
V g(T const&);

T t;
auto v = std::async(f,t).get();
auto v = std::async(g,t).get();

the only relevant difference concerning the two async calls is that, in the first one, t's copy is destroyed as soon as f returns; in the second, t's copy may be destroyed as per effect of the get() call. If the async calls happen in a loop with the future being get() later, the first will have constant memory on avarage (assuming a constant per-thread workload), the second a linearly growing memory at worst, resulting in more cache hits and worse allocation performance.

The long story

First of all, I can reproduce the observed slowdown (consistently ~2x in my system) on both gcc and clang; moreover, the same code with equivalent std::thread invocations does not manifest the same behavior, with the const& version turning out slightly faster as expected. Let's see why.

Firstly, the specification of async reads:

[futures.async] If launch::async is set in policy, calls INVOKE(DECAY_COPY(std::forward(f)), DECAY_COPY(std::forward(args))...) (23.14.3, 33.3.2.2) as if in a new thread of execution represented by a thread object with the calls to DECAY_COPY() being evaluated in the thread that called async[...]The thread object is stored in the shared state and affects the behavior of any asynchronous return objects that reference that state.

so, async will copy the arguments forwarding those copies to the callable, preserving rvalueness; in this regard, it's the same as of std::thread constructor, and there's no difference in the two OP versions, both copy the vector.

The difference is in the bold part: the thread object is part of the shared state and will not be freed until the latter gets released (eg. by a future::get() call).

Why is this important ? because the standard does not specify to who the decayed copies are bound, we only know that they must outlive the callable invocation, but we don't know if they will be destroyed immediately after the call or at thread exit or when the thread object is destroyed (along with the shared state).

In fact, it turns out that gcc and clang implementations store the decayed copies in the shared state of the resulting future.

Consequently, in the const& version the vector copy is stored in the shared state and destroyed at future::get: this results in the "Start threads" loop allocating a new vector at each step, with a linear growth of memory.

Conversely, in the by-value version the vector copy is moved in the callable argument and destroyed as soon as the callable returns; at future::get, a moved, empty vector will be destroyed. So, if the callable is fast enough to destroy the vector before a new one is created, the same vector will be allocated over and over and memory will stay almost constant. This will result in less cache hits and faster allocations, explaining the improved timings.

score 0 · Answer 2 · answered Nov 22 '17 at 14:49

As people said, without std::ref the object is being copied.

Now I believe that the reason that passing by value is actually faster might have something to do with the following question: Is it better in C++ to pass by value or pass by constant reference?

What might happen, that in the internal implementation of async, the vector being copied once to the new thread. And then internally being passed by reference, to a function which take ownership of the vector, which mean it will than be copied once again. On the other hand, if you pass it by value, it will copy it once to the new thread, but than will move it twice inside the new thread. Resulting in 2 copies if the object is being passed by reference, and 1 copy and 2 move in the second case if the object is passed by value.

Why is passing by const ref slower when using std::async

2 Answers2

The short story

The long story