
Suppose I have

std::vector<T1> vec1 {/* filled with T1's */};
std::vector<T2> vec2 {/* filled with T2's */};

and some function T1 f(T2) which could of course be a lambda. What is the optimal way to concatenate vec1 and vec2 whilst applying f to each T2 in vec2?

The apparently obvious solution is std::transform, i.e.

vec1.reserve(vec1.size() + vec2.size());
std::transform(vec2.begin(), vec2.end(), std::back_inserter(vec1), f);

but I say this is not optimal as std::back_inserter must make an unnecessary capacity check on each inserted element. What would be optimal is something like

vec1.insert(vec1.end(), vec2.begin(), vec2.end(), f);

which could get away with a single capacity check. Sadly, this is not valid C++. This is essentially the same reason std::vector::insert is optimal for plain vector concatenation; see this question and the comments on it for further discussion of this point.

So:

  1. Is std::transform the optimal method using the STL?
  2. If so, can we do better?
  3. Is there a good reason why the insert function described above was left out of the STL?

UPDATE

I've had a go at verifying whether the multiple capacity checks have any noticeable cost. To do this, I pass the identity function (f(x) = x) to the std::transform and push_back methods discussed in the answers. The full code is:

#include <iostream>
#include <vector>
#include <iterator>
#include <algorithm>
#include <cstdint>
#include <chrono>
#include <numeric>
#include <random>

using std::size_t;

std::vector<int> generate_random_ints(size_t n)
{
    std::default_random_engine generator;
    auto seed1 = std::chrono::system_clock::now().time_since_epoch().count();
    generator.seed((unsigned) seed1);
    std::uniform_int_distribution<int> uniform {};
    std::vector<int> v(n);
    std::generate_n(v.begin(), n, [&] () { return uniform(generator); });
    return v;
}

template <typename D=std::chrono::nanoseconds, typename F>
D benchmark(F f, unsigned num_tests)
{
    D total {0};
    for (unsigned i = 0; i < num_tests; ++i) {
        auto start = std::chrono::system_clock::now();
        f();
        auto end = std::chrono::system_clock::now();
        total += std::chrono::duration_cast<D>(end - start);
    }
    return D {total / num_tests};
}

template <typename T>
void std_insert(std::vector<T> vec1, const std::vector<T> &vec2)
{
    vec1.insert(vec1.end(), vec2.begin(), vec2.end());
}

template <typename T1, typename T2, typename UnaryOperation>
void push_back_concat(std::vector<T1> vec1, const std::vector<T2> &vec2, UnaryOperation op)
{
    vec1.reserve(vec1.size() + vec2.size());
    for (const auto& x : vec2) {
        vec1.push_back(op(x));
    }
}

template <typename T1, typename T2, typename UnaryOperation>
void transform_concat(std::vector<T1> vec1, const std::vector<T2> &vec2, UnaryOperation op)
{
    vec1.reserve(vec1.size() + vec2.size());
    std::transform(vec2.begin(), vec2.end(), std::back_inserter(vec1), op);
}

int main(int argc, char **argv)
{
    unsigned num_tests {1000};
    size_t vec1_size {10000000};
    size_t vec2_size {10000000};

    auto vec1 = generate_random_ints(vec1_size);
    auto vec2 = generate_random_ints(vec2_size);

    auto f_std_insert = [&vec1, &vec2] () {
        std_insert(vec1, vec2);
    };
    auto f_push_back_id = [&vec1, &vec2] () {
        push_back_concat(vec1, vec2, [] (int i) { return i; });
    };
    auto f_transform_id = [&vec1, &vec2] () {
        transform_concat(vec1, vec2, [] (int i) { return i; });
    };

    auto std_insert_time   = benchmark<std::chrono::milliseconds>(f_std_insert, num_tests).count();
    auto push_back_id_time = benchmark<std::chrono::milliseconds>(f_push_back_id, num_tests).count();
    auto transform_id_time = benchmark<std::chrono::milliseconds>(f_transform_id, num_tests).count();

    std::cout << "std_insert: " << std_insert_time << "ms" << std::endl;
    std::cout << "push_back_id: " << push_back_id_time << "ms" << std::endl;
    std::cout << "transform_id: " << transform_id_time << "ms" << std::endl;

    return 0;
}

Compiled with:

g++ vector_insert_demo.cpp -std=c++11 -O3 -o vector_insert_demo

Output:

std_insert: 44ms
push_back_id: 61ms
transform_id: 61ms

The compiler will have inlined the lambda, so its cost can safely be discounted. Unless anyone has another viable explanation for these results (or is willing to check the assembly), I think it's reasonable to conclude that the multiple capacity checks do have a noticeable cost.

Daniel
  • I do not quite believe that the multiple capacity checks are that bad. Branch prediction is strong. – Baum mit Augen Feb 16 '15 at 15:50
  • You can define a transforming iterator if you want to use insert directly. – zch Feb 16 '15 at 15:53
  • `but I say this is not optimal` Have you actually measured this with an optimized program? A lot of C++ that looks "wordy" is often mistaken as being "slow". – PaulMcKenzie Feb 16 '15 at 15:59
  • It always surprises me, but depending on how cheap `T1` is to default-construct, you may find it is actually faster to `resize` `vec1` rather than `reserve` and then use `std::transform` without the `std::back_inserter`. But you will have to measure it. – Chris Drew Feb 16 '15 at 16:08
  • @PaulMcKenzie I can't benchmark against a solution I don't know. And I didn't say the transform method was slow. – Daniel Feb 16 '15 at 16:47
  • @Dmitry Ledentsov this is not a duplicate of that question, have you actually read my question? – Daniel Feb 16 '15 at 16:49
  • Did you deduce *by measurement* that this wasn't fast enough? Both methods have the same amortized time complexity, and if the vectors are big, it probably won't make much of a difference. (And if the vectors are so small that micro-optimization constant factors like this make a significant difference, then you shouldn't care anyway.) – The Paramagnetic Croissant Feb 16 '15 at 17:13
  • Why speed up something that is not a bottleneck in your application? – n. m. could be an AI Feb 16 '15 at 17:14
  • @Daniel - Honestly, it looks like you're going through "analysis by paralysis". The code you have is not a bottleneck, so don't fret over it. – PaulMcKenzie Feb 16 '15 at 17:15
  • You ask for the optimal solution? Well, that's likely refactoring your code so that you don't need this concatenation in the first place. Honestly, you are fretting about a check that is *always* going to succeed, your CPU *will* realize that and reduce the cost of that condition to a few cycles. Even if the involved types are as simple as `long long`, the condition check will likely take less time than the copy for a long vector. – cmaster - reinstate monica Feb 16 '15 at 20:34
  • I'm not at all arguing this is going to make a substantial difference, but the fact remains that a small overhead exists that could easily be avoided. Paraphrasing Scott Meyers in his recent EMC++ book when he considered the additional cost of pass-by-value compared to universal references (minimum 1 move) - you may as well use Python if we aren't at least sympathetic to performance freaks ;) – Daniel Feb 16 '15 at 23:49
  • Also, while this may not seem an issue for concatenation, if the more general problem of insertion is considered then the [costs involved are not so obvious](http://stackoverflow.com/questions/28551380/inserting-a-vector-transformation/28551784#28551784). – Daniel Feb 16 '15 at 23:52
  • @DmitryLedentsov Could you please remove your duplicate flag, this is clearly not the same question? – Daniel Feb 17 '15 at 14:03
  • @Daniel [sorry](http://meta.stackexchange.com/questions/87500/cancel-misclicked-flags), I don't think there's such a feature. The comment is now removed. I've read your question, and apart from a general advice on profiling I thought, you're searching for a "canonical" way. – Dmitry Ledentsov Feb 17 '15 at 15:24
  • @cmaster My results suggest otherwise (see question update). – Daniel Feb 18 '15 at 00:09
  • @Daniel Your results are in perfect accordance with my prediction: The overhead of the check (17ms) is less than the time to copy the data (44ms). Assuming a 3GHz CPU, that means you need 13.2 cycles for the copy of an `int` (many cache misses!) and 5.1 cycles for the check (which is indeed more than I anticipated). The larger the type, the higher the cost of copying becomes while the cost for the check remains constant. – cmaster - reinstate monica Feb 18 '15 at 10:20

3 Answers


UPDATE: The performance difference is due to the reserve() calls, which, in libstdc++ at least, set the capacity to exactly what you request instead of using the exponential growth factor.


I did some timing tests, with interesting results. Using std::vector::insert along with boost::transform_iterator was the fastest way I found by a large margin:

Version 1:

void
  appendTransformed1(
    std::vector<int> &vec1,
    const std::vector<float> &vec2
  )
{
  auto v2begin = boost::make_transform_iterator(vec2.begin(),f);
  auto v2end   = boost::make_transform_iterator(vec2.end(),f);
  vec1.insert(vec1.end(),v2begin,v2end);
}

Version 2:

void
  appendTransformed2(
    std::vector<int> &vec1,
    const std::vector<float> &vec2
  )
{
  vec1.reserve(vec1.size()+vec2.size());
  for (auto x : vec2) {
    vec1.push_back(f(x));
  }
}

Version 3:

void
  appendTransformed3(
    std::vector<int> &vec1,
    const std::vector<float> &vec2
  )
{
  vec1.reserve(vec1.size()+vec2.size());
  std::transform(vec2.begin(),vec2.end(),std::inserter(vec1,vec1.end()),f);
}

Timing:

    Version 1: 0.59s
    Version 2: 8.22s
    Version 3: 8.42s

main.cpp:

#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>
#include "appendtransformed.hpp"

using std::cerr;

template <typename Engine>
static std::vector<int> randomInts(Engine &engine,size_t n)
{
  auto distribution = std::uniform_int_distribution<int>(0,999);
  auto generator = [&]{return distribution(engine);};
  auto vec = std::vector<int>();
  std::generate_n(std::inserter(vec,vec.end()),n,generator);
  return vec;
}

template <typename Engine>
static std::vector<float> randomFloats(Engine &engine,size_t n)
{
  auto distribution = std::uniform_real_distribution<float>(0,1000);
  auto generator = [&]{return distribution(engine);};
  auto vec = std::vector<float>();
  std::generate_n(std::inserter(vec,vec.end()),n,generator);
  return vec;
}

static auto
  appendTransformedFunction(int version) ->
    void(*)(std::vector<int>&,const std::vector<float> &)
{
  switch (version) {
    case 1: return appendTransformed1;
    case 2: return appendTransformed2;
    case 3: return appendTransformed3;
    default:
      cerr << "Unknown version: " << version << "\n";
      exit(EXIT_FAILURE);
  }

  return 0;
}

int main(int argc,char **argv)
{
  if (argc!=2) {
    cerr << "Usage: appendtest (1|2|3)\n";
    exit(EXIT_FAILURE);
  }
  auto version = atoi(argv[1]);
  auto engine = std::default_random_engine();
  auto vec1_size = 1000000u;
  auto vec2_size = 1000000u;
  auto count = 100;
  auto vec1 = randomInts(engine,vec1_size);
  auto vec2 = randomFloats(engine,vec2_size);
  namespace chrono = std::chrono;
  using chrono::system_clock;
  auto appendTransformed = appendTransformedFunction(version);
  auto start_time = system_clock::now();
  for (auto i=0; i!=count; ++i) {
    appendTransformed(vec1,vec2);
  }
  auto end_time = system_clock::now();
  assert(vec1.size() == vec1_size+count*vec2_size);
  auto sum = std::accumulate(vec1.begin(),vec1.end(),0u);
  auto elapsed_seconds = chrono::duration<float>(end_time-start_time).count();

  cerr << "Using version " << version << ":\n";
  cerr << "  sum=" << sum << "\n";
  cerr << "  elapsed: " << elapsed_seconds << "s\n";
}

Compiler: g++ 4.9.1

Options: -std=c++11 -O2

Vaughn Cato
  • wow, I did not expect that.. any idea how `boost::transform_iterator` works? By the way, you may have also answered [my more general question](http://stackoverflow.com/questions/28551380/inserting-a-vector-transformation/28559270#28559270). I'm currently battling to install boost on my system; once I have it I'll run some numbers myself to confirm your results. – Daniel Feb 17 '15 at 10:18
  • I can't replicate your results (see my answer). – Daniel Feb 17 '15 at 12:15
  • @Daniel: Here's a full example on ideone: https://ideone.com/i9R2Mp . I took out the command-line parameters and made the version number be a variable at the top of main. I needed to reduce the number of iterations to have it run within the time limit. The difference is less dramatic using fewer iterations. – Vaughn Cato Feb 17 '15 at 14:39
  • @Daniel: I did some profiling, which revealed that most of the time in version 1 and 2 was in the `reserve()` calls. When I take out the `reserve()` calls, the results are very different. The `push_back` version is now the fastest. – Vaughn Cato Feb 17 '15 at 14:52
  • @Daniel: Apparently the libstdc++ version of `std::vector::reserve()` allocates exactly the number of elements requested instead of using the exponential growth factor, so the vector elements were being copied on every call in version 2 and 3. I think this is reasonable behavior, but it does mean you have to be careful if you use `reserve()` multiple times on the same vector. – Vaughn Cato Feb 17 '15 at 15:09
  • that seems reasonable. I think the fairer test is to keep the vector sizes constant in any case - keeping the test cases as i.i.d. as possible - then even 100 iterations should give an approximation very close to the true mean. – Daniel Feb 17 '15 at 16:18
  • What if you time again, using `vec1.reserve(std::max(vec1.size()+vec2.size(), vec1.capacity()));` instead of `vec1.reserve(vec1.size()+vec2.size());`? That'll keep `reserve` from eagerly copying all the elements of `vec1` up front. – Quuxplusone Feb 17 '15 at 23:10
  1. Is std::transform the optimal method using the STL?

I can't say. If you reserve space, the difference should be negligible, as the check might be optimized out by either the compiler or the CPU. The only way to find out is to measure your real code.
If you don't have a particular need, you should go for std::transform.

  2. If so, can we do better?

What you want to have:

  • Reduce length checks
  • Take advantage of move semantics when pushing back

You might also want to create a binary function, if needed.

template <typename InputIt, typename OutputIt, typename UnaryCallable>
void move_append(InputIt first, InputIt last, OutputIt firstOut, OutputIt lastOut, UnaryCallable fn)
{
    // Bail out if the input range won't fit in the output range.
    if (std::distance(first, last) > std::distance(firstOut, lastOut))
        return;

    while (first != last && firstOut != lastOut) {
        *firstOut++ = std::move(fn(*first++));
    }
}

a call could be:

std::vector<T1> vec1 {/* filled with T1's */};
std::vector<T2> vec2 {/* filled with T2's */};
// ...
auto old_size = vec1.size();
vec1.resize( vec1.size() + vec2.size() );
move_append( vec2.begin(), vec2.end(), vec1.begin() + old_size, vec1.end(), f );

I'm not sure you can do this with plain algorithms, because back_inserter calls Container::push_back, which checks for reallocation in any case. Also, the elements won't benefit from move semantics.

Note: whether the safety check is appropriate depends on how you pass the elements to append; arguably the function should also return a bool to report failure instead of silently returning.

Some measurements here. I can't explain that big discrepancy.

edmz
  • You're right about the call to `push_back` which would cause a capacity check, but it's possible to move, e.g. `std::transform(std::make_move_iterator(vec2.begin()), std::make_move_iterator(vec2.end()), std::back_inserter(vec1), f);` – Daniel Feb 16 '15 at 23:34
  • Also note your solution does not increase the size of the vector being appended to, so there's a hidden cost of a `resize`. – Daniel Feb 16 '15 at 23:40
  • Yes @Daniel you can move but not avoid the push_back check. Also I do call reserve before appending. – edmz Feb 17 '15 at 09:13
  • you cannot use reserve with your function `move_append`, it will not change the size of the vector, consider `std::vector<int> v {1, 2, 3}; v.reserve(5); v[4] = 4; std::cout << v.size() << std::endl;` This will print `3`. So you either need to `resize`, or use `emplace_back`. – Daniel Feb 17 '15 at 10:10
  • see [this](http://stackoverflow.com/questions/13029299/stdvectorresize-vs-stdvectorreserve) question for `reserve` vs `resize`. – Daniel Feb 17 '15 at 10:13

I do not get the same results as @VaughnCato - although I do a slightly different test of std::string to int. According to my tests the push_back and std::transform methods are equally good, while the boost::transform method is slightly worse. Here is my full code:

UPDATE

I've included another test case that, instead of using reserve and back_inserter, just uses resize. This is essentially the same method as in @black's answer, and also the method suggested by @ChrisDrew in the question comments. I also performed the test 'both ways', that is, std::string -> int and int -> std::string.

#include <iostream>
#include <vector>
#include <iterator>
#include <algorithm>
#include <cstdint>
#include <chrono>
#include <numeric>
#include <random>
#include <boost/iterator/transform_iterator.hpp>

using std::size_t;

std::vector<int> generate_random_ints(size_t n)
{
    std::default_random_engine generator;
    auto seed1 = std::chrono::system_clock::now().time_since_epoch().count();
    generator.seed((unsigned) seed1);
    std::uniform_int_distribution<int> uniform {};
    std::vector<int> v(n);
    std::generate_n(v.begin(), n, [&] () { return uniform(generator); });
    return v;
}

std::vector<std::string> generate_random_strings(size_t n)
{
    std::default_random_engine generator;
    auto seed1 = std::chrono::system_clock::now().time_since_epoch().count();
    generator.seed((unsigned) seed1);
    std::uniform_int_distribution<int> uniform {};
    std::vector<std::string> v(n);
    std::generate_n(v.begin(), n, [&] () { return std::to_string(uniform(generator)); });
    return v;
}

template <typename D=std::chrono::nanoseconds, typename F>
D benchmark(F f, unsigned num_tests)
{
    D total {0};
    for (unsigned i = 0; i < num_tests; ++i) {
        auto start = std::chrono::system_clock::now();
        f();
        auto end = std::chrono::system_clock::now();
        total += std::chrono::duration_cast<D>(end - start);
    }
    return D {total / num_tests};
}

template <typename T1, typename T2, typename UnaryOperation>
void push_back_concat(std::vector<T1> vec1, const std::vector<T2> &vec2, UnaryOperation op)
{
    vec1.reserve(vec1.size() + vec2.size());
    for (const auto& x : vec2) {
        vec1.push_back(op(x));
    }
}

template <typename T1, typename T2, typename UnaryOperation>
void transform_concat_reserve(std::vector<T1> vec1, const std::vector<T2> &vec2, UnaryOperation op)
{
    vec1.reserve(vec1.size() + vec2.size());
    std::transform(vec2.begin(), vec2.end(), std::back_inserter(vec1), op);
}

template <typename T1, typename T2, typename UnaryOperation>
void transform_concat_resize(std::vector<T1> vec1, const std::vector<T2> &vec2, UnaryOperation op)
{
    auto vec1_size = vec1.size();
    vec1.resize(vec1.size() + vec2.size());
    std::transform(vec2.begin(), vec2.end(), vec1.begin() + vec1_size, op);
}

template <typename T1, typename T2, typename UnaryOperation>
void boost_transform_concat(std::vector<T1> vec1, const std::vector<T2> &vec2, UnaryOperation op)
{
    auto v2_begin = boost::make_transform_iterator(vec2.begin(), op);
    auto v2_end   = boost::make_transform_iterator(vec2.end(), op);
    vec1.insert(vec1.end(), v2_begin, v2_end);
}

int main(int argc, char **argv)
{
    unsigned num_tests {1000};
    size_t vec1_size {1000000};
    size_t vec2_size {1000000};

    // Switch the variable names to inverse test
    auto vec1 = generate_random_ints(vec1_size);
    auto vec2 = generate_random_strings(vec2_size);

    auto op = [] (const std::string& str) { return std::stoi(str); };
    //auto op = [] (int i) { return std::to_string(i); };

    auto f_push_back_concat = [&vec1, &vec2, &op] () {
        push_back_concat(vec1, vec2, op);
    };
    auto f_transform_concat_reserve = [&vec1, &vec2, &op] () {
        transform_concat_reserve(vec1, vec2, op);
    };
    auto f_transform_concat_resize = [&vec1, &vec2, &op] () {
        transform_concat_resize(vec1, vec2, op);
    };
    auto f_boost_transform_concat = [&vec1, &vec2, &op] () {
        boost_transform_concat(vec1, vec2, op);
    };

    auto push_back_concat_time = benchmark<std::chrono::milliseconds>(f_push_back_concat, num_tests).count();
    auto transform_concat_reserve_time = benchmark<std::chrono::milliseconds>(f_transform_concat_reserve, num_tests).count();
    auto transform_concat_resize_time = benchmark<std::chrono::milliseconds>(f_transform_concat_resize, num_tests).count();
    auto boost_transform_concat_time = benchmark<std::chrono::milliseconds>(f_boost_transform_concat, num_tests).count();

    std::cout << "push_back: " << push_back_concat_time << "ms" << std::endl;
    std::cout << "transform_reserve: " << transform_concat_reserve_time << "ms" << std::endl;
    std::cout << "transform_resize: " << transform_concat_resize_time << "ms" << std::endl;
    std::cout << "boost_transform: " << boost_transform_concat_time << "ms" << std::endl;

    return 0;
}

Compiled using:

g++ vector_concat.cpp -std=c++11 -O3 -o vector_concat_test

The results (mean user times) are:

|          Method          | std::string -> int (ms) | int -> std::string (ms) |
|:------------------------:|:-----------------------:|:-----------------------:|
| push_back                |            68           |           206           |
| std::transform (reserve) |            68           |           202           |
| std::transform (resize)  |            67           |           218           |
| boost::transform         |            70           |           238           |

PROVISIONAL CONCLUSION

  • The std::transform method using resize is likely optimal (using the STL) for types that are cheap to default-construct.
  • The std::transform method using reserve and back_inserter is most likely the best we can do otherwise.
Daniel
  • FWIW I did reproduce VaughnCato's results using his benchmark code, I'm not sure why this benchmark gives different results. One difference I can see is VaughnCato passed `vec1` by reference which means it grows continually during the test but you pass it by value so you get a new copy every test. Perhaps that means the compiler can more easily optimize out the capacity checks. – Chris Drew Feb 17 '15 at 13:16
  • Would you like to test my version as well? – edmz Feb 17 '15 at 15:33
  • @black done; note your answer is now (after the correction) equivalent to the method suggested by ChrisDrew in the question comments. – Daniel Feb 17 '15 at 16:53
  • @ChrisDrew Good spot, I don't believe that is a fair test. Also, I've included a test to cover your suggestion in the question comments - you were right ;) – Daniel Feb 17 '15 at 17:06
  • I don't think there is anything "unfair" about VaughnCato's test. It is just testing something different. It just goes to show it is worth measuring for your particular use case. – Chris Drew Feb 17 '15 at 17:19
  • @ChrisDrew I mean unfair statistically, the idea of repeated-measure tests is to try and get an unbiased estimate of the true mean by generating multiple i.i.d samples. This is not the case in VaughnCato's test. – Daniel Feb 17 '15 at 17:24