7

I'm reading Scott's book effective modern c++. In item 26, there's an example that I wrote on Wandbox: https://wandbox.org/permlink/6DKoDqg4jAjA9ZTB

I want to verify how much the good code is better than the bad one. However, the performance comparison is not what I expected, even the good one is slower that the bad one. I don't know what's going wrong.

To prevent wandbox's code from disappearing, here's the code:

#include <iostream>
#include <chrono>
#include <cstdlib>
#include <set>
#include <string>

using namespace std;

std::multiset<std::string> names;

void bad_logAndAdd(const std::string& name) {
    auto now = std::chrono::system_clock::now();
    names.emplace(name);
}
template <typename T>
void good_logAndAdd(T&& name) {
    auto now = std::chrono::system_clock::now();
    names.emplace(std::forward<T>(name));
}

void bad() {
    for (int i=0; i<1000000; i++) {
        string petname("cs");
        bad_logAndAdd(petname);
        bad_logAndAdd(std::string("abc")); 
        bad_logAndAdd("dog"); 
    }

}

void good() {
    for (int i=0; i<1000000; i++) {
        string petname("cs");
        good_logAndAdd(petname); 
        good_logAndAdd(std::string("abc")); 
        good_logAndAdd("dog");
    }

}

int main()
{
    auto begin = std::chrono::high_resolution_clock::now();
    bad();
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end-begin).count() << std::endl;

    auto begin2 = std::chrono::high_resolution_clock::now();
    good();
    auto end2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end2-begin2).count() << std::endl;
}
Lewis Chan
  • 695
  • 6
  • 18
  • My BAD. But it seems that no big improvement even when clearing the `names` – Lewis Chan Aug 26 '19 at 11:38
  • Is this a copy/paste of the example or did you make some small changes? – Holt Aug 26 '19 at 11:39
  • Very alike the original one. I just add the loop for some kind of benchmarking. – Lewis Chan Aug 26 '19 at 11:40
  • 2
    Your strings are very small, use very loooooooong ones! – frogatto Aug 26 '19 at 11:42
  • @HiI'mFrogatto I came to the same conclusion - it is because the strings/data are very small – darune Aug 26 '19 at 11:45
  • As @HiI'mFrogatto, you string are small, unlike the ones in the book, so [SSO](https://stackoverflow.com/questions/21694302/what-are-the-mechanics-of-short-string-optimization-in-libc) would kick in, making the move basically useless. But this actually does not really change the outcome. The compiler could be inlining both calls, making the forward version not that much efficient, and performs other stuff, that would make both version quite similar in the end. – Holt Aug 26 '19 at 11:48
  • @Holt You're right. I tried make the string bigger, and the outcomes are still same. – Lewis Chan Aug 26 '19 at 11:55
  • So how can we see the real performance differences, given what Scott said ? – Lewis Chan Aug 26 '19 at 12:05
  • The performance is about the same for both, on my machine. About a ¼% faster for the "good" one. But the test isn't very interesting, since the test doesn't leverage the advantage that forwarding provides (and the call site isn't `std::move` the named strings), and the bad code path is efficient (because const reference). – Eljay Aug 26 '19 at 12:37
  • The root of your problem is that you use the same std::multiset names; for both bad and good tests. std::multiset insertion becomes more slowly after iterations. To learn more, I just change order of test (good before bad) and get the same durations as if i do bad before good. So split it as std::multiset namesForBad, namesForGood; With my environment, I get something like 3141 for bad and 3045 for good!!. – Ratah Aug 26 '19 at 13:25

1 Answers1

4

There are several things that make the difference of behaviour hardly perceptible.

  1. a lot of the time is spent in auto now = std::chrono::system_clock::now();,
  2. a lot of time is spent in memory allocation (names are all stored),
  3. amongst the three calls to logAndAdd(), only two of them can benefit from move (petname is not implicitly moveable),
  4. small strings won't benefit from move.

I have tried to rewrite (below) the OP's code in order to minimise the influence of these things.

  1. remove the call to now() at each insertion,
  2. prefer many medium size allocations to a huge one and consider a warmup in order to avoid the penalty of the first allocations,
  3. only use the calls that could benefit from the copy/move difference,
  4. make the strings longer in order to make their copy expensive.

In these conditions, I obtain a visible difference between the two solutions (gcc 9.1, linux).

bad: 1.23
good: 1.01

I hope the resulting code is not too far from what was intended in the question.

#include <iostream>
#include <chrono>
#include <cstdlib>
#include <set>
#include <string>

std::multiset<std::string> bad_names;
std::multiset<std::string> good_names;

void bad_logAndAdd(const std::string& name) {
    // auto now = std::chrono::system_clock::now();
    bad_names.emplace(name);
}
template <typename T>
void good_logAndAdd(T&& name) {
    // auto now = std::chrono::system_clock::now();
    good_names.emplace(std::forward<T>(name));
}

void bad() {
    for (int i=0; i<2000; i++) {
        // std::string petname("cs_that_have_been_made_much_longer_in_order_to_prevent_small_string_optimisation_that_makes_copy_and_move_very_similar");
        // bad_logAndAdd(petname);
        bad_logAndAdd(std::string("abc_that_have_been_made_much_longer_in_order_to_prevent_small_string_optimisation_that_makes_copy_and_move_very_similar")); 
        bad_logAndAdd("dog_that_have_been_made_much_longer_in_order_to_prevent_small_string_optimisation_that_makes_copy_and_move_very_similar"); 
    }

}

void good() {
    for (int i=0; i<2000; i++) {
        // std::string petname("cs_that_have_been_made_much_longer_in_order_to_prevent_small_string_optimisation_that_makes_copy_and_move_very_similar");
        // good_logAndAdd(petname); 
        good_logAndAdd(std::string("abc_that_have_been_made_much_longer_in_order_to_prevent_small_string_optimisation_that_makes_copy_and_move_very_similar")); 
        good_logAndAdd("dog_that_have_been_made_much_longer_in_order_to_prevent_small_string_optimisation_that_makes_copy_and_move_very_similar");
    }

}

int main()
{
    auto bad_time=std::chrono::high_resolution_clock::duration{};
    auto good_time=std::chrono::high_resolution_clock::duration{};
    for(auto iter=0; iter<1000; ++iter)
    {
      bad_names={};
      auto begin = std::chrono::high_resolution_clock::now();
      bad();
      auto end = std::chrono::high_resolution_clock::now();
      good_names={};
      auto begin2 = std::chrono::high_resolution_clock::now();
      good();
      auto end2 = std::chrono::high_resolution_clock::now();
      if(iter!=0) // ignore warmup
      {
        bad_time+=end-begin;
        good_time+=end2-begin2;
      }
    }
    std::cout << "bad: " << 1e-3*double(std::chrono::duration_cast<std::chrono::milliseconds>(bad_time).count()) << '\n';
    std::cout << "good: " << 1e-3*double(std::chrono::duration_cast<std::chrono::milliseconds>(good_time).count()) << '\n';
    return 0;
}
prog-fh
  • 13,492
  • 1
  • 15
  • 30
  • Your `bad_time` and `good_time` are quite ugly... Why not use `std::chrono::high_resolution_clock::duration` and simply use `.count()` at the end for displaying? Even if this is not the core of this answer, use `std::chrono` but handling durations as `double` is simply contradicting the purpose of `std::chrono`. – Holt Aug 26 '19 at 13:20
  • I'd thought the difference would be big, bigger than your result (Yours is just 1/6 more efficient, I think), especially under stress. Have we forgot something or it's just the fact that the good version is just a little better than the bad one ? – Lewis Chan Aug 26 '19 at 14:36
  • @LewisChan 18% faster code is pretty huge in the world of optimising, especially for a micro-optimization, non-algorithm related. – bolov Aug 26 '19 at 15:36
  • @LewisChan I think that, even in my modified version, the stress is more on the allocator used internally by `std::multiset` than on parameter passing. – prog-fh Aug 26 '19 at 15:37