4

I often read that unique_ptr is preferred in most situations over shared_ptr, because unique_ptr is non-copyable and has move semantics, whereas shared_ptr adds overhead due to copying and ref-counting.

But when I test unique_ptr in some situations, it appears noticeably slower (in access) than its counterpart.

For example, under GCC 4.5:

Edit: the print method doesn't actually print anything.

#include <iostream>
#include <string>
#include <memory>
#include <chrono>
#include <vector>

using namespace std;
class Print {
public:
  void print() {}
};

void test()
{
 typedef vector<shared_ptr<Print>> sh_vec;
 typedef vector<unique_ptr<Print>> u_vec;

 sh_vec shvec;
 u_vec  uvec;

 //can't use initializer_list with unique_ptr
 for (int var = 0; var < 100; ++var) {

    shared_ptr<Print> p(new Print());
    shvec.push_back(p);

    unique_ptr<Print> p1(new Print());
    uvec.push_back(move(p1));

  }

 //-------------test shared_ptr-------------------------
 auto time_sh_1 = std::chrono::system_clock::now();

 for (auto var = 0; var < 1000; ++var) 
 {
   for(auto it = shvec.begin(), end = shvec.end(); it!= end; ++it)
   {
     (*it)->print();
   }
 }

 auto time_sh_2 = std::chrono::system_clock::now();

 cout << "test shared_ptr : "
      << std::chrono::duration_cast<std::chrono::microseconds>(time_sh_2 - time_sh_1).count()
      << " microseconds." << endl;

 //-------------test unique_ptr-------------------------
 auto time_u_1 = std::chrono::system_clock::now();

 for (auto var = 0; var < 1000; ++var) 
 {
   for(auto it = uvec.begin(), end = uvec.end(); it!= end; ++it)
   {
     (*it)->print();
   }
 }

 auto time_u_2 = std::chrono::system_clock::now();

 cout << "test unique_ptr : "
      << std::chrono::duration_cast<std::chrono::microseconds>(time_u_2 - time_u_1).count()
      << " microseconds." << endl;

}

int main() { test(); }

On average I get (g++ -O0):

  • shared_ptr : 1480 microseconds
  • unique_ptr : 3350 microseconds

Where does the difference come from? Is it explainable?

codablank1
  • What compiler flags are you using? And ... what does gprof show? – Useless Nov 15 '11 at 15:00
  • Are you compiling with or without optimizations? Profiling without optimization is useless. – Luchian Grigore Nov 15 '11 at 15:02
  • It's not noticeably different here: http://www.ideone.com/hmRK4 – R. Martinho Fernandes Nov 15 '11 at 15:07
  • If you compile with -O2, the timings reverse with gcc 4.6. (Though with optimization, bump up the number of times the loop is run by a factor of 100 or so, so you at least measure more than the jitter of the OS scheduler.) – nos Nov 15 '11 at 15:16
  • The main difference between the two is that `unique_ptr` doesn't perform any dynamic allocations, while `shared_ptr` does (in the way you use it). – Kerrek SB Nov 15 '11 at 15:17
  • Cannot reproduce. With optimizations, the program does nothing. I added a volatile int member increment to the `print()` function, and the `unique_ptr` performs better consistently now. – Kerrek SB Nov 15 '11 at 15:23
  • @nos Indeed, with -O2 optimization both tests last 1 microsecond. – codablank1 Nov 15 '11 at 15:27
  • @codablank1 - I suspect your timing function isn't great once you get down to that kind of magnitude. – Flexo Nov 15 '11 at 15:30
  • @codablank1: If you're benchmarking something on the order of "microseconds", then you probably aren't doing enough work to register. And you should *never* benchmark debug code; always benchmark with the optimizations you plan to use. Otherwise, it's not a legit comparison. – Nicol Bolas Nov 15 '11 at 19:31
  • Talking about performance with the `-O0` flag is simply funny, nonsensical and meaningless. – eonil Dec 17 '13 at 03:22

3 Answers

22

UPDATED on Jan 01, 2014

I know this question is pretty old, but the results are still valid on G++ 4.7.0 and libstdc++ 4.7. So, I tried to find out the reason.

What you're benchmarking here is dereferencing performance at -O0 and, looking at the implementations of unique_ptr and shared_ptr, your results are actually correct.

unique_ptr stores the pointer and the deleter in a ::std::tuple, while shared_ptr stores the naked pointer directly. So, when you dereference the pointer (using *, ->, or get()) you have an extra call to ::std::get<0>() in unique_ptr, whereas shared_ptr returns the pointer directly. On gcc-4.7, even when optimized and inlined, ::std::get<0>() is a bit slower than the direct pointer. When optimized and inlined, gcc-4.8.1 fully omits the overhead of ::std::get<0>(): on my machine, when compiled with -O3, the compiler generates exactly the same assembly code for both, which means they are literally the same.

All in all, using the current implementation, shared_ptr is slower on creation, moving, copying and reference counting, but equally fast on dereferencing.

NOTE: print() is empty in the question, so the compiler removes the loops entirely when optimizing. I slightly changed the code to observe the optimization results correctly:

#include <iostream>
#include <string>
#include <memory>
#include <chrono>
#include <vector>

using namespace std;

class Print {
 public:
  void print() { i++; }

  int i{ 0 };
};

void test() {
  typedef vector<shared_ptr<Print>> sh_vec;
  typedef vector<unique_ptr<Print>> u_vec;

  sh_vec shvec;
  u_vec uvec;

  // can't use initializer_list with unique_ptr
  for (int var = 0; var < 100; ++var) {
    shvec.push_back(make_shared<Print>());
    uvec.emplace_back(new Print());
  }

  //-------------test shared_ptr-------------------------
  auto time_sh_1 = std::chrono::system_clock::now();

  for (auto var = 0; var < 1000; ++var) {
    for (auto it = shvec.begin(), end = shvec.end(); it != end; ++it) {
      (*it)->print();
    }
  }

  auto time_sh_2 = std::chrono::system_clock::now();

  cout << "test shared_ptr : "
       << std::chrono::duration_cast<std::chrono::microseconds>(time_sh_2 - time_sh_1).count()
       << " microseconds." << endl;

  //-------------test unique_ptr-------------------------
  auto time_u_1 = std::chrono::system_clock::now();

  for (auto var = 0; var < 1000; ++var) {
    for (auto it = uvec.begin(), end = uvec.end(); it != end; ++it) {
      (*it)->print();
    }
  }

  auto time_u_2 = std::chrono::system_clock::now();

  cout << "test unique_ptr : "
       << std::chrono::duration_cast<std::chrono::microseconds>(time_u_2 - time_u_1).count()
       << " microseconds." << endl;
}

int main() { test(); }

NOTE: This is not a fundamental problem and could easily be fixed by discarding the use of ::std::tuple in the current libstdc++ implementation.

  • I am not convinced by your ::std::get<0>() argument. The compiler has all the information at hand to completely eliminate everything but the actual dereferencing. Whether it does so in practice may be a different question... – ingomueller.net May 07 '13 at 07:11
  • This is bollocks. It can be easily fixed by enabling optimisations in the compiler. The library has no reason to change behaviour. People should stop complaining that their unoptimised code is slow. – R. Martinho Fernandes Aug 28 '13 at 12:40
  • @ingomueller.net On gcc-4.8.1 (unlike 4.7.0), the compiler omits all overhead of std::get and generates exactly the same assembly code for both cases. I updated the answer to reflect that. – Soheil Hassas Yeganeh Jan 01 '14 at 23:59
13

All you did in the timed blocks is access the pointers. That won't involve any additional overhead at all. The increased time probably comes from the console output scrolling. You can never, ever do I/O in a timed benchmark.

And if you want to test the overhead of ref counting, then actually do some ref counting. How are the costs of construction, destruction, assignment and other mutating operations of shared_ptr going to factor into your timings at all if you never mutate a shared_ptr?

Edit: If there's no I/O then where are the compiler optimizations? They should have nuked the whole thing. Even ideone junked the lot.

Puppy
  • When you say you can't do I/O in a timed benchmark, I don't see the I/O occurring inside the for-loop that is being timed ... wouldn't a benchmark need to at some point output its results? – Jason Nov 15 '11 at 15:10
  • I assumed that the `/*print*/` comment actually contained some printing. – Puppy Nov 15 '11 at 15:11
  • there is no I/O insides the loop and the both tests are identical – codablank1 Nov 15 '11 at 15:14
  • @DeadMG: we are talking `microseconds` and the iteration takes place 1000 times. So that's about 30 **nano**seconds per access. – Matthieu M. Nov 15 '11 at 15:19
  • @codablank1: Then why didn't the compiler remove the whole lot? – Puppy Nov 15 '11 at 15:19
  • To be honest, with some modifications to the test, I'm also seeing that **access** to unique_ptr's value is consistently almost twice as slow (with gcc 4.4.1 MinGW). As far as I can see, "you should be doing some ref-counting" completely misses what's asked. The question is precisely about the unexpected slowness of access via unique_ptr. Feel welcome to recommend better profiling techniques, but don't recommend just timing something different instead. – UncleBens Nov 15 '11 at 16:57
3

You're not testing anything useful here.

What you are talking about: copy

What you are testing: iteration

If you want to test copy, you actually need to perform a copy. Both smart pointers should have similar performance when it comes to reading, because good shared_ptr implementations keep a local copy of the raw pointer and dereference that directly.

EDIT:

Regarding the new elements:

In general, it's not worth talking about speed when using debug code. If you care about performance you will use release code (-O2, typically), and that is what should be measured, as there can be significant differences between debug and release code. Most notably, inlining of template code can seriously decrease the execution time.

Regarding the benchmark:

  • I would add another round of measurements: naked pointers. Normally, unique_ptr and naked pointers should have the same performance; it would be worth checking, and it need not necessarily be true in debug mode.
  • You might want to "interleave" the execution of the two batches or, if you cannot, take the average of each over several runs. As it is, if the computer slows down near the end of the benchmark, only the unique_ptr batch will be affected, which will perturb the measurement.

You might be interested in learning more from Neil's The Joy of Benchmarks. It's not a definitive guide, but it's quite interesting, especially the part about forcing side effects to avoid dead-code removal ;)

Also, be careful about how you measure. The resolution of your clock might be coarser than it appears. If the clock is updated only every 15 us, for example, then any measurement around 15 us is suspect. This can be an issue when measuring release code (you might need to add a few more turns to the loop).

Matthieu M.
  • I know, my question is ill-formed but I can't explain the difference between both of them – codablank1 Nov 15 '11 at 15:12
  • @codablank1: I don't have a compiler at hand that can compile your code sample (lack of C++11 support...) so I cannot say. If you can inspect the Intermediate Representation or the assembly output you may be able to spot the difference. No idea here... especially since my hunch would be that `unique_ptr` ought to be faster as it's smaller, though it may not show much on only 100 items. – Matthieu M. Nov 15 '11 at 15:23
  • As far as I can see the asker is testing the right thing. Assume the usage pattern is fill vector once, access many times. The asker is worried about unique_ptr appearing to be worse for the latter. – UncleBens Nov 15 '11 at 17:17
  • @UncleBens: it seems the question changed substantially since I answered. The original question didn't mention that access was the concern and only talked about copying/moving before proposing the code. – Matthieu M. Nov 16 '11 at 07:15
  • Not really, IMO. It seems that you take the question to be: how to create a benchmark which demonstrates the superiority of unique_ptr? - BTW, I made those tests separately compilable and checked the assembly. The only difference in the inner loop came from the difference in size between the pointers. The access itself was completely inlined in both cases. Yet, the unique_ptr test kept being almost 2 times slower. Some subtle difference in the surrounding assembly? – UncleBens Nov 16 '11 at 07:57
  • @UncleBens: I *expect* `unique_ptr` to perform identically to `T*`, and at least as well as `shared_ptr` by design. As I said, `shared_ptr` implementations usually contain a local `T*` "cache" to avoid double dereferencing, so I am not surprised about full inlining in both cases, but I would have expected the "double" size of `shared_ptr` to slow it down (cache-wise). Cache effects can be surprising though, so maybe it's one of those cases. – Matthieu M. Nov 16 '11 at 08:18