163

How much is the overhead of smart pointers compared to normal pointers in C++11? In other words, is my code going to be slower if I use smart pointers, and if so, how much slower?

Specifically, I'm asking about the C++11 std::shared_ptr and std::unique_ptr.

Obviously, the stuff pushed onto the stack is going to be larger (at least I think so), because a smart pointer also needs to store its internal state (reference count, etc.). The real question is: how much is this going to affect my performance, if at all?

For example, when I return a smart pointer from a function instead of a normal pointer:

std::shared_ptr<const Value> getValue();
// versus
const Value *getValue();

Or, for example, when one of my functions accepts a smart pointer as a parameter instead of a normal pointer:

void setValue(std::shared_ptr<const Value> val);
// versus
void setValue(const Value *val);
Venemo
    The only way to know is to benchmark your code. – Basile Starynkevitch Mar 10 '14 at 08:54
    Which one do you mean? `std::unique_ptr` or `std::shared_ptr`? – stefan Mar 10 '14 at 08:55
The answer is 42. (In other words: who knows; you need to profile your code and understand your typical workload on your hardware.) – Nim Mar 10 '14 at 08:56
  • Your application needs to make extreme use of smart pointers for it to be significant. – user2672165 Mar 10 '14 at 10:59
The cost of using a shared_ptr in a simple setter function is terrible and will add an overhead of several hundred percent. – Lothar Dec 17 '17 at 09:40
  • One way to look at it is this: If your dynamically-allocated objects are small/lightweight enough that the additional overhead of using smart-pointers to track them is a non-negligible portion of the resource-load, then perhaps you should be using value/copy-semantics instead (and thereby avoiding pointers and dynamic allocation entirely). Or to put it the other way, if the objects are big enough to be worth a per-object-heap-allocation, they are probably big enough to be worth a per-object-smart-pointer as well. – Jeremy Friesner Jan 23 '19 at 04:31
The “overhead” of shared_ptr that most people forget about is the COMPILE-TIME COST of repeated template instantiations on different types. Clang (8) has very slow shared_ptr (and function) T-instantiations, in the ~100ms range on a fully-loaded 2019 MacBook Pro! unique_ptr, by contrast, has no such compile-time cost. YMMV, but this is a deal-breaker when used in heavily templated code. I recommend “lightweight shared ptr” implementations where practical. – user2864740 Nov 03 '19 at 23:36

6 Answers

237

std::unique_ptr has memory overhead only if you provide it with some non-trivial deleter.

std::shared_ptr always has memory overhead for reference counter, though it is very small.
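
A minimal sketch of these size claims (the exact sizes are implementation details, but the assertions below hold on the common implementations: libstdc++, libc++, and MSVC):

#include <memory>

// unique_ptr with the default (stateless) deleter is the size of a raw pointer.
static_assert(sizeof(std::unique_ptr<int>) == sizeof(int*),
              "no memory overhead with a trivial deleter");

// A stateful deleter is stored inside the unique_ptr and enlarges it.
struct CountingDeleter {
    long count = 0;
    void operator()(int* p) { ++count; delete p; }
};
static_assert(sizeof(std::unique_ptr<int, CountingDeleter>) > sizeof(int*),
              "a non-trivial deleter adds memory overhead");

// shared_ptr carries two pointers: one to the object, one to the control block.
static_assert(sizeof(std::shared_ptr<int>) == 2 * sizeof(int*),
              "shared_ptr is two pointers wide");

int main() {}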

std::unique_ptr has time overhead only during constructor (if it has to copy the provided deleter and/or null-initialize the pointer) and during destructor (to destroy the owned object).

std::shared_ptr has time overhead in constructor (to create the reference counter), in destructor (to decrement the reference counter and possibly destroy the object) and in assignment operator (to increment the reference counter). Due to thread-safety guarantees of std::shared_ptr, these increments/decrements are atomic, thus adding some more overhead.
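
A minimal sketch of where those counter updates happen (the variable names are illustrative):

#include <memory>
#include <cassert>

int main() {
    auto a = std::make_shared<int>(1); // control block allocated, count == 1
    assert(a.use_count() == 1);

    std::shared_ptr<int> b = a;        // copy: atomic increment of the count
    assert(a.use_count() == 2);

    b.reset();                         // atomic decrement; the int survives
    assert(a.use_count() == 1);
}                                      // count reaches 0: int and control block freed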

Note that neither of them has time overhead in dereferencing (getting a reference to the owned object), even though dereferencing seems to be the most common operation on pointers.

To sum up, there is some overhead, but it shouldn't make the code slow unless you continuously create and destroy smart pointers.

lisyarus
    `unique_ptr` has no overhead in the destructor. It does exactly the same as you would with a raw pointer. – R. Martinho Fernandes Dec 15 '14 at 11:22
@R.MartinhoFernandes Compared to the raw pointer itself, it does have time overhead in the destructor, since a raw pointer's destructor does nothing. Compared to how a raw pointer would probably be used, it surely has no overhead. – lisyarus Dec 15 '14 at 13:36
    Worth noting that part of the shared_ptr construction/destruction/assignment cost is due to thread safety – Joe Mar 01 '16 at 17:25
    Also, what about the default constructor of `std::unique_ptr`? If you construct a `std::unique_ptr`, the internal `int*` gets initialized to `nullptr` whether you like it or not. – Martin Drozdik May 14 '16 at 17:57
  • @Joe Thank you! Added this to the answer. – lisyarus Mar 19 '17 at 11:33
@MartinDrozdik In most situations you'd null-initialize the raw pointer too, to check its nullity later, or something like that. Nevertheless, added this to the answer, thank you. – lisyarus Mar 19 '17 at 11:37
    Are you certain that `std::shared_ptr` incurs no overhead when dereferencing the object? To my knowledge, `shared_ptr` points to a proxy object which holds a pair: {reference count, pointer to the actual object}. Therefore, you need to perform two jumps in the memory, not one to reach your object. – CygnusX1 Jun 10 '19 at 09:50
  • @CygnusX1 Yes, I am. A `std::shared_ptr` has two pointers: the *owned* pointer and the *referenced* pointer (see constructor #8 here https://en.cppreference.com/w/cpp/memory/shared_ptr/shared_ptr). These two pointers usually coincide, but what if you want a shared pointer to a member of a class that is itself stored through the shared pointer? You make a shared pointer that *owns* the whole class instance, but *references* the member. – lisyarus Jun 10 '19 at 11:05
  • @CygnusX1 Implementations usually use the proxy to store the *owned* pointer + reference count, and store the *referenced* pointer in the `shared_ptr` object itself, speeding up access. Here's a dumb verification that `sizeof(shared_ptr) == 2 * sizeof(pointer)`: https://ideone.com/XFq5Vc – lisyarus Jun 10 '19 at 11:05
@R.MartinhoFernandes Looking at the GCC code, this isn't true. During destruction, `unique_ptr` checks to see if the value is `nullptr` and always sets itself to `nullptr` after. This is because (unlike `delete`ing a raw pointer) a custom deleter may not handle `nullptr` well, and the pointer itself doesn't know whether or not it is going out of scope. I find it unlikely the compiler will optimise this if the deleter cannot be inlined. – c z Sep 03 '21 at 12:58
I am pretty sure there is a performance hit during dereferencing because, I assume, there are atomic operations involved and therefore a memory barrier. – MK. Jan 26 '23 at 21:45
    @MK. Only the reference count is atomic, the pointer itself is stored as a simple raw pointer. So, no performance hit during dereferencing. – lisyarus Jan 27 '23 at 22:04
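
A minimal sketch of the owned-versus-referenced distinction discussed in the comments above (Widget and its member are hypothetical names):

#include <memory>
#include <iostream>

struct Widget {
    int id = 7;
};

int main() {
    auto whole = std::make_shared<Widget>();

    // Aliasing constructor: member shares ownership of the Widget (the owned
    // pointer) but dereferences to its data member (the referenced pointer).
    std::shared_ptr<int> member(whole, &whole->id);

    whole.reset();                // the Widget stays alive; member still owns it
    std::cout << *member << '\n'; // prints 7
}
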
78

My answer is different from the others, and I really wonder if they ever profiled code.

shared_ptr has a significant overhead for creation because of its memory allocation for the control block (which keeps the ref counter and a pointer list to all weak references). It also has a huge memory overhead because of this, and because std::shared_ptr is always a 2-pointer tuple (one to the object, one to the control block).

If you pass a shared_ptr to a function as a value parameter, it will be at least 10 times slower than a normal call and will create lots of code in the code segment for stack unwinding. If you pass it by reference, you get an additional indirection, which can also be pretty bad in terms of performance.

That's why you should not do this unless the function is really involved in ownership management. Otherwise use "shared_ptr.get()". A shared_ptr is not designed to make sure your object isn't killed during a normal function call.
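
A minimal sketch of that guideline (the Value layout and function names are made up): only the function that actually participates in ownership takes a shared_ptr; everything else takes a reference or a raw pointer obtained via get().

#include <memory>
#include <iostream>

struct Value { int v = 42; };

// Not involved in ownership: take a reference, no reference-count traffic.
void print(const Value& val) { std::cout << val.v << '\n'; }

// Involved in ownership: this one may legitimately keep the object alive.
std::shared_ptr<const Value> stored;
void store(std::shared_ptr<const Value> val) { stored = std::move(val); }

int main() {
    auto val = std::make_shared<const Value>();
    print(*val.get()); // or simply *val; no atomic operations happen here
    store(val);        // ownership is really shared: pass the shared_ptr
}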

If you go mad and use shared_ptr on small objects, like an abstract syntax tree in a compiler or small nodes in any other graph structure, you will see a huge performance drop and a huge memory increase. I have seen a parser system which was rewritten soon after C++14 hit the market, before the programmer learned to use smart pointers correctly. The rewrite was an order of magnitude slower than the old code.

It is not a silver bullet, and raw pointers aren't bad by definition either. Bad programmers are bad and bad design is bad. Design with care, design with clear ownership in mind, and try to use shared_ptr mostly on subsystem API boundaries.

If you want to learn more, you can watch Nicolai M. Josuttis's good talk "The Real Price of Shared Pointers in C++": https://vimeo.com/131189627
It goes deep into the implementation details and CPU architecture (write barriers, atomic locks, etc.); once you have listened, you will never again call this feature cheap. If you just want proof of the magnitude, skip the first 48 minutes and watch him run example code that runs up to 180 times slower (compiled with -O3) when using shared pointers everywhere.

EDITED:

And if you are asking about "std::unique_ptr", then watch the talk "CppCon 2019: Chandler Carruth “There Are No Zero-cost Abstractions”": https://www.youtube.com/watch?v=rHIkrotSwcc

It's simply not true that unique_ptr is 100% cost free.

OFFTOPIC:

For over two decades now I have tried to educate people out of the false idea that using exceptions that are never thrown carries no cost penalty. In this case the cost hides in the optimizer and in the code size.

Lothar
    Thanks for your answer! Which platform did you profile on? Can you back up your claims with some data? – Venemo Dec 18 '17 at 17:13
  • I have no number to show, but you can find some in Nico Josuttis talk https://vimeo.com/131189627 – Lothar Dec 20 '17 at 18:44
    Ever heard of `std::make_shared()`? Also, I find demonstrations of blatant misuse being bad a bit boring... – Deduplicator Dec 23 '17 at 19:32
    All "make_shared" can do is safe you from one additional allocation and give you a bit more cache locality if the control block is allocated in front of the object. It can't not help at all when you pass the pointer around. This is not the root of the problems. – Lothar Dec 24 '17 at 05:25
    This answer is nice as far as it goes, but the OP explicitly asked for information on `std::unique_ptr<...>` too. This is just a rant about `std::shared_ptr<...>`. – geometrian Jul 09 '21 at 01:20
@imallett No, he did not. He asked about shared_ptr, and his examples all used shared_ptr, because this is the really important use case. For unique_ptr he should watch "CppCon 2019: Chandler Carruth “There Are No Zero-cost Abstractions”" on YouTube. I will add this to the answer. – Lothar Jul 15 '21 at 08:32
I believe you made a mistake saying "and a pointer list to all weak references". While that may be one way of implementing shared/weak pointers, I think most of the time (see the msvc & clang implementations, for instance) it is done through a double counter (one for strong refs and one for weak ones). The control block (and the object block, when allocated through allocate_shared) is kept allocated until all strong and weak refs are destroyed. – Victor Drouin Aug 26 '21 at 08:00
27

As with all code performance, the only really reliable means to obtain hard information is to measure and/or inspect machine code.

That said, simple reasoning says that

  • You can expect some overhead in debug builds, since e.g. operator-> must be executed as a function call so that you can step into it (this is in turn due to general lack of support for marking classes and functions as non-debug).

  • For shared_ptr you can expect some overhead in initial creation, since that involves dynamic allocation of a control block, and dynamic allocation is very much slower than any other basic operation in C++ (do use make_shared when practically possible, to minimize that overhead).

  • Also for shared_ptr there is some minimal overhead in maintaining a reference count, e.g. when passing a shared_ptr by value, but there's no such overhead for unique_ptr (a sketch follows this list).
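
A minimal sketch of the last two points (all names are illustrative): make_shared folds the object and its control block into a single allocation, and passing by const reference avoids the reference-count update that a by-value parameter performs.

#include <memory>

struct Thing { int x = 0; };

// Two allocations: one for the Thing, one for the control block.
std::shared_ptr<Thing> twoAllocs() { return std::shared_ptr<Thing>(new Thing); }

// One allocation holding both the Thing and the control block.
std::shared_ptr<Thing> oneAlloc() { return std::make_shared<Thing>(); }

// By value: copying the shared_ptr costs an atomic increment (and later a decrement).
int byValue(std::shared_ptr<Thing> p) { return p->x; }

// By const reference: no reference-count traffic at all.
int byRef(const std::shared_ptr<Thing>& p) { return p->x; }

int main() {
    auto t = oneAlloc();
    return byValue(t) + byRef(t) + twoAllocs()->x;
}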

Keeping the first point above in mind, when you measure, do that both for debug and release builds.

The international C++ standardization committee published a technical report on performance, but that was in 2006, before unique_ptr and shared_ptr were added to the standard library. Still, smart pointers were old hat at that point, so the report considered them as well. Quoting the relevant part:

“if accessing a value through a trivial smart pointer is significantly slower than accessing it through an ordinary pointer, the compiler is inefficiently handling the abstraction. In the past, most compilers had significant abstraction penalties and several current compilers still do. However, at least two compilers have been reported to have abstraction penalties below 1% and another a penalty of 3%, so eliminating this kind of overhead is well within the state of the art”

As an informed guess, the “well within the state of the art” has been achieved with the most popular compilers today, as of early 2014.

Cheers and hth. - Alf
  • Could you please include some details in your answer about the cases I added to my question? – Venemo Mar 12 '14 at 10:03
This might have been true 10 or more years ago, but today, inspecting machine code is not as useful as suggested above. How instructions are pipelined and vectorized, and how the compiler/processor deals with speculation, ultimately determines how fast the code is. Less machine code doesn't necessarily mean faster code. The only way to determine the performance is to profile it. This can change on a per-processor and per-compiler basis. – Byron Jan 04 '19 at 17:58
    An issue I've seen is that, once shared_ptrs are used in a server, then the usage of shared_ptrs begin to proliferate, and soon shared_ptrs become the default memory management technique. So now you have repeated 1-3% abstraction penalties which are taken over and over again. – Nathan Doromal Nov 27 '19 at 14:53
  • I think benchmarking a debug build is a complete and utter waste of time – Paul Childs Feb 24 '20 at 03:02
14

In other words, is my code going to be slower if I use smart pointers, and if so, how much slower?

Slower? Most likely not, unless you are creating a huge index using shared_ptrs and you don't have enough memory, to the point that your computer starts wrinkling like an old lady being pulled to the ground by an unbearable force from afar.

What would make your code slower is sluggish searches, unnecessary loop processing, huge copies of data, and a lot of write operations to disk (like hundreds).

The advantages of a smart pointer are all related to management. But is the overhead necessary? That depends on your implementation. Let's say you are iterating over an array of 3 phases, and each phase has an array of 1024 elements. Creating a smart pointer for this process might be overkill, since once the iteration is done you know you have to erase it. So you could gain extra memory by not using a smart pointer...
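
As a minimal sketch of that scenario (the 3×1024 layout is hypothetical): when the lifetime is confined to a single scope, plain automatic storage needs no pointer at all, smart or otherwise.

#include <cstddef>

void processPhases() {
    constexpr std::size_t kPhases = 3, kElems = 1024;

    // The lifetime is obvious: the buffer dies at the end of this scope,
    // so there is no ownership question for a smart pointer to answer.
    float data[kPhases][kElems] = {};

    for (std::size_t p = 0; p < kPhases; ++p)
        for (std::size_t e = 0; e < kElems; ++e)
            data[p][e] += 1.0f;
}

int main() { processPhases(); }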

But do you really want to do that?

A single memory leak could give your product a point of failure in time. Say your program leaks 4 megabytes each hour: it would take months to break a computer; nevertheless, it will break, and you know it will, because the leak is there.

It's like saying "your software is guaranteed for 3 months; after that, call me for service."

So in the end it really is a matter of: can you handle the risk? Is using a raw pointer to handle your indexing over hundreds of different objects worth losing control of the memory?

If the answer is yes, then use a raw pointer.

If you don't even want to consider it, a smart_ptr is a good, viable, and awesome solution.

Claudiordgz
ok, but valgrind is *good* at checking for possible memory leaks, so as long as you use it you should be safe™ – graywolf Mar 10 '14 at 12:43
  • @Paladin Yes, if you can handle your memory, `smart_ptr` are really useful for large teams – Claudiordgz Mar 10 '14 at 14:29
I use unique_ptr, it simplifies a lot of things, but I don't like shared_ptr; reference counting is not a very efficient GC, and it's not perfect either – graywolf Mar 10 '14 at 14:33
@Paladin I try to use raw pointers if I can encapsulate everything. If it is something that I will be passing around all over the place, like an argument, then maybe I'll consider a smart_ptr. Most of my unique_ptrs are used in the big implementation, like a main or run method – Claudiordgz Mar 10 '14 at 15:38
  • @Lothar I see you paraphrased one of the things I said in your answer: `Thats why you should not do this unless the function is really involved in ownership management`... great answer, thanks, upvoted – Claudiordgz Dec 23 '17 at 02:11
6

Chandler Carruth has a few surprising "discoveries" about unique_ptr in his 2019 CppCon talk (YouTube). I can't explain it quite as well.

I hope I understood the two main points right:

  • Code without unique_ptr will (often incorrectly) not handle cases where ownership is not passed while passing a pointer. Rewriting it to use unique_ptr adds that handling, and that handling has some overhead.
  • A unique_ptr is still a C++ object, and objects are passed on the stack when calling a function, unlike pointers, which can be passed in registers (see the sketch below).
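
A minimal sketch of the second point (the function names are hypothetical): under the common Itanium C++ ABI, a parameter type with a non-trivial destructor, such as unique_ptr, cannot travel in a register; the caller materializes it in memory and passes its address.

#include <memory>

void byRaw(int*) {}                    // raw pointer: typically passed in a register
void byUnique(std::unique_ptr<int>) {} // non-trivial destructor: passed indirectly,
                                       // via a hidden pointer to a stack temporary

int main() {
    std::unique_ptr<int> owned(new int(42));
    byRaw(owned.get());         // no ownership transfer, register-friendly
    byUnique(std::move(owned)); // ownership transfer goes through memory
}
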
Caesar
-3

Just for a glimpse, and just for the [] operator: it is ~5x slower than the raw pointer, as demonstrated in the following code, which was compiled using gcc -lstdc++ -std=c++14 -O0 and produced this output:

malloc []:     414252610                                                 
unique []  is: 2062494135                                                
uq get []  is: 238801500                                                 
uq.get()[] is: 1505169542
new is:        241049490 

I'm beginning to learn C++, but I keep this in mind: you always need to know what you are doing, and take the time to learn what others have done in C++.

EDIT

As mentioned by @Mohan Kumar, I have provided more details. The gcc version is 7.4.0 (Ubuntu 7.4.0-1ubuntu1~14.04~ppa1). The above result was obtained with -O0; however, with the '-O2' flag, I got this:

malloc []:     223
unique []  is: 105586217
uq get []  is: 71129461
uq.get()[] is: 69246502
new is:        9683

Then I switched to clang version 3.9.0. With -O0 the result was:

malloc []:     409765889
unique []  is: 1351714189
uq get []  is: 256090843
uq.get()[] is: 1026846852
new is:        255421307

With -O2 it was:

malloc []:     150
unique []  is: 124
uq get []  is: 83
uq.get()[] is: 83
new is:        54

The result of clang -O2 is amazing.

#include <memory>
#include <iostream>
#include <chrono>
#include <cstdint>

uint32_t n = 100000000; // 100 million bytes per test

// malloc'd buffer, accessed through a raw pointer (never freed)
void t_m(void){
    auto a  = (char*) malloc(n*sizeof(char));
    for(uint32_t i=0; i<n; i++) a[i] = 'A';
}

// unique_ptr array, accessed through operator[]
void t_u(void){
    auto a = std::unique_ptr<char[]>(new char[n]);
    for(uint32_t i=0; i<n; i++) a[i] = 'A';
}

// unique_ptr array, raw pointer hoisted out of the loop with get()
void t_u2(void){
    auto a = std::unique_ptr<char[]>(new char[n]);
    auto tmp = a.get();
    for(uint32_t i=0; i<n; i++) tmp[i] = 'A';
}

// unique_ptr array, get() called on every iteration
void t_u3(void){
    auto a = std::unique_ptr<char[]>(new char[n]);
    for(uint32_t i=0; i<n; i++) a.get()[i] = 'A';
}

// new[]'d buffer, accessed through a raw pointer (never deleted)
void t_new(void){
    auto a = new char[n];
    for(uint32_t i=0; i<n; i++) a[i] = 'A';
}

int main(){
    auto start = std::chrono::high_resolution_clock::now();
    t_m();
    auto end1 = std::chrono::high_resolution_clock::now();
    t_u();
    auto end2 = std::chrono::high_resolution_clock::now();
    t_u2();
    auto end3 = std::chrono::high_resolution_clock::now();
    t_u3();
    auto end4 = std::chrono::high_resolution_clock::now();
    t_new();
    auto end5 = std::chrono::high_resolution_clock::now();
    std::cout << "malloc []:     " <<  (end1 - start).count() << std::endl;
    std::cout << "unique []  is: " << (end2 - end1).count() << std::endl;
    std::cout << "uq get []  is: " << (end3 - end2).count() << std::endl;
    std::cout << "uq.get()[] is: " << (end4 - end3).count() << std::endl;
    std::cout << "new is:        " << (end5 - end4).count() << std::endl;
}

liqg3
I have tested the code now; it's only about 10% slower when using the unique pointer. – Mohan Kumar Jan 30 '19 at 22:11
never ever benchmark with `-O0` or debug code. The output will be [extremely inefficient](https://stackoverflow.com/q/53366394/995714). Always use at least `-O2` (or `-O3` nowadays, because some vectorization isn't done in `-O2`) – phuclv Apr 06 '19 at 11:09
If you have time and want a coffee break, use -O4 to get link-time optimization, and all the tiny abstraction functions get inlined and vanish. – Lothar Sep 03 '19 at 15:05
    You should include a `free` call in the malloc test, and `delete[]` for new (or make variable `a` static), because the `unique_ptr`s are calling `delete[]` under the hood, in their destructors. – RnMss Jun 30 '20 at 03:46
    @phuclv I disagree, ***both*** should be tested. **1.** Debug mode misses out optimisations which are often specific to the particular build platform and version, and can give you a "worst case" benchmark. **2.** Optimisations are easy for simple scripts but less prevalent in complex software with complex paths. I've seen numerous posters claiming super fast algorithms, only to see them later find the optimiser has just removed the entire test loop upon seeing that the output can be predetermined. **3.** Having software that runs considerably slower in debug mode is annoying to developers. – c z Sep 03 '21 at 13:12
  • @cz who cares about worst case benchmarks? Probably only RTOS applications. For most users only the optimized benchmark is useful. Who are annoyed about slow debug mode? MSVC debug mode may be 10 times slower because lots of STL debug code is injected and yet no one complains apart from you – phuclv Sep 04 '21 at 02:07
  • @Lothar with more recent versions of llvm, `-O4` doesn't include LTO anymore. See [this SO question](https://stackoverflow.com/questions/13924136/what-optimization-passes-are-done-for-o4-in-clang) for more info – ljleb Jan 23 '22 at 13:33