
I have read the question What's the performance penalty of weak_ptr? but my own tests show different results.

I'm making delegates with smart pointers. The simple code below reproduces the performance issue with weak_ptr. Can anybody tell me why?

#include <chrono>
#include <functional>
#include <iostream>
#include <memory>
#include <stdint.h>
#include <string>
#include <utility>

struct Foo
{
    Foo() : counter(0), incrStep(1) {}

    void bar()
    {
        counter += incrStep;
    }

    virtual ~Foo()
    {
        std::cout << "End " << counter << std::endl;
    }
private:
    uint64_t counter;
    uint64_t incrStep;
};

void pf(const std::string &md, const std::function<void()> &g)
{
    const auto st = std::chrono::high_resolution_clock::now();
    g();
    const auto ft = std::chrono::high_resolution_clock::now();
    const auto del = std::chrono::duration_cast<std::chrono::milliseconds>(ft - st);
    std::cout << md << " \t: \t" << del.count() << std::endl;
}

And the test:

int main(int , char** )
{
    volatile size_t l = 1000000000ULL;
    size_t maxCounter = l;

    auto a = std::make_shared<Foo>();
    std::weak_ptr<Foo> wp = a;

    pf("call via raw ptr        ", [=](){
        for (size_t i = 0; i < maxCounter; ++i)
        {
            auto p = a.get();
            if (p)
            {
                p->bar();
            }
        }
    });

    pf("call via shared_ptr      ", [=](){
        for (size_t i = 0; i < maxCounter; ++i)
        {
            if (a)
            {
                a->bar();
            }
        }
    });

    pf("call via weak_ptr       ", [=](){
        std::shared_ptr<Foo> p;
        for (size_t i = 0; i < maxCounter; ++i)
        {
            p = wp.lock();
            if (p)
            {
                p->bar();
            }
        }
    });

    pf("call via shared_ptr copy", [=](){
        volatile std::shared_ptr<Foo> p1 = a;
        std::shared_ptr<Foo> p;
        for (size_t i = 0; i < maxCounter; ++i)
        {
            p = const_cast<std::shared_ptr<Foo>& >(p1);
            if (p)
            {
                p->bar();
            }
        }
    });

    pf("call via mem_fn         ", [=](){
        auto fff = std::mem_fn(&Foo::bar);
        for (size_t i = 0; i < maxCounter; ++i)
        {
            fff(a.get());
        }
    });

    return 0;
}

Results:

$ ./test
call via raw ptr            :   369
call via shared_ptr         :   302
call via weak_ptr           :   22663
call via shared_ptr copy    :   2171
call via mem_fn             :   2124
End 5000000000

As you can see, weak_ptr is about 10 times slower than copying a shared_ptr or using std::mem_fn, and about 60 times slower than using a raw pointer or shared_ptr::get().

user2807083
  • Did you test an optimized build? – TartanLlama Feb 01 '16 at 15:50
  • Yes, I am using g++ -O3 -std=c++11 for building my test – user2807083 Feb 01 '16 at 15:59
  • A `weak_ptr` needs to do a thread-safe acquisition of a `shared_ptr`, so it's bound to be slow. You should only use a `weak_ptr` when you can't know whether the shared object has been destroyed or not. Otherwise use a *raw pointer*. – Galik Feb 01 '16 at 16:03
  • Ok, I know that, but why is copying a shared_ptr not as slow as weak_ptr? After all, copying a shared_ptr changes the ref counter in a thread-safe way too. – user2807083 Feb 01 '16 at 16:11
  • Slightly OT: when I tried this with gcc v5.3.0, the `mem_fn` part took no time at all, which suggested that it had optimized the thousand million calls into a simple one-time increment of the counter. So I changed counter to `volatile`, and then the raw_ptr and shared_ptr cases took the same amount of time as shared_ptr copy and mem_fn. I'd take a look at how your compiler optimizes the raw_ptr and shared_ptr cases. (With v4.9, I got results similar to yours.) – rici Feb 01 '16 at 16:27
  • I am using g++ (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6). Imho, better gcc 5.3 optimization is not good for this case, because I try to cheat the compiler and make it believe that Foo::bar is a non-trivial function. At least until it can optimize weak_ptr just as well. – user2807083 Feb 01 '16 at 16:32
  • Also, compiled with clang-3.6 (and libc++), the results are 0, 0, 23178, 20972, 0. Again, I made the 0s into reasonable numbers (2280, 2406, 23071, 20110, 2415). But it's interesting that the difference between locking a weak_ptr and copying a shared_ptr disappears. – rici Feb 01 '16 at 16:46
  • I think you're seeing quirks of optimization. The weak_ptr case is the only case where the number of times the function is invoked cannot be deduced at compile time. – David Schwartz Feb 01 '16 at 17:25
  • Interesting: when I use the -O2, -O1 or -Os optimization level, the `weak_ptr` call time is even higher than with -O3, nearly 14000 ms. – user2807083 Feb 01 '16 at 17:30
  • dtbeaver: yes, my point was precisely that the compiler is applying some collection of optimizations, so you don't know what you are actually measuring in this benchmark (the speed of a shared_ptr call, or a compiler optimization which eliminates the call?). For real uses in production code, the benchmark may not be even slightly relevant, because the optimizations applied to the benchmark might or might not apply to the real code (and more likely do not). In short, it is usually better to profile real code than to try to create micro-benchmarks. – rici Feb 01 '16 at 17:41
  • Yeah, it's really terrible how these optimizing compilers make all your benchmarks and manual optimizations obsolete... – Peter - Reinstate Monica Mar 18 '18 at 22:03
  • @PeterA.Schneider Sarcasm intended? – user2807083 Mar 19 '18 at 06:10
  • @user2807083 Yes ;-) I have on occasion run benchmarks which suddenly "disappeared" with a new compiler version. Thing is, when you are forced to do something significant in the loop (like produce random numbers), that stuff is likely to dominate the run time anyway, as opposed to the pointer dereferencing, function calls, or whatever one tries to benchmark. So these questions are often academic. I have also observed (in questions on SO) that modern CPUs are sensitive to apparently insignificant changes which align the code better or allow a non-obvious optimization. – Peter - Reinstate Monica Mar 19 '18 at 08:26
  • Your numbers suggest that it's 100 times slower, not 10 times :) – Pavel P Nov 27 '20 at 09:43

1 Answer


In trying to reproduce your test I realised that the optimizer might be eliminating more than it should. What I did was use random numbers to defeat the over-optimization, and these results seem more realistic, with std::weak_ptr being nearly three times slower than std::shared_ptr or a raw pointer.

I calculate a checksum in each test to ensure they are all doing the same work:

#include <chrono>
#include <cstdlib> // for std::rand
#include <memory>
#include <random>
#include <vector>
#include <iomanip>
#include <iostream>

#define OUT(m) do{std::cout << m << '\n';}while(0)

class Timer
{
    using clock = std::chrono::steady_clock;
    using microseconds = std::chrono::microseconds;

    clock::time_point tsb;
    clock::time_point tse;

public:

    void start() { tsb = clock::now(); }
    void stop()  { tse = clock::now(); }
    void clear() { tsb = tse; }

    friend std::ostream& operator<<(std::ostream& o, const Timer& timer)
    {
        return o << timer.secs();
    }

    // return time difference in seconds
    double secs() const
    {
        if(tse <= tsb)
            return 0.0;

        auto d = std::chrono::duration_cast<microseconds>(tse - tsb);

        return double(d.count()) / 1000000.0;
    }
};

constexpr auto N = 100000000U;

int main()
{
    std::mt19937 rnd{std::random_device{}()};
    std::uniform_int_distribution<int> pick{0, 100};

    std::vector<int> random_ints;
    for(auto i = 0U; i < 1024; ++i)
        random_ints.push_back(pick(rnd));

    std::shared_ptr<int> sptr = std::make_shared<int>(std::rand() % 100);
    int* rptr = sptr.get();
    std::weak_ptr<int> wptr = sptr;

    Timer timer;

    unsigned sum = 0;

    sum = 0;
    timer.start();
    for(auto i = 0U; i < N; ++i)
    {
        sum += random_ints[i % random_ints.size()] * *sptr;
    }
    timer.stop();

    OUT("sptr: " << sum << " " << timer);

    sum = 0;
    timer.start();
    for(auto i = 0U; i < N; ++i)
    {
        sum += random_ints[i % random_ints.size()] * *rptr;
    }
    timer.stop();

    OUT("rptr: " << sum << " " << timer);

    sum = 0;
    timer.start();
    for(auto i = 0U; i < N; ++i)
    {
        sum += random_ints[i % random_ints.size()] * *wptr.lock();
    }
    timer.stop();

    OUT("wptr: " << sum << " " << timer);
}

Compiler flags:

g++ -std=c++14 -O3 -g0 -D NDEBUG -o bin/timecpp src/timecpp.cpp

Example Output:

sptr: 1367265700 1.26869 // shared pointer
rptr: 1367265700 1.26435 // raw pointer
wptr: 1367265700 2.99008 // weak pointer
Galik
  • This doesn't answer the question. The question, as I read it, is "what makes weak_ptr slow?", not "why doesn't [some code] show that weak_ptr is slow?" – Matthew James Briggs Aug 20 '17 at 04:59
  • @MatthewJamesBriggs The way I read the question is "Why is it slow in my specific tests?", because he links to a question that already explains why it is slow. But the OP is surprised that **his** tests are yielding *even slower* performance. And he wants to know why. The title is "Why calling via weak_ptr is **so** slow?" (emphasis on **so**) – Galik Aug 20 '17 at 05:13