
TL;DR: Is the optimizer of VS2013 confused, are my measurements wrong, does the global dummy in fact need to be volatile to make the test valid, or ____ ?

Disclaimer: This is mostly out of "academic" interest; I would not expect the differences I see to really affect any production code.


Intro: Some recent measurements of mine led me to this question, because I saw significant differences between std::vector<std::unique_ptr<T> > and boost::ptr_vector on VS2013 (also see the comments there).

It would appear that for my specific test case, accessing elements in a boost::ptr_vector can be 50% faster than using a vector of unique_ptr!

My test code is here: http://coliru.stacked-crooked.com/a/27dc2f1b91380cca (I'll refrain from including it all in this question; snippets follow below.)

  • gcc 4.8 doesn't report any differences, so this is a VS2013 thing.

    Start...
    The timings are as follows for accessing all (1000000) elements 200 times:
    * St6vectorISt10unique_ptrIjSt14default_deleteIjEESaIS3_EE: 1764 ms
    * N5boost10ptr_vectorIjNS_20heap_clone_allocatorESaIPvEEE: 1781 ms
    Dummy output: 500000
    
  • My Timings for exactly the test code linked to are:

    Start...
    The timings are as follows for accessing all (1.000.000) elements 200 times:
    * class std::vector<....>: 344 ms
    * class boost::ptr_vector<unsigned int,....>: 216 ms
    Dummy output: 500.000
    

The test loop looks like this; I'll also keep the lengthy comment there, which explains what I see:

template<typename C>
void RunContainerAccess(C& c) {
    for (size_t i = 0; i != loop_count; ++i) {
        for (auto& e : c) {
            // This is relevant: 
            // If the if-condition is present, VC++2013 will show 
            // approx. the same runtime for both cases. However,
            // if the only line in this loop is assigning the element
            // to the pGlobalDummy, *then* ptr_vector will run approx. 50%
            // faster than the vector<unique_ptr> version!
            //
            // g++-4.8 does not show this behaviour
            //
            // Note: VS2013 command line: (release; /O2; no whole prg opt)
            //   /GS /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"Release\vc120.pdb" /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_LIB" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oy- /Oi /MD /Fa"Release\" /EHsc /nologo /Fo"Release\" /Fp"Release\simple.pch" 
            //
            // Note: http://coliru.stacked-crooked.com/ command line:
            //   g++-4.8 -std=c++11 -O2 -Wall -pedantic -pthread main.cpp && ./a.out

            // if (pGlobalDummy)
                pGlobalDummy = PtrShim(e);
        }
    }
}

If the only line in the loop is accessing the element (putting the ptr into a global dummy), then it would appear that the VS2013 optimizer does something weird. When the if (pGlobalDummy) is present, both cases are the same.
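To illustrate why the if-less loop invites this kind of optimizer behavior, here is a simplified sketch (with hypothetical stand-in types, not the question's actual MyType/PtrShim code): when the global is not volatile, only the last store to it is observable, so the compiler may legally collapse the whole inner loop to a single assignment under the as-if rule.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the question's types.
using MyType = unsigned int;
MyType* pGlobalDummy = nullptr;  // plain (non-volatile) global

// The loop as written: every iteration stores into the global.
void RunLoop(const std::vector<MyType*>& c) {
    for (MyType* e : c)
        pGlobalDummy = e;  // intermediate stores are dead
}

// What an optimizer is allowed to emit "as if" instead,
// since only the final store is observable:
void RunLoopAsIf(const std::vector<MyType*>& c) {
    if (!c.empty())
        pGlobalDummy = c.back();
}
```

Both functions leave the global holding the last element's pointer, which is why the compiler is free to pick either form; adding the `if (pGlobalDummy)` read presumably makes the transformation harder.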

Can anyone shed some light on this?

Thanks to Howard's answer I did find that adding volatile to the global dummy makes a difference, i.e. when the global dummy is volatile like so:

extern MyType* volatile pGlobalDummy;
MyType* volatile pGlobalDummy = nullptr;

The loops run a bit slower, but they both run in exactly the same time. Should volatile make a difference here? That is, is the test even valid without volatile?
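For illustration, a minimal sketch (with hypothetical names, not the question's actual code) of what volatile changes: every access to a volatile-qualified object is an observable side effect, so the compiler must perform a real store on each iteration and cannot coalesce the assignments or hoist them out of the loop, whereas a plain global may be kept in a register and stored once.

```cpp
#include <cassert>

using MyType = unsigned int;  // hypothetical stand-in for the test's MyType

MyType* pPlain = nullptr;              // stores may be coalesced into one
MyType* volatile pVolatile = nullptr;  // every store must actually happen

void WriteAll(MyType* first, MyType* last) {
    for (MyType* p = first; p != last; ++p) {
        pPlain = p;     // may live in a register until the loop exits
        pVolatile = p;  // must be written to memory each iteration
    }
}
```

Both globals end up holding the same final value; volatile only forces the intermediate stores to be performed, which is consistent with the volatile loops running slower but identically for both containers.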

Martin Ba
  • Any idea why gcc looks so slow here? It should be similar to msvc in performance. –  Jan 16 '14 at 21:09
  • @jdv-JandeVaan: The gcc timings are from the online compiler - the MSVC timings are from my machine :-) – Martin Ba Jan 17 '14 at 05:31

1 Answer


I found a bug in your test which gives optimizers permission to optimize in different and unpredictable ways. I don't know for sure that this is impacting your results, but it sure impacted mine.

I'm using tip-of-trunk clang + libc++ -O3.

When I run your code unmodified I get:

Start...
The timings are as follows for accessing all (1,000,000) elements 200 times:
* NSt3__16vectorINS_10unique_ptrIjNS_14default_deleteIjEEEENS_9allocatorIS4_EEEE: 0 ms
* N5boost10ptr_vectorIjNS_20heap_clone_allocatorENSt3__19allocatorIPvEEEE: 0 ms
Dummy output: 500,000

I changed the output units to nanoseconds and got:

Start...
The timings are as follows for accessing all (1,000,000) elements 200 times:
* NSt3__16vectorINS_10unique_ptrIjNS_14default_deleteIjEEEENS_9allocatorIS4_EEEE: 32 ns
* N5boost10ptr_vectorIjNS_20heap_clone_allocatorENSt3__19allocatorIPvEEEE: 32 ns
Dummy output: 500,000

Suspicious, I inserted volatile here:

extern MyType* volatile pGlobalDummy;
MyType* volatile pGlobalDummy = nullptr;

but no change.

Then I noted that time[2] isn't being initialized, so I:

chron::nanoseconds time[2] = {};

That did it. Now setting units back to milliseconds I get:

Start...
The timings are as follows for accessing all (1,000,000) elements 200 times:
* NSt3__16vectorINS_10unique_ptrIjNS_14default_deleteIjEEEENS_9allocatorIS4_EEEE: 394 ms
* N5boost10ptr_vectorIjNS_20heap_clone_allocatorENSt3__19allocatorIPvEEEE: 406 ms
Dummy output: 500,000

So I'm curious: if you explicitly zero your time[2], e.g.:

chron::nanoseconds time[2] = {chron::nanoseconds(0), chron::nanoseconds(0)};

Does this impact the results you are seeing?

Clarification

The std::chrono::duration default constructor is specified as:

constexpr duration() = default;

This will default-initialize the duration's rep if the client does not specify list-initialization, e.g.:

chrono::nanoseconds ns;  // default-initialized

When rep is an arithmetic type, no initialization is performed ([dcl.init]/p7/b3).

If the client list-initializes, e.g.:

chrono::nanoseconds ns{};  // list-initialized

Then rep is value-initialized ([dcl.init.list]/p3/b7), and for arithmetic types, value-initialization is the same thing as zero-initialization ([dcl.init]/p8/b4).

Full working example:

#include <iostream>
#include <chrono>

int
main()
{
    std::chrono::nanoseconds n1;
    std::chrono::nanoseconds n2{};
    std::chrono::nanoseconds n3 = {};
    std::cout << "n1 = " << n1.count() << "ns\n";
    std::cout << "n2 = " << n2.count() << "ns\n";
    std::cout << "n3 = " << n3.count() << "ns\n";
}

For me, when compiled with -O0 I get:

n1 = 0ns
n2 = 0ns
n3 = 0ns

But compiling the same thing with -O3, this changes to:

n1 = 32ns
n2 = 0ns
n3 = 0ns
Howard Hinnant
  • I'm confused. [`duration`](http://en.cppreference.com/w/cpp/chrono/duration/duration) is a proper class and it doesn't zero initialize? The [boost implementation](http://www.boost.org/doc/libs/1_55_0/boost/chrono/duration.hpp) does seem to zero initialize it. – Martin Ba Jan 17 '14 at 05:39
  • I've updated my answer with details about the default constructor of `duration`. – Howard Hinnant Jan 17 '14 at 17:49
  • @MartinBa Correct: the default constructor is defaulted in-class and is therefore trivial when the duration Rep is trivially constructible. Having the default initializer zero initialize would impose an expense when sometimes we don't want it. It's common for types like `duration` to have the same initialization behavior as the built-in arithmetic types. As another example, when I wrote 2d and 3d vector types for my own use I ensured they had trivial default constructors for performance reasons. – bames53 Jan 17 '14 at 18:07
  • I set the times to zero - no change. But your other change, `volatile` did the trick. If the global variable is marked volatile, the times are the same. (~ 480ms with, vs 360ms/230ms without) ... I updated the question accordingly. – Martin Ba Jan 17 '14 at 19:09
  • Howard, thanks for your help here, just to let you know: The VS implementation of duration *does* value initialize -- `: _MyRep()` the duration. The boost implementation uses `rep_(...::zero())` so both do init the rep. – Martin Ba Jan 18 '14 at 21:05