10

In another thread I started a discussion about Vectors and Arrays, in which I was largely playing devil's advocate, to push buttons. However, during the course of this, I stumbled onto a test case that has me a little perplexed, and I would like to have a real discussion about it, over the "abuse" I'm getting for playing devil's advocate, starting a real discussion on that thread is now impossible. However, the particular example has me intrigued, and I cannot explain it to myself satisfactorily.

The discussion is about the general performance of Vector vs Arrays, ignoring dynamic elements. Ex: obviously continually using push_back() in a vector is going to slow it down. We're assuming that the vector and array are pre-populated with data. The example I presented, and subsequently modified by an individual in the thread, was the following:

#include <iostream>
#include <vector>
#include <type_traits>
using namespace std;

const int ARRAY_SIZE = 500000000;

// http://stackoverflow.com/a/15975738/500104
template <class T>
class no_init_allocator
{
public:
    typedef T value_type;

    no_init_allocator() noexcept {}
    template <class U>
        no_init_allocator(const no_init_allocator<U>&) noexcept {}
    T* allocate(std::size_t n)
        {return static_cast<T*>(::operator new(n * sizeof(T)));}
    void deallocate(T* p, std::size_t) noexcept
        {::operator delete(static_cast<void*>(p));}
    template <class U>
        void construct(U*) noexcept
        {
            // libstdc++ doesn't know 'is_trivially_default_constructible', still has the old names
            static_assert(is_trivially_default_constructible<U>::value,
            "This allocator can only be used with trivally default constructible types");
        }
    template <class U, class A0, class... Args>
        void construct(U* up, A0&& a0, Args&&... args) noexcept
        {
            ::new(up) U(std::forward<A0>(a0), std::forward<Args>(args)...);
        }
};

int main() {
    srand(5);  //I use the same seed, we just need the random distribution.
    vector<char, no_init_allocator<char>> charArray(ARRAY_SIZE);
    //char* charArray = new char[ARRAY_SIZE];
    for(int i = 0; i < ARRAY_SIZE; i++) {
        charArray[i] = (char)((i%26) + 48) ;
    }

    for(int i = 0; i < ARRAY_SIZE; i++) {
        charArray[i] = charArray[rand() % ARRAY_SIZE];
    }
}

When I run this on my machine, I get the following terminal output. The first run is with the vector line uncommented, the second is with the array line uncommented. I used the highest level of optimization, to give the vector the best chance of success. Below are my results, the first two runs with the array line uncommented, the second two with the vector line.

//Array run # 1
clang++ -std=c++11 -stdlib=libc++ -o3 some.cpp -o b.out && time ./b.out

real    0m20.287s
user    0m20.068s
sys 0m0.175s

//Array run # 2
clang++ -std=c++11 -stdlib=libc++ -o3 some.cpp -o b.out && time ./b.out

real    0m21.504s
user    0m21.267s
sys 0m0.192s

//Vector run # 1
clang++ -std=c++11 -stdlib=libc++ -o3 some.cpp -o b.out && time ./b.out

real    0m28.513s
user    0m28.292s
sys 0m0.178s

//Vector run # 2
clang++ -std=c++11 -stdlib=libc++ -o3 some.cpp -o b.out && time ./b.out

real    0m28.607s
user    0m28.391s
sys 0m0.178s

That arrays outperform vectors does not surprise me, however, that the difference is on the order of 50% surprises me very much, I would expect that they would be negligible, and I feel like the nature of this test case me be obscuring the nature of the results. When you run this test on array sizes that are smaller, the performance differences dissipate dramatically.

My explanation:

The additional implementation instructions for vector are causing the vector instructions to align in memory poorly, perhaps even on this example, a split at a really bad point on 2 different "blocks". This is causing memory to jump back and forth between levels of cache vs data cache vs instruction cache more frequently than you would expect. I also suspect that the LLVM compiler may be exaggerating weaknesses, and optimizing poorly, due to some of the newer C++11 elements, though I have no reason for either of these explanations besides hypothesis and conjecture.

I am interested if A: that someone can replicate my results and B: if someone has a better explanation for how the computer is running this particular benchmark and why vector is so dramatically underperforming arrays in this instance.

My set up: http://www.newegg.com/Product/Product.aspx?Item=N82E16834100226

MobA11y
  • 18,425
  • 3
  • 49
  • 76
  • 14
    `-o3` should be `-O3`, you're not optimising. – Daniel Fischer May 08 '13 at 17:33
  • 2
    Interesting question, and so nice to see both source code, compiler flags and results, along with a detailed explanation. +1 for a well asked question. :) – jalf May 08 '13 at 17:35
  • I do not see any difference on VC++ – yngccc May 08 '13 at 17:40
  • I believe the -o3 vs -O3 comment has a point. I'm refactoring/testing. Usually llvm complains about errant flags, interesting. – MobA11y May 08 '13 at 17:41
  • @yngum: I'm not a windows guy, what compiler does VC++ use? I assume it's MS propietary over like MinGW. – MobA11y May 08 '13 at 17:42
  • 1
    Using `-O3` and changing `is_trivially_default_constructible` to `has_trivial_default_constructor` (using `libstdc++`), I get 0m26.250s for array and 0m27.438s for vector, so that's more or less the expected small difference. – Daniel Fischer May 08 '13 at 17:43
  • Inside your loops, there is not a whole lot going on, other than accessing the vector/array with the [] operator. For char *, `charArray[i]` just resolves to `charArray + i`, but for a vector, it is likely more complicated. For example, the GNU g++ vector implementation has this: `return *(begin() + __n)`. I bet if you put something much more complex on the right hand side of the assignment, which doesn't involve [], your timing differences will decrease. – Markku K. May 08 '13 at 17:47
  • @ChrisCM: VC++ *is* a compiler, they don't use someone else's. – Puppy May 08 '13 at 17:52
  • 1
    @Markku: What you outline is specifically what optimizations should fix though. – MobA11y May 08 '13 at 17:52
  • @DeadMG: go microsoft for getting it right then! It would figure, llvm, a primarily mac/xcode backed open source project, would be the culprit. I'm going back to ubuntu! Thank god I can install G++ on my mac! Thank you all for the discussion! – MobA11y May 08 '13 at 17:56
  • @ChrisCM: you are right, of course. It might be interesting to look at the assembly to see if clang is failing to inline the [] and begin() functions. – Markku K. May 08 '13 at 18:25
  • That would be interesting. I believe g++ assembler would be all but identical between the array and vector implementations. Any visible differences in the assembler within that loop would wrap up the issue quickly. – MobA11y May 08 '13 at 18:28

2 Answers2

9

A simpler explanation: you're building with optimisations disabled. You want -O3, not -o3.

I don't have clang available to exactly reproduce your tests, but my results are as follows:

//Array run # 1
$ g++ -std=c++11 -O3 test.cpp -o b.out && time ./b.out

real    0m25.323s
user    0m25.162s
sys 0m0.148s

//Vector run #1
$ g++ -std=c++11 -O3 test.cpp -o b.out && time ./b.out

real    0m25.634s
user    0m25.486s
sys 0m0.136s
Mike Seymour
  • 249,747
  • 28
  • 448
  • 644
  • Yes, I believe between this and the other answer, lies the correct answer. Usually LLVM/CLang complains about errant flags. Don't know how I missed it. But I still get results that differ in the 10% range, which given your results are still suspicious. I think I will stick with gcc until Clang figures out c++11 – MobA11y May 08 '13 at 17:45
  • 4
    @ChrisCM: In this case, it's not an errant flag. You're telling it to call the output `3`, then overriding that with the later `-o b.out`. – Mike Seymour May 08 '13 at 17:46
  • LLVM-specific misoptimization will readily account for the fractional difference.# – Puppy May 08 '13 at 17:49
  • Yep I agree, the point here did fix most of the problem, but the rest of the problem is missoptimization by LLVM, as clearly with g++ the results are, and should be, more dependent on system noise than any difference in instructions or assembly. – MobA11y May 08 '13 at 17:54
8

I can guarantee that LLVM does infact misoptimize std::vector (if you are in fact optimising at all), at least as of right now. It does not correctly inline many of the function calls involved. You will get better performance with GCC.

Puppy
  • 144,682
  • 38
  • 256
  • 465
  • This was my suspicion as well. I'm going to wait and see what others say, but this at least has a little to do with it. – MobA11y May 08 '13 at 17:45
  • I'd stick with the Puppy on this one. – Shotgun Ninja May 08 '13 at 17:49
  • I'm going to accept this answer. My -O3 mistake was the primary source of the error, however g++ still optimizes all of the vector issues away, and I believe my version of LLVM is the culprit for the ~10% weakness of the vector, after proper optimization. – MobA11y May 08 '13 at 17:51
  • @ChrisCM: Google reported a 12% loss with LLVM, so that's well within expected bounds. – Puppy May 08 '13 at 17:53