4

Is this a fair test for comparing a vector with an array? The difference in speed seems too large. My test suggests the array is 10 to 100 times faster!

#include "stdafx.h"
#include <iostream>
#include <vector>
#include <windows.h>
#include <stdint.h>

using namespace std;

double PCFreq = 0.0;
__int64 CounterStart = 0;

void StartCounter()
{
    LARGE_INTEGER li;
    if(!QueryPerformanceFrequency(&li))
        std::cout << "QueryPerformanceFrequency failed!\n";

    PCFreq = double(li.QuadPart)/1000000000;   // counts per nanosecond, so GetCounter() returns nanoseconds

    QueryPerformanceCounter(&li);
    CounterStart = li.QuadPart;
}
double GetCounter()
{
    LARGE_INTEGER li;
    QueryPerformanceCounter(&li);
    return double(li.QuadPart-CounterStart)/PCFreq;
}

int _tmain(int argc, _TCHAR* argv[])
{
    //Can do 100,000 but not 1,000,000
    const int vectorsize = 100000;
    cout.precision(10);

    StartCounter();
    vector<int> test1(vectorsize);
    for(int i=0; i<vectorsize; i++){
        test1[i] = 5;
    }
    cout << GetCounter() << endl << endl;


    StartCounter();
    int test2[vectorsize];
    for(int i=0; i<vectorsize; i++){
        test2[i] = 5;
    }
    cout << GetCounter() << endl << endl;

    cout << test2[0];

    int t = 0;
    cin >> t;
    return 0;
}

  • Look at the assembly code. Chances are, almost everything is optimized out in a release build. – Retired Ninja Dec 15 '12 at 18:19
  • Indeed, it's very likely that your second loop is being completely removed by the compiler. – Mysticial Dec 15 '12 at 18:19
  • When I run in release the second loop takes 0 "nanoseconds". Why is it being removed - because test2 just isn't being used? – user997112 Dec 15 '12 at 18:21
  • @user997112 Correct. It's called [Dead Code Elimination](http://en.wikipedia.org/wiki/Dead_code_elimination). Here's another example: http://stackoverflow.com/questions/8841865/how-does-gcc-optimize-c-code – Mysticial Dec 15 '12 at 18:22 (one way to keep the loop from being eliminated is sketched after these comments)
  • I'm unable to reproduce your numbers. Both times are coming to `0` for me. The compiler is generating a `rep stosd` instruction for both loops, so it's extremely fast. You need an outer loop to make the benchmark take more time. I suspect your benchmark is taking less than a millisecond, in which case it's useless because that's much less than the measurement uncertainty. – Mysticial Dec 15 '12 at 18:30
  • @user997112: The fact that you see this sort of difference probably indicates that you run some sort of debug build, non-optimized (or even deliberately de-optimized) and heavily overloaded with assertions (iterator checks and stuff like that). It makes absolutely no sense to run any comparisons in debug builds. – AnT stands with Russia Dec 15 '12 at 18:33
  • One difference is that the vector elements are first initialized to zero in the constructor, and then assigned the value 5. Try `vector<int> test1(vectorsize, 5);` instead. – Bo Persson Dec 15 '12 at 18:35
  • Now I can reproduce your numbers. +1 for SSCCE. – Mysticial Dec 15 '12 at 18:37
  • @BoPersson that's it! I removed the array/vector creation statements from the timings and the vector is faster?! – user997112 Dec 15 '12 at 18:37
  • The vector allocates its memory on the heap, while the static array is stored on the stack. If you want a fair comparison, you should either move the memory allocation out of the measurement or allocate both blocks on the heap. – Niki Dec 15 '12 at 18:43
  • Array *is* faster than a vector, and it is indeed tricky to benchmark properly. The example used in this test (i.e. copying the same value into all elements sequentially) is simply not good for demonstrating the difference. Here is the older question about the nature of that difference: http://stackoverflow.com/questions/2740020/c-stl-array-vs-vector-raw-element-accessing-performance. By tailoring your test case to better exploit that difference you can easily show that array is faster. – AnT stands with Russia Dec 15 '12 at 19:04
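To make the dead-code-elimination point from the comments concrete, here is a minimal sketch (not part of the original posts) of one common defence: read the stored value through a volatile so the compiler cannot fold the loop away, and consume the result afterwards. It is written as a drop-in replacement for the array portion of the question's program; the variable name `value` is just for illustration.

volatile int value = 5;              // a volatile read per iteration cannot be optimized out
int test2[vectorsize];

StartCounter();
for(int i=0; i<vectorsize; i++){
    test2[i] = value;                // the loop now performs an observable volatile read every pass
}
cout << GetCounter() << endl << endl;

cout << test2[0];                    // also consume the result, as the original code does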

1 Answer

12

It depends on what you are comparing.

Your benchmark measures both setup time and access time together. There's no doubt that std::vector has the more expensive setup: it needs to allocate memory and then (as the standard requires) construct all the elements, which for a POD type like int means zeroing them.
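As a small illustration (not from the original answer, with illustrative names n, v and a), the setup difference is visible even without a timer: the vector constructor writes every element once before the benchmark loop writes it again, while a raw heap allocation does not.

#include <vector>

int main()
{
    const int n = 100000;

    std::vector<int> v(n);   // allocates AND zero-initializes: n writes happen here
    int* a = new int[n];     // allocates only: the elements are left uninitialized

    for(int i=0; i<n; i++){
        v[i] = 5;            // second write to this memory
        a[i] = 5;            // first write to this memory
    }

    delete[] a;
    return 0;
}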

So if you're trying to measure access times, then no, your benchmark isn't accurate.

Here's some numbers to digest:

Original Code:

StartCounter();
vector<int> test1(vectorsize);

for(int i=0; i<vectorsize; i++){
    test1[i] = 5;
}
cout << GetCounter() << endl << endl;

Time: 444353.5206


Start timing after declaring and initializing the vector:

vector<int> test1(vectorsize);

StartCounter();
for(int i=0; i<vectorsize; i++){
    test1[i] = 5;
}
cout << GetCounter() << endl << endl;

Time: 15031.76101


And for the array:

StartCounter();
int test2[vectorsize];
for(int i=0; i<vectorsize; i++){
    test2[i] = 5;
}
cout << GetCounter() << endl << endl;

Time: 38129.345

The times are about the same regardless of whether the declaration is timed. This is likely because stack allocation is done all at once upon entry to the function.


Basically, the vector memory allocation and initialization is taking a disproportionate amount of time. But the actual loop is fast.

I'll also note that your current benchmark framework is still slightly flawed. You only make one pass over each array, so cache effects and lazy allocation will be significant.

The reason why the array is now slower is likely due to lazy-allocation. The array is allocated, but it hasn't been committed yet. Lazy allocation means that it is committed upon first access - which involves a page-fault and a context-switch to the kernel.
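A quick way to see this effect (a sketch, not part of the original answer, reusing the question's StartCounter/GetCounter helpers and vectorsize; the buffer name `fresh` is just for illustration) is to time two passes over a freshly allocated heap buffer: the first pass pays for the page faults, the second does not.

int* fresh = new int[vectorsize];    // allocated, but the pages are not committed yet

StartCounter();
for(int i=0; i<vectorsize; i++){
    fresh[i] = 5;                    // first touch: page faults and commits happen here
}
cout << "first pass:  " << GetCounter() << endl;

StartCounter();
for(int i=0; i<vectorsize; i++){
    fresh[i] = 5;                    // pages already committed: only the stores remain
}
cout << "second pass: " << GetCounter() << endl;

delete[] fresh;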


Here's a fairer test with an outer loop to increase the benchmark time:

vector<int> test1(vectorsize);

StartCounter();
for (int c = 0; c < 10000; c++){
    for(int i=0; i<vectorsize; i++){
        test1[i] = 5;
    }
}
cout << GetCounter() << endl << endl;

Time: 227330454.6

int test2[vectorsize];
memset(test2,0,sizeof(test2));

StartCounter();
for (int c = 0; c < 10000; c++){
    for(int i=0; i<vectorsize; i++){
        test2[i] = 5;
    }
}
cout << GetCounter() << endl << endl;
cout << test2[0];

Time: 212286228.2

So no, an array is NOT faster than a vector for steady-state access. It's just tricky to benchmark properly.
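For what it's worth, here is a sketch of the same steady-state comparison using std::chrono instead of QueryPerformanceCounter (assuming a C++11 compiler is available; the structure mirrors the fairer test above, and n, passes, v and a are illustrative names). Printing a checksum at the end gives the optimizer a reason to keep the buffers, although a determined compiler may still turn the inner loops into memset-style code.

#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    const int n = 100000;
    const int passes = 10000;

    std::vector<int> v(n);       // setup (allocation + zeroing) kept outside the timed region
    int* a = new int[n]();       // value-initialized so the pages are committed up front

    auto t0 = std::chrono::steady_clock::now();
    for(int c = 0; c < passes; c++)
        for(int i = 0; i < n; i++) v[i] = 5;
    auto t1 = std::chrono::steady_clock::now();
    for(int c = 0; c < passes; c++)
        for(int i = 0; i < n; i++) a[i] = 5;
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::cout << "vector: " << std::chrono::duration_cast<ms>(t1 - t0).count() << " ms\n";
    std::cout << "array:  " << std::chrono::duration_cast<ms>(t2 - t1).count() << " ms\n";
    std::cout << (v[0] + a[0]) << "\n";   // consume the results

    delete[] a;
    return 0;
}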

Mysticial
  • What about starting the array counter after the declaration as well, so that both loops start the counter at the same point in time? – Syntactic Fructose Dec 15 '12 at 18:44
  • @Need4Sleep I did it both ways. There was no performance difference. That's because the array is on the stack and is allocated upon entry to the function. – Mysticial Dec 15 '12 at 18:45
  • Removing both initializations from the timing, why would the vector then be faster than the static array loop? I'm looking at the release assembler and the static array loop only has 2 MOVs, 1 LEA and 1 rep stos instruction? – user997112 Dec 15 '12 at 18:45
  • @user997112 That's probably because the array isn't paged in yet due to lazy allocation. [So the OS needs to zero it upon committing it due to security reasons.](http://stackoverflow.com/questions/8029584/why-does-malloc-initialize-the-values-to-0-in-gcc) – Mysticial Dec 15 '12 at 18:46
  • I actually just ran the benchmark a few more times and the results are very chaotic. The benchmark is simply too short. But overall these numbers here are representative of the average case. – Mysticial Dec 15 '12 at 18:51
  • @Mysticial. Thanks. I just tested one last thing. If I create an int array on the heap using int* test3 = new int[vectorsize] and then do the same test, with the timing not including initialization I get much slower performance. The assembler output shows no instructions for the FOR loop?? – user997112 Dec 15 '12 at 18:52
  • @user997112 Again it goes back to lazy allocation. Until you access the memory the first time, it is not really "allocated". So it's customary to manually zero all newly allocated memory to remove this effect. – Mysticial Dec 15 '12 at 18:57
  • Anyways, I have a final in half an hour. So I won't be able to respond for a few hours. – Mysticial Dec 15 '12 at 18:58
  • Array *is* faster than a vector, and it is indeed tricky to benchmark properly. The example used in this test (i.e. copying the same value into all elements sequentially) is simply not good for demonstrating the difference. Here is an older question about the nature of that difference: http://stackoverflow.com/questions/2740020/c-stl-array-vs-vector-raw-element-accessing-performance. By tailoring your test case to better exploit that difference you can easily show that array is faster. – AnT stands with Russia Dec 15 '12 at 19:06
  • @AndreyT how difficult would it be to create a vector class which does not initialize values (as seen in this example if the timing includes the vector creation)? – user997112 Dec 15 '12 at 19:11 (one possible approach is sketched after these comments)
  • Also, it is not practically possible for a vector to be faster than an array. If your results indicate that vector is faster, it is just a random fluke in your measurements. – AnT stands with Russia Dec 15 '12 at 19:11
  • @user997112: There's nothing difficult about it. Just don't initialize the values in your vector class and they will remain uninitialized. `std::vector`, unfortunately, unconditionally initializes the values. With `int` values and `std::vector` you can try a hack: instead of setting the vector's *size* try setting its *capacity* (keeping the size at `0`). This will technically allocate an *uninitialized* array. But keep in mind that in debug builds any attempts to access beyond the current size (which must stay at `0`) will probably trigger assertions. – AnT stands with Russia Dec 15 '12 at 19:13
  • @AndreyT I'll take a closer look at your answer after my exam, since at first glance I'm not convinced that an extra move (which I'm not 100% sure is even necessary) will make a difference in any real application. I'll also update my answer to mention that the setup cost of a vector is significant if that matters in an application that does little raw access. – Mysticial Dec 15 '12 at 19:28
  • @AndreyT Okay, that extra move in your answer only applies to the *first* access. After that, any self-respecting compiler will keep it in register and thus it effectively becomes a pointer. So we're both right. There *is* an extra move - but it's a startup cost. The steady-state is still the same. – Mysticial Dec 15 '12 at 22:33
  • @Mysticial: Demand-zero pages aren't actually zeroed on demand. The kernel zeroes them in the background and just allocates one from the previously background-zeroed pages. The cost you're thinking of is the 12000 cycles to interrupt, go through the paging code, add a new page into the process address space, flush the MMU and return to usermode. Also, it's quite unlikely that this is the case here. Stack pages are almost always paged in, as are heap allocations you just obtained. std::vector doesn't allocate from the kernel; it allocates from the heap. – SecurityMatt Dec 16 '12 at 00:47
  • @SecurityMatt That's interesting. Which seems to run contrary to what I've seen. When I allocate + commit 60GB on my workstation, it takes a full 30 seconds. But freeing it takes 5 seconds. Could this be OS dependent? – Mysticial Dec 16 '12 at 01:10
  • @Mysticial If you overallocate you might run the kernel zero pool down to empty, at which point you might find your thread descheduled until the pool has replenished. A better test is to time a one-page alloc versus a one-page free, with a thread sleep between test instances to give the zero pool time to replenish. This behaviour has been the case since at least Windows 2000 and is documented in the Windows Internals series of books. – SecurityMatt Dec 16 '12 at 21:15
  • @SecurityMatt Ah ic. That would explain it - since 60GB (out of a 64GB machine) is likely to empty the pool. I guess I learned something new today. :) – Mysticial Dec 16 '12 at 21:18
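Following up on the question above about a vector class that does not initialize its values: one C++11 approach (my own sketch, not something posted in this thread; the name default_init_allocator is just a label I chose) is an allocator whose construct() performs default-initialization rather than value-initialization, so a std::vector of ints can be sized without zeroing its elements.

#include <memory>
#include <new>
#include <utility>
#include <vector>

// Sketch: an allocator that default-initializes elements (leaving ints
// uninitialized) instead of value-initializing them. Assumes C++11.
template <class T, class Base = std::allocator<T>>
struct default_init_allocator : Base
{
    using Base::Base;

    template <class U>
    struct rebind
    {
        using other = default_init_allocator<
            U, typename std::allocator_traits<Base>::template rebind_alloc<U>>;
    };

    template <class U>
    void construct(U* p)                      // no arguments: default-initialization only
    {
        ::new (static_cast<void*>(p)) U;
    }

    template <class U, class... Args>
    void construct(U* p, Args&&... args)      // anything else is forwarded unchanged
    {
        std::allocator_traits<Base>::construct(
            static_cast<Base&>(*this), p, std::forward<Args>(args)...);
    }
};

// A vector whose sized constructor/resize does not zero its int elements:
// std::vector<int, default_init_allocator<int>> test(vectorsize);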