
Results:

Vector time: 7051

Array time: 18944

I used MSVC release mode for this, compiled as 32 bit.

Before this test I was looking at the GCC source code for vector and was surprised, because I thought operator[] checked for out-of-bounds access, but it doesn't. However, I was not expecting the vector to come out faster than the array!

Complete code:

#include <intrin.h>  // __rdtsc (MSVC; on GCC/Clang use <x86intrin.h>)
#include <iostream>
#include <vector>

int main(){
    const int size = 10000;
    unsigned long long my_array[size];
    std::vector<unsigned long long> my_vec;
    
    my_vec.resize(size);

    //Populate containers
    for(int i=0; i<size; i++){
        my_vec[i] = i;
        my_array[i] = i;
    }

    //Initialise test variables
    unsigned long long sum = 0;
    unsigned long long time = 0;
    unsigned long long start = 0;
    unsigned long long finish = 0;

    //Time the vector
    start = __rdtsc();
    for(int i=0; i<size; i++){
        sum += my_vec[i];
    }
    finish = __rdtsc();


    time = finish - start;
    std::cout << "Vector time: " << time << "     " << sum << std::endl;


    sum = 0;

    //Time the array
    start = __rdtsc();
    for(int i=0; i<size; i++){
        sum += my_array[i];
    }
    finish = __rdtsc();

    time = finish - start;
    std::cout << "Array time: " << time << "     " << sum << std::endl;

    int t = 8;
    std::cin >> t;
    return 0;
}
user997112
  • When I test it the array is always faster. Vector time: 83755, Array time: 69753. Of course the values vary between executions, but not much. – maddin45 Apr 06 '14 at 22:40
  • Nearly identical performance [on ideone](https://ideone.com/iv5fzb)... – Kerrek SB Apr 06 '14 at 22:41
  • Try to revert those tests. Surprise! – dyp Apr 06 '14 at 22:43
  • Theoretically there shouldn't be a difference as the vector just wraps a dynamically-allocated array and there's no further run-time checks on `operator[]`. Maybe it'd be illuminating to compare the generated assembly code for the two loops. – M.M Apr 06 '14 at 22:44
  • For the array there were quite a few memory operations between the values being touched and the measures being made. For the vector the values were just touched, i.e., they have a higher likelihood to be in cache. Reverse the measurements and I'd expect them to be rather similar. – Dietmar Kühl Apr 06 '14 at 22:44
  • I ran the test 5 times and the vector was always slower by same ratio..... – user997112 Apr 06 '14 at 22:44
  • @dyp Reverting those tests did not change anything for me. – maddin45 Apr 06 '14 at 22:45
  • @user997112: The point is that you should try to time the array before you time the vector. – celtschk Apr 06 '14 at 22:45
  • @DietmarKühl forgive me if this is asking the obvious- but why are there different memory operations between the two? Aren't they both accessing an array element via a base address and offset? – user997112 Apr 06 '14 at 22:45
  • @maddin45: on your system there wasn't that much of a difference in the first place! – Dietmar Kühl Apr 06 '14 at 22:45
  • On my PC, using VS2013, the second test was always a bit slower and sometimes really slow (10x). Didn't matter if that was the array or the vector test. – dyp Apr 06 '14 at 22:47
  • @DietmarKühl Yes but the difference was always of the same ratio. Even when I used ten times the number of elements the ratio between vector and array were the same as in the results I posted above. – maddin45 Apr 06 '14 at 22:47
  • @user997112: memory is highly layered in modern CPUs. The fastest to access memory (CPU cache) is rather small (e.g., just 32kB). If the memory is in the cache, it is just faster to get than if it is on the next level. Since there is little space, it is likely that only the most recent values are cached and others need to be accessed. – Dietmar Kühl Apr 06 '14 at 22:47
  • @user997112: In the case of the vector, you immediately read it after filling it. In the case of the array, there's a whole vector-reading loop in between the array filling and the array reading. – celtschk Apr 06 '14 at 22:48
  • @dyp yup I agree- I had never thought of that. I will try and factor that in to my test! – user997112 Apr 06 '14 at 22:49
  • So basically- the cache is still full of the data from the vector and so a lot of cache misses occur? – user997112 Apr 06 '14 at 22:49
  • @user997112: That's the most likely explanation, yes. – celtschk Apr 06 '14 at 22:50
  • @user997112: it would be a plausible explanation that there are more cache misses on the array timing than on the vector timing. Whether that is the case, would require looking at the processor statistics. – Dietmar Kühl Apr 06 '14 at 22:51
  • @maddin45: I could imagine that your CPU uses a different memory layout, e.g., has a much larger cache. Without looking at the processor statistics I can't explain the results you quote. – Dietmar Kühl Apr 06 '14 at 22:53
  • Perhaps a better way to test which is faster than the other is to create two separate programs, where one is strictly a `standard C array` and the other is strictly a `std::vector`. Both programs have to run exactly the same tests algorithmically speaking. – CPlusPlus OOA and D Apr 06 '14 at 22:53
  • Thinking about it, there's also the output in between, which will involve function calls (and even OS calls), and thus is another thing which reasonably might throw the array out of the cache. – celtschk Apr 06 '14 at 22:57
  • You can't trust TSC measurements unless you pin the process to a specific core, and even then you need to make sure the intrinsic you're using also emits an instruction (e.g. CPUID) that prevents instruction reordering in the execution pipeline (so the code you're trying to measure is serialised relative to the TSC read). This benchmark can't be trusted. – Tony Delroy Apr 06 '14 at 23:00
  • @TonyD what do you mean regarding a CPUID? – user997112 Apr 06 '14 at 23:15
  • CPUID is another unrelated assembly code instruction on x86 that happens to have execution pipeline reordering guarantees. See http://stackoverflow.com/questions/2918113/cpuid-before-rdtsc – Tony Delroy Apr 06 '14 at 23:57

2 Answers


The following is using MSVC 2013.

For the vector:

0019138E  mov         edi,edi  
  for (int i = 0; i<size; i++){
00191390  lea         ecx,[ecx+20h]  
    sum += my_vec[i];
00191393  movdqu      xmm0,xmmword ptr [ecx-20h]  
00191398  paddq       xmm1,xmm0  
0019139C  movdqu      xmm0,xmmword ptr [ecx-10h]  
001913A1  paddq       xmm2,xmm0  
001913A5  dec         esi  
001913A6  jne         main+0F0h (0191390h)  
  }

For the array:

0019142D  lea         ecx,[ecx]  
  for (int i = 0; i<size; i++){
00191430  lea         ecx,[ecx+20h]  
    sum += my_array[i];
00191433  movdqu      xmm0,xmmword ptr [ecx-30h]  
00191438  paddq       xmm1,xmm0  
0019143C  movdqu      xmm0,xmmword ptr [ecx-20h]  
00191441  paddq       xmm2,xmm0  
00191445  dec         esi  
00191446  jne         main+190h (0191430h)  
  }

As you can see, the inner loops are identical. Suspecting a hardware effect, I swapped the two loops around, and the array came out faster by the same margin (so in reality neither is faster or slower than the other).

I suspect this is some kind of CPU cache behaviour: https://en.wikipedia.org/wiki/CPU_cache

kvanbere
  • The cache will be full of the contents of the first data structure- so when the second executes it encounters a load of cache misses – user997112 Apr 06 '14 at 22:50
  • The first data structure will also experience misses (from other userspace programs, or whatever), so it's not a direct result of that. I am following up with some more info as soon as I confirm it. – kvanbere Apr 06 '14 at 22:54
  • I thought this may have been funnybugger'ey stuff to do with the lines like `my_vec[i] = i;` and their ordering's effect on the cache, but I was unable to confirm it. I couldn't confirm it was a CPU node/cache mode (or similar) problem either, as both pages are read/write (which I confirmed in the assembly). – kvanbere Apr 06 '14 at 23:04
  • I just tried vector, array, vector and I got: 9028, 13222, 11502 (and when i repeat i get similar results where the first is always much faster than second 2) – user997112 Apr 06 '14 at 23:31

We have two arrays of 80,000 bytes each. First, 160,000 bytes are filled with data in parallel. Then 80,000 of these are read, then the other 80,000 are read. Assuming a 128,000 byte cache:

When reading the first 32,000 bytes of the vector, the data is not cached; the next 48,000 bytes are. Now the cache contains all of the vector, plus the last 48,000 bytes of the array. But the array's bytes are the oldest, so as the array is read from the start, the data at its end is thrown out. So all reads from the array are uncached.

We therefore have 32,000 bytes of uncached reads and 48,000 bytes of cached reads for the vector, vs. 80,000 bytes of uncached reads for the array.

That's for a cache size of 128,000 bytes; it will be different for other sizes. But entirely different things could happen as well. Your code could be switched to a different processor core while it is running, at which point one core might have to write the data to main memory so the other core can read it. Or the OS might have just realised that there is some activity going on and switched the processor from power-saving mode into some turbo mode.

Making a single measurement and drawing conclusions from it takes none of these things into account.

gnasher729
  • But the vector is on the heap, so the array and vector do not share the same pages. Thus there is no cache sharing between them. Modifying the code like so: `unsigned long long *my_array = new unsigned long long[size];` does **NOT** alter the performance. – kvanbere Apr 06 '14 at 23:06
  • @kvanberendonck: gnasher729 didn't claim there was any cache sharing, apart from the fact that of course all processes share the same complete processor cache. – celtschk Apr 06 '14 at 23:11
  • I realize now that I misunderstood, but my interpretation is he now reasons the cache is "miss-less"/fresh when the process starts. Is this the case? Wouldn't the cache be filled out from everything else running on the computer so there would be equal misses in both the first and second loops? – kvanbere Apr 06 '14 at 23:15
  • Well, his (implicit) assumption is that there's no interrupt during the run time of the complete code block (so nothing interferes with the cache in between), and he also ignores the output call in between. But he doesn't need to assume a clean cache in the beginning because the loop that *fills* both the array and the vector (which runs before the timing starts) will push anything out of the cache which had been there when reaching the code block. – celtschk Apr 06 '14 at 23:19
  • Puzzling, deleting the loop at the top which fills the arrays (bringing them into the cache) has no effect on timing. Also tried adding a `std::this_thread::yield();` between the two loops to rule out the cost of OS factors, but that also appears to yield no difference. – kvanbere Apr 06 '14 at 23:23
  • Maybe moving the first output after the second loop does? (Of course saving `time` in another variable and then outputting that) – celtschk Apr 06 '14 at 23:25
  • Doing so invokes loop fusion: the two loops are fused together, resulting in ridiculous timings (i.e., array = 2, vector = 20k). – kvanbere Apr 06 '14 at 23:30
  • If you make `time` volatile, this might hinder loop fusion. – celtschk Apr 06 '14 at 23:32
  • I just tried vector, array, vector and I got: 9028, 13222, 11502 (and when i repeat i get similar results where the first is always much faster than second 2) – user997112 Apr 06 '14 at 23:32