Believe it or not, it can make a huge difference (even accounting for cache-hits making access of the same element over and over very efficient).
Here's some test code:
#include <vector>
#include <cstdint>
#include <iostream>
// Read the CPU time-stamp counter (x86/x86-64 only; GCC/Clang inline asm).
// RDTSC delivers the 64-bit counter split across EDX:EAX.
// NOTE: RDTSC is not a serializing instruction, so out-of-order execution
// can skew very short measurements; for multi-second loops like the ones
// below that skew is negligible.
static inline std::uint64_t RDTSC()
{
// Use the fixed-width std:: types consistently (bare uint64_t is not
// guaranteed to be declared by <cstdint>).
std::uint32_t hi, lo;
__asm__ volatile("rdtsc" : "=a" (lo), "=d" (hi));
return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
// V1 (naive): re-reads v[5] from the vector on every iteration before
// appending it to ov.
// `iterations` generalizes the original hard-coded 100,000,000 loop count;
// the default preserves the original behavior for existing callers.
void v1(const std::vector<double> & v, std::vector<double> & ov, long long iterations = 100000000)
{
for(long long i=0 ; i<iterations ; ++i)
ov.push_back(v[5]);
}
// V2: reads v[5] once into a local, then appends the local each iteration —
// the access pattern being benchmarked against V1.
// `iterations` generalizes the original hard-coded 100,000,000 loop count;
// the default preserves the original behavior for existing callers.
void v2(const std::vector<double> & v, std::vector<double> & ov, long long iterations = 100000000)
{
auto fixed_var = v[5]; // single read, hoisted out of the loop
for(long long i=0 ; i<iterations ; ++i)
ov.push_back(fixed_var);
}
// V3: identical to V2 except the hoisted local is declared const double —
// included to check whether const-ness changes codegen/timing (it doesn't,
// per the results quoted below the code).
// `iterations` generalizes the original hard-coded 100,000,000 loop count;
// the default preserves the original behavior for existing callers.
void v3(const std::vector<double> & v, std::vector<double> & ov, long long iterations = 100000000)
{
const double fixed_var = v[5]; // single read, hoisted and const
for(long long i=0 ; i<iterations ; ++i)
ov.push_back(fixed_var);
}
// Thrash the CPU caches by repeatedly overwriting a buffer far larger than
// L1/L2/L3, so the next timed run starts cold.
// `cache_size` (bytes) and `passes` generalize the original hard-coded
// 256 MiB / 48 passes; the defaults preserve the original behavior.
void flush_cache(long long cache_size = 256LL*1024*1024, int passes = 48)
{
//Flush L1 and L2 cache by thrashing it with garbage.
//std::vector replaces the original raw new[]/delete[] (RAII, exception safe).
std::vector<char> garbage(static_cast<std::vector<char>::size_type>(cache_size));
for(int i=0 ; i < passes; ++i)
{
for(long long j=0 ; j<cache_size ; j++)
// The original stored i*j, which is computed in int and overflows
// (signed overflow = UB) once i*j exceeds 2^31-1 — reached well
// within 48 passes over 256 MiB. i+j is computed in long long,
// never overflows, and serves the same "write varying garbage" role.
garbage[static_cast<std::vector<char>::size_type>(j)] = static_cast<char>(i + j);
// Read one byte back through a volatile so the compiler cannot prove
// the stores are dead and delete the whole thrashing loop.
if (!garbage.empty())
{
volatile char sink = garbage[garbage.size() / 2];
(void)sink;
}
}
std::cout << "flushed cache\n";
}
// Benchmark driver: builds a 10M-element input vector, then times V1
// (naive re-read), V2, and V3 (hoisted read) with RDTSC, flushing the
// cache between runs so each starts cold.  V2/V3 are reported as a
// fraction of V1's cycle count.
int main(void)
{
std::vector<double> v;
std::vector<double> ov;
v.reserve(10000000); // one allocation instead of repeated regrowth
for(int i=0 ; i<10000000 ; ++i)
{
// BUG FIX: the original pushed i/(i+1000000) — pure integer division,
// which is 0 for every i (since i < i+1000000), so every element was
// 0.0.  Promote to double first to get the intended fraction in [0,1).
v.push_back(static_cast<double>(i)/(i+1000000));
}
//try v1 (baseline)
auto start = RDTSC();
v1(v,ov);
auto end = RDTSC();
auto v1t = end-start; // baseline cycle count; V2/V3 are normalized by this
std::cout << "V1: 1.0x\n";
//flush and clear (clear() keeps capacity, so later runs reallocate less)
ov.clear();
flush_cache();
//try v2
start = RDTSC();
v2(v,ov);
end = RDTSC();
auto v2t = end-start;
std::cout << "V2: " << ((double)v2t)/v1t << "x\n";
//flush and clear
ov.clear();
flush_cache();
//try v3
start = RDTSC();
v3(v,ov);
end = RDTSC();
auto v3t = end-start;
std::cout << "V3: " << ((double)v3t)/v1t << "x\n";
}
We see there's actually a huge difference between the naive approach and using a variable:
V1: 1.0x
flushed cache
V2: 0.221311x
flushed cache
V3: 0.222199x
I double checked the assembly to make sure stuff wasn't being optimized away since we weren't using the result.
We're also using the RDTSC instruction so timings are measured directly in CPU cycles rather than wall-clock time.
The difference between V2 and V3 isn't significant: they trade places on any given run. But it's definitely a 70-80% speed improvement not accessing the array each time.
Note that if you don't flush the L1/L2 cache between runs and run V1 after V2 and V3 they'll all have similar timings. Thus, it's really important to flush the cache.
Some more testing reveals that g++ and c++ will not optimize this away (and just store the value in a register) but Intel C++ will. Go figure...