
I'm trying to measure the efficiency of a program which I recently posted on Stack Overflow.

How to efficiently delete elements from a vector given an another vector

To compare the efficiency of my code with the other answers, I'm using the <chrono> library.

Is it a correct way to check the runtime efficiency?

If not, then kindly suggest a way to do it, with an example.

Code on Coliru

#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include <ctime>
using namespace std;

void remove_elements(vector<int>& vDestination, const vector<int>& vSource) 
{
    if(!vDestination.empty() && !vSource.empty())
    {
        for(auto i: vSource) {
            vDestination.erase(std::remove(vDestination.begin(), vDestination.end(), i), vDestination.end());
        }
    }
}

int main() {
    vector<int> v1={1,2,3};
    vector<int> v2={4,5,6};
    vector<int> v3={1,2,3,4,5,6,7,8,9};
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    remove_elements(v3,v1);
    remove_elements(v3,v2);
    std::chrono::steady_clock::time_point end= std::chrono::steady_clock::now();
    std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() <<std::endl;
    for(auto i:v3)
        cout << i << endl;
    return 0;
}

Output

Time difference = 1472
7
8
9
  • Read this on how the `chrono` calls can be reordered, invalidating the measurements: http://stackoverflow.com/q/37786547/3747990 – Niall Aug 25 '16 at 09:18
  • I think you would get better results executing the same code multiple times and then taking the average. – Neijwiert Aug 25 '16 at 09:18
  • For correct measurement you need at least three things: 1) warm-up (do many similar operations before measuring), 2) average the results of many calls, 3) avoid over-optimization by the compiler (if it can detect that you don't use the results of the calculations, it may just skip them). – Ilya Aug 25 '16 at 09:21
  • Performance measurements are not easy. Consider using a library such as [nonius](https://nonius.io/) to get it right with much less effort. – nwp Aug 25 '16 at 09:38
  • First of all, write yourself a function that fills the vectors with random data, perhaps with v1 and v2 extracted somehow so that they are subsets, depending on what you need. Then fill them with a ton of data, not with 9 elements as you did. 9 is nothing; your timing will be far too imprecise with numbers that small. Go into the millions or billions, basically the largest number your RAM can hold without writing to disk and for which you have the patience to await the results. You might also want to wrap that in a for-loop to do several experiments in one go. – Aziuth Aug 25 '16 at 10:14
  • I would suggest putting this on programmers.se, or why exactly does it fit here? – dhein Aug 25 '16 at 12:53
  • Invasive measurement frameworks don't actually work for measuring performance. You should use a non-invasive tool like Linux perf or Intel VTune to measure real performance without potentially getting in the compiler's way. – Steve Cox Aug 25 '16 at 15:03
  • @Steve Cox could you clarify or back that up, please? I don't understand the basis of your assertion. – boycy Sep 06 '16 at 07:52

3 Answers


Is it a correct way to check the runtime efficiency?

It looks like this is not the best way to do it. I see the following flaws in your method:

  1. Values are sorted. Branch prediction may expose ridiculous effects when testing sorted vs. unsorted values with the same algorithm. Possible fix: test on both sorted and unsorted and compare results.
  2. Values are hard-coded. CPU cache is a tricky thing and it may introduce subtle differences between tests on hard-coded values and real-life ones. In a real world you are unlikely to perform these operations on hard-coded values, so you may either read them from a file or generate random ones.
  3. Too few values. Your code's execution time is much smaller than the timer precision.
  4. You run the code only once. If you fix all other issues and run the code twice, the second run may be much faster than the first one due to cache warm-up: subsequent runs tend to have less cache misses than the first one.
  5. You run the code once on fixed-size data. It would be better to run otherwise correct tests at least four times to cover a Cartesian product of the following parameters:
    • sorted vs. unsorted data
    • v3 fits CPU cache vs. v3 size exceeds CPU cache. Also consider cases when (v1.length() + v3.length()) * sizeof(int) fits the cache or not, (v1.length() + v2.length() + v3.length()) * sizeof(int) fits the cache or not and so on for all combinations.
Sergey
  • Your notes on the use of the Cartesian product are excellent. – Niall Aug 25 '16 at 09:36
  • A good link for 2) may be [this](http://stackoverflow.com/questions/11413855/why-is-transposing-a-matrix-of-512x512-much-slower-than-transposing-a-matrix-of). – Ruslan Aug 25 '16 at 09:59
  • Keep in mind there are multiple caches in a CPU. L1 caches are in the neighborhood of 128 KB, requiring over 32,000 elements to fill. L2 caches are around 1 MB in size, so you'd need over 250 thousand elements in the list to exceed them. It may also be interesting to exceed an 8 MB L3 cache and see the effect of going out to RAM on performance. – pensono Aug 25 '16 at 18:23

The biggest issues with your approach are:

1) The code you're testing is too short and predictable. You need to run it at least a few thousand times so that there are at least a few hundred milliseconds between measurements. And you need to make the data set larger and less predictable. In general, CPU caches really make accurate measurements based on synthetic input data a PITA.

2) The compiler is free to reorder your code. In general it's quite difficult to ensure the code you're timing will execute between calls to check the time (and nothing else, for that matter). On the one hand, you could dial down optimization, but on the other you want to measure optimized code.

One solution is to turn off whole-program optimization and put timing calls in another compilation unit.

Another possible solution is to use a memory fence around your test, e.g.

    std::atomic_thread_fence(std::memory_order_seq_cst);

(requires #include <atomic> and a C++11-capable compiler).

In addition, you might want to supplement your measurements with profiler data to see how efficiently the L1/L2/L3 caches are used, where the memory bottlenecks are, the instruction retire rate, etc. Unfortunately, the best tool for that on Intel x86 is commercial (VTune), but on AMD x86 a similar tool is free (CodeXL).

rustyx

You might consider using a benchmarking library like Celero to do the measurements for you and deal with the tricky parts of performance measurements, while you remain focused on the code you're trying to optimize. More complex examples are available in the code I've linked in the answer to your previous question (How to efficiently delete elements from a vector given an another vector), but a simple use case would look like this:

BENCHMARK(VectorRemoval, OriginalQuestion, 100, 1000)
{
    std::vector<int> destination(10000);
    std::generate(destination.begin(), destination.end(), std::rand);
    std::vector<int> source;  // values to remove, sampled from destination
    std::sample(destination.begin(), destination.end(), std::back_inserter(source),
        100, std::mt19937{std::random_device{}()});  // std::sample requires C++17

    for (auto i : source)
        destination.erase(std::remove(destination.begin(), destination.end(), i),
            destination.end());
}
Tomasz Lewowski