
We were developing a piece of code that checks whether a user should be denied entry to a sector within a given time period. One of my colleagues created a function, isAllowed in the code below, which uses several comparisons; I took a different approach with the function isAllowed2, which works with the number of seconds between time periods.

At first we had no doubt that his function would be faster, but that turns out not to be true when actually running the code and comparing the speeds. Even though the difference is something we can completely ignore, we would like to know why the one that "should" be faster is in fact slower.

Considering the following code:

#include <iostream>
#include <chrono>

using namespace std;
using namespace std::chrono;

struct timing {
    short hour;
    short minute;
};

bool isAllowed(timing &from, timing &to, timing &actual) {
    return !(((from.hour > to.hour && (actual.hour >= from.hour || actual.hour <= to.hour)) ||
        (actual.hour >= from.hour && actual.hour <= to.hour)) &&
        !(actual.minute > from.minute && actual.minute < to.minute));
}

long getSecs(short hour, short minutes) {

    return (hour * 3600) + (minutes * 60);

}

bool isAllowed2(timing &from, timing &to, timing &current) {

    long secsFrom = getSecs(from.hour, from.minute);
    long secsTo = getSecs(to.hour, to.minute);
    long secsCurrent = getSecs(current.hour, current.minute);

    if (secsFrom > secsTo) secsTo += 24 * 60 * 60;
    if (secsCurrent > secsFrom && secsCurrent < secsTo) {
        return false;
    }

    return true;
}

int main() {
    //debug messages
    std::string okay = " - ok";
    std::string error = " - er";

    std::string receive = " - allowed";
    std::string notReceive = " - denied";

    //testing times
    int const testDataCount = 5;
    timing from[testDataCount] = {
        { 16, 30 },
        { 8,  30 },
        { 10, 30 },
        { 0, 30 },
        { 0, 0 }
    };
    timing to[testDataCount] = {
        { 8,  30 },
        { 20, 0 },
        { 20, 0 },
        { 6, 0 },
        { 7, 0 }
    };

    for (int i = 0; i < testDataCount; i++) {
        std::cout << i + 1 << ": " << from[i].hour << ":" << from[i].minute << " to " << to[i].hour << ":"
            << to[i].minute << std::endl;
    }

    //test current times
    timing current[5] = {
        { 12, 0 },
        { 23, 0 },
        { 17, 30 },
        { 15, 12 },
        { 0, 20 }
    };

    bool ergValues[][testDataCount] = {
        { true,  false, false, true, true },
        { false, true,  true, true, true },
        { false, false, false, true, true },
        { true,  false, false, true, true },
        { false,  true, true, true, false }
    };

    long totalNs1 = 0;
    long totalNs2 = 0;

    for (int i = 0; i < 4; i++) {
        std::cout << std::endl << i + 1 << ". Test: " << current[i].hour << ":" << current[i].minute << std::endl;
        for (int j = 0; j < testDataCount; j++) {

            high_resolution_clock::time_point t1 = high_resolution_clock::now();
            bool response = isAllowed(from[j], to[j], current[i]);
            high_resolution_clock::time_point t2 = high_resolution_clock::now();

            high_resolution_clock::time_point t3 = high_resolution_clock::now();
            bool response2 = isAllowed2(from[j], to[j], current[i]);
            high_resolution_clock::time_point t4 = high_resolution_clock::now();

            long ns1 = duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
            totalNs1 += ns1;
            long ns2 = duration_cast<std::chrono::nanoseconds>(t4 - t3).count();
            totalNs2 += ns2;

            std::cout << j + 1 << "\t\t:1:" << ns1 << "ns: " << response << (response == ergValues[i][j] ? okay : error) << "\t\t:2:" << ns2 << "ns: " << response2 << (response2 == ergValues[i][j] ? okay : error) << "\t\t"
                << (ergValues[i][j] ? receive : notReceive) << std::endl;
        }
    }

    std::cout << "\r\ntotalNs1 = " << totalNs1 << "\r\ntotalNs2 = " << totalNs2 << "\r\n\r\n";

    return 0;
}

The results obviously differ between runs, but no matter what, totalNs2 is always smaller than totalNs1.

Ex:

totalNs1 = 38796
totalNs2 = 25913

I tested this on an AMD Phenom II X4 and an Intel i7-3770, both on Windows 10 and Debian 8, with quite similar results.

So, finally, the question is: why is the function isAllowed2 faster than isAllowed?

Note: This is mostly a curiosity question. If someone thinks the title or the tags are not the most appropriate, please let me know and I'll change them accordingly. Please excuse any grammar errors, as English is not my native language.


EDIT

Meanwhile, I've been researching further based on all the suggestions and comments, including this incredibly detailed answer. After understanding how inaccurate micro-benchmarking can be (a HUGE thanks to Baum mit Augen for linking this amazing talk, which helped a lot), I ended up using the Google Benchmark library to get more "accurate" results. It seems that the isAllowed function is in fact faster (compiled without optimization), as shown in the output from the library.

Run on (8 X 2395 MHz CPU s)
-----------------------------------------------------------------------
Benchmark                                Time           CPU Iterations
-----------------------------------------------------------------------
BM_isAllowed/2/min_time:2.000           22 ns         22 ns  128000000
BM_isAllowed/4/min_time:2.000           22 ns         22 ns  137846154
BM_isAllowed/8/min_time:2.000           22 ns         22 ns  128000000
BM_isAllowed/16/min_time:2.000          22 ns         22 ns  128000000
BM_isAllowed/22/min_time:2.000          22 ns         22 ns  137846154
BM_isAllowed2/2/min_time:2.000          24 ns         24 ns  112000000
BM_isAllowed2/4/min_time:2.000          24 ns         24 ns  119466667
BM_isAllowed2/8/min_time:2.000          24 ns         24 ns  119466667
BM_isAllowed2/16/min_time:2.000         24 ns         24 ns  119466667
BM_isAllowed2/22/min_time:2.000         24 ns         24 ns  119466667

Note: As Martin Bonner pointed out, the isAllowed function seems to have a logic flaw; don't use it in production code.
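
For reference, here is a minimal comparison-only sketch of what the intended logic might look like. This is an editorial illustration, not code from the original question: isAllowedFixed is a hypothetical name, and it assumes the intended rule is "denied when the current time lies strictly between from and to, where the window may wrap past midnight". It reuses the timing struct defined above:

bool isAllowedFixed(const timing &from, const timing &to, const timing &current) {
    // lexicographic "a is earlier than b" on (hour, minute)
    auto before = [](const timing &a, const timing &b) {
        return a.hour < b.hour || (a.hour == b.hour && a.minute < b.minute);
    };
    const bool wraps = before(to, from);                  // e.g. 16:30 -> 8:30 next day
    const bool inside = wraps
        ? (before(from, current) || before(current, to))  // evening part OR morning part
        : (before(from, current) && before(current, to)); // plain interval
    return !inside;
}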

Below you can find the code I used to do this benchmark; please let me know if there are any flaws in it, as I'm not familiar with the Google library.

Important: this code was compiled with Visual Studio 2015, and optimization should be disabled for the section that we want to test.

#include "benchmark/benchmark.h"

using namespace std;
using namespace benchmark;

#pragma optimize( "[optimization-list]", {on | off} )

volatile const long extraDay = 24 * 60 * 60;

#pragma optimize( "", off )

struct timing {
    short hour;
    short minute;
    timing(short hour, short minute) : hour(hour), minute(minute) {}
};

static void BM_isAllowed(benchmark::State& state) {

    while (state.KeepRunning())
    {
        timing from(state.range(0), state.range(0));
        timing to(state.range(0), state.range(0));
        timing current(state.range(0), state.range(0));

        bool b = !(((from.hour > to.hour && (current.hour >= from.hour || current.hour <= to.hour)) ||
            (current.hour >= from.hour && current.hour <= to.hour)) &&
            !(current.minute > from.minute && current.minute < to.minute));
    }
}

static void BM_isAllowed2(benchmark::State& state) {

    while (state.KeepRunning())
    {
        timing from(state.range(0), state.range(0));
        timing to(state.range(0), state.range(0));
        timing current(state.range(0), state.range(0));

        bool b;
        long secsFrom = (from.hour * 3600) + (from.minute * 60);
        long secsTo = (to.hour * 3600) + (to.minute * 60);
        long secsCurrent = (current.hour * 3600) + (current.minute * 60);

        if (secsFrom > secsTo)
            secsTo += extraDay;
        if (secsCurrent > secsFrom && secsCurrent < secsTo)
            b = false;
        else
            b = true;
    }
}
#pragma optimize( "", on ) 

BENCHMARK(BM_isAllowed)->RangeMultiplier(2)->Range(2, 22)->MinTime(2);
BENCHMARK(BM_isAllowed2)->RangeMultiplier(2)->Range(2, 22)->MinTime(2);
BENCHMARK_MAIN();
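
As a side note, instead of disabling optimization with #pragma optimize, Google Benchmark provides benchmark::DoNotOptimize, which marks the result as observed so the optimizer cannot delete the kernel while the rest of the code is still compiled at full optimization. Here is a minimal editorial sketch of that style (BM_isAllowed2_dno is a hypothetical name; it reuses the timing struct and extraDay from the code above):

// Editorial sketch, assuming the same setup as the benchmark code above.
static void BM_isAllowed2_dno(benchmark::State& state) {
    timing from(static_cast<short>(state.range(0)), static_cast<short>(state.range(0)));
    timing to(static_cast<short>(state.range(0)), static_cast<short>(state.range(0)));
    timing current(static_cast<short>(state.range(0)), static_cast<short>(state.range(0)));

    while (state.KeepRunning())
    {
        long secsFrom = (from.hour * 3600) + (from.minute * 60);
        long secsTo = (to.hour * 3600) + (to.minute * 60);
        long secsCurrent = (current.hour * 3600) + (current.minute * 60);

        if (secsFrom > secsTo)
            secsTo += extraDay;
        bool b = !(secsCurrent > secsFrom && secsCurrent < secsTo);
        benchmark::DoNotOptimize(b);   // result is "used", so the loop body survives -O2
    }
}
BENCHMARK(BM_isAllowed2_dno)->RangeMultiplier(2)->Range(2, 22)->MinTime(2);
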
  • Look at the assembly code ;) Did you compile with optimization flags? – Holt Mar 06 '17 at 16:08
  • And that's why you don't guess about performance. Integer arithmetic is cheap, and branches are somewhere between literally free and ridiculously expensive, inclusively. So predicting stuff like that is hard, to say the least. – Baum mit Augen Mar 06 '17 at 16:11
  • You will be interested in reading [this answer about branch prediction](http://stackoverflow.com/a/11227902/7359094). It may be one of the highest voted answers on Stack Overflow. The question is definitely the highest voted question. – François Andrieux Mar 06 '17 at 16:14
  • @Holt Yeah, always compiled with all the optimization options set for speed; taking a look at the assembly code would probably be the best way to find out the "why", but I gotta admit I'm a bit rusty on assembly! – Mike Mar 06 '17 at 16:14
  • FYI, with clang 8.0 on macOS Sierra using the '-O3' flag the first version has up to five branches, and the second version is branchless. – Alnitak Mar 06 '17 at 16:14
  • And for me, both `totalNs1` and `totalNs2` were equal to 0 (even when running unoptimized code), which implies that both functions complete faster than the precision of the clock on my computer, and that such small deviations can be written off to the random nature of the process scheduler. Increase the test samples until it takes at least seconds to complete, then try to compare performances. – Algirdas Preidžius Mar 06 '17 at 16:15
  • @Mike These are quite simple functions; looking at the assembly is not that hard. After looking at the assembly, I would say that the second function is much easier for a compiler to optimize - you get a branchless function in the end, with only arithmetic operations. – Holt Mar 06 '17 at 16:19
  • @FrançoisAndrieux That answer seems to be definitively the answer to this question; I'll read it thoughtfully to better understand the why! – Mike Mar 06 '17 at 16:21
  • Not only is integer arithmetic fast but, at least in this case, it's also great for the pipelines. The three initial calculations are completely independent, which means that they can be computed in parallel before being joined by the comparisons later. – Matteo Italia Mar 06 '17 at 16:23
  • @AlgirdasPreidžius It's possible that your clock resolution is not high enough; could you maybe try with std::chrono::microseconds to check if it produces a non-zero result? – Mike Mar 06 '17 at 16:23
  • @Mike 1) `std::chrono::nanoseconds` does return a non-zero result. It's that the values stored in `t1`/`t2`/`t3`/`t4` are all equal. 2) As I already mentioned, all such small timeframes can be written off as random noise due to the scheduler. Increase the size of the test case until the total runtime is in the seconds, and then compare the results. – Algirdas Preidžius Mar 06 '17 at 16:28
  • @AlgirdasPreidžius Thanks a lot, will do that and will also check the assembly output of the testing code! – Mike Mar 06 '17 at 16:31
  • I am not at all convinced that the first function is correct! `isAllowed(10:01, 12:59, 11:30)` returns `true`, whereas the second one returns `false`. (And the correctness of the second one can be established by inspection.) – Martin Bonner supports Monica Mar 06 '17 at 17:19

2 Answers


There is no golden rule for this. Unfortunately, the performance of code like this is notoriously hard to predict. The most important thing to take away from that is

Measure everything!

Now to what's going on in your code: As others noted correctly, we can observe that isAllowed gets compiled to a function using branches, while isAllowed2 ends up being branchless.

Branches are interesting when talking about performance: They are somewhere between literally free and ridiculously expensive, inclusively. This is due to a CPU component called the branch predictor. It tries to predict which branch your control flow will take and makes the CPU speculatively execute it. If it guesses right, the branch is free. If it guesses wrong, the branch is expensive. A great and detailed explanation of that concept, including some numbers, can be found in this answer.
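
To make the distinction concrete, here is a tiny editorial sketch (not from the original answer; the function names are made up) of the same kind of range check in a branchy and a branchless style. Note that a compiler is free to lower either form either way; the point is only to illustrate the two shapes:

bool inRangeBranchy(int lo, int hi, int x) {
    // short-circuiting && naturally maps to conditional jumps
    if (x > lo) {
        if (x < hi)
            return true;
    }
    return false;
}

bool inRangeBranchless(int lo, int hi, int x) {
    // bitwise & evaluates both comparisons unconditionally; compilers
    // typically lower this to flag/setcc-style code with no jumps
    return (x > lo) & (x < hi);
}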

So now we need to decide whether we want the branching or the branchless version. In general, neither need be faster than the other! It really depends on how well your target CPUs can predict the branches, which of course depends on the actual input. (Choosing whether to compile a function to a branching or a branchless result is thus a hard problem for compilers as they don't know what CPUs the binary will be run on, nor what kind of input data to expect. See for example this blogpost.)

So if your benchmark was actually correct, we have determined that on your CPU the branches are too hard to predict to beat the relatively cheap integer arithmetic. This may also be due to the tiny number of test cases; the branch predictor cannot learn a pattern from so few invocations. But again, we cannot just call it one way or the other; we have to look at the actual performance in the specific case.
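
The classic way to see this effect is to time the same branchy loop over unsorted and then sorted data; on typical hardware, and as long as the compiler keeps the branch, the sorted pass runs much faster because the predictor can learn the pattern. A self-contained editorial sketch, in the spirit of the branch-prediction question linked in the comments (names and sizes are illustrative):

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::vector<int> data(1 << 22);
    std::mt19937 rng(42);
    for (int &v : data) v = static_cast<int>(rng() % 256);

    auto run = [&data](const char *label) {
        auto t0 = std::chrono::high_resolution_clock::now();
        long long sum = 0;
        for (int v : data)
            if (v >= 128)        // this branch is what the predictor must learn
                sum += v;
        auto t1 = std::chrono::high_resolution_clock::now();
        std::cout << label << ": sum = " << sum << ", "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
                  << " ms\n";
    };

    run("unsorted");                         // branch outcome looks random
    std::sort(data.begin(), data.end());
    run("sorted");                           // branch outcome is predictable
    return 0;
}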


As noted in the comments, the execution time is somewhat short for a good measurement; I see huge deviations on my machine. For information about micro-benchmarking, you can have a look at this talk; it's harder than one might think.

Also, as Martin Bonner helpfully noted, your two functions don't do the same thing; you'd have to fix that for a correct benchmark, of course.

  • This is definitely the answer I'm looking for. I'd like to take the opportunity to add a quick, small related question then: on optimized code the final result may differ from compiler to compiler, which means that the final program may perform much better if compiled with a certain compiler vs. another, correct? – Mike Mar 06 '17 at 23:41
  • @Mike Yes, that *can* happen. But with all of the big compilers, the resulting binaries are similarly fast most of the time; big differences are mostly to be expected if your code has some small but very performance-critical part. In big programs, the differences usually kind of even out, +/- a couple percent. – Baum mit Augen Mar 06 '17 at 23:55
  • Branch prediction simply starts a fetch a few to several clocks early. Fetches can take that many clocks on a good day (cache hit) or hundreds of clocks on a bad day (cache miss), so branch prediction does not solve the cost of branching, it just makes it slightly less painful. In general branches are slow and not branching is faster, but a lot also depends on memory accesses by those instructions; either way, 100 register-only non-branching instructions can be faster than a few non-branching memory-access-based instructions. The message is still completely correct: measure everything, and you can't predict. – old_timer Mar 14 '17 at 13:29
  • @old_timer Cache hits and branch prediction are mostly uncorrelated (right word I hope, but anyways, neither correct nor incorrect branch prediction implies a cache miss in general). The branch predictor speculatively executes stuff, and for real world work load, >>90% hit rate is not uncommon. It really can go both ways. So *"In general, branches are slow and not branching is faster"* no, *especially* if it avoids expensive instructions like `log`, `exp`, several floating point operations or integer divisions. Blanket statements like that absolutely do not work. – Baum mit Augen Mar 14 '17 at 21:35
  • The point is a branch predictor can only see what is in the pipe; the size of the pipe determines how far ahead it could theoretically predict. If the pipe is moving smoothly, then that many clocks is how far ahead it can start the fetch (compared to what it would do without the predictor), so it could start fetching a handful of clocks ahead, saving only a handful of clocks. If some of the instructions in the pipe are slow, memory accesses, then it might be dozens or hundreds ahead, sure, but the pipe is moving slowly, so what did you really save? – old_timer Mar 14 '17 at 22:03
  • My point is that it is not this magic thing that makes huge gains in performance; it is yet another thing that makes incremental gains. – old_timer Mar 14 '17 at 22:04
  • Cache is involved in that: where the branches land in the cache lines, and how the fetch and its buffer work to feed the pipe, affect how full the pipe is, whether the pipe stalls waiting on a cache line, etc. There is this herky-jerky motion, and you can tune branches to make branch prediction look good, or tune them to make it look bad, for purposes of demonstrating how much branch prediction helps or not. – old_timer Mar 14 '17 at 22:06
  • *"the point is a branch predictor can only see what is in the pipe"* The big high-end CPUs I target seem to be able to learn patterns from past results quite decently, but let's not get into a discussion about detailed working of branch prediction in the comments here. *"you can tune branches to make branch prediction look good and can tune them to make them look bad"* Exactly, and in the real world, it's somewhere in between. It is faster than some stuff and slower than other stuff, depending on CPU, actual data used etc. So we measure. – Baum mit Augen Mar 14 '17 at 22:10

Because you are not measuring what you want to measure.

In fact, executing your two functions takes around 40 ns on my computer, but if I use your test code I get results on the order of 500 ns.

You are not performing the measurement you want because:

1. The time to execute these functions just once is of the same order as (or even smaller than) the execution time of the function that actually reads the clock. As a rule of thumb, when testing, try to measure times that are larger than 10 ms.
2. Between the two ticks, the compiler has placed an aggressively inlined and optimized version of your functions, because it knows what the inputs are, which is probably not what will happen in the real case.
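
To see the scale of point 1, here is a minimal editorial sketch (not from the original answer) that times back-to-back now() calls; if the code under test is cheaper than this overhead, a single-shot measurement is mostly measuring the clock:

#include <chrono>
#include <iostream>

int main() {
    using namespace std::chrono;

    const int samples = 1000000;
    long long total = 0;
    for (int i = 0; i < samples; ++i) {
        auto a = high_resolution_clock::now();
        auto b = high_resolution_clock::now();   // back-to-back clock reads
        total += duration_cast<nanoseconds>(b - a).count();
    }
    // average cost of one now() pair; tens of nanoseconds is typical
    std::cout << "average now() overhead: " << (total / samples) << " ns\n";
    return 0;
}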

If you put the definitions of your two functions in a different file from the one where your inputs are defined:

//is_allowed.cpp
struct timing {
    short hour;
    short minute;
};
bool isAllowed(timing &from, timing &to, timing &actual) {
    return !(((from.hour > to.hour && (actual.hour >= from.hour || actual.hour <= to.hour)) ||
        (actual.hour >= from.hour && actual.hour <= to.hour)) &&
        !(actual.minute > from.minute && actual.minute < to.minute));
}

static long getSecs(short hour, short minutes) {

    return (hour * 3600) + (minutes * 60);

}

bool isAllowed2(timing &from, timing &to, timing &current) {

    long secsFrom = getSecs(from.hour, from.minute);
    long secsTo = getSecs(to.hour, to.minute);
    long secsCurrent = getSecs(current.hour, current.minute);

    if (secsFrom > secsTo) secsTo += 24 * 60 * 60;
    if (secsCurrent > secsFrom && secsCurrent < secsTo) {
        return false;
    }

    return true;
}

And then execute your functions a million times between the "ticks", you will get a much more reliable result:

int main(){
//...

            high_resolution_clock::time_point t1 = high_resolution_clock::now();
            for (int x=1;x<1000000;++x)
               isAllowed(from[j], to[j], current[i]);
            high_resolution_clock::time_point t2 = high_resolution_clock::now();

            high_resolution_clock::time_point t3 = high_resolution_clock::now();
            for (int x=1;x<1000000;++x)
               isAllowed2(from[j], to[j], current[i]);
            high_resolution_clock::time_point t4 = high_resolution_clock::now();

            long ns1 = duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
            totalNs1 += ns1;
            long ns2 = duration_cast<std::chrono::nanoseconds>(t4 - t3).count();
            totalNs2 += ns2;

//            std::cout << j + 1 << "\t\t:1:" << ns1 << "ns: " << response << (response == ergValues[i][j] ? okay : error) << "\t\t:2:" << ns2 << "ms: " << response2 << (response2 == ergValues[i][j] ? okay : error) << "\t\t"
//                << (ergValues[i][j] ? receive : notReceive) << std::endl;
//...
    }

Surprise, I get:

    totalNs1=38085793  //(38ms)
    totalNs2=52182920  //(52ms)

While with your code with the exact same computer and compiler I got:

    totalNs1 = 927
    totalNs2 = 587

As you expected, the first version, isAllowed, is actually the winner!

  • We get the exact opposite result by just reordering the function calls in a random fashion (totalNs1 = 507092140; totalNs2 = 394444324 on my machine). As I mentioned in my answer, their benchmark is indeed dubious, but fixing that is a little bit more involved than just putting the functions in another file and calling them more often with the same input, even if we ignore the fact that the two functions don't do the same thing. Micro-benchmarking correctly is not *that* trivial. – Baum mit Augen Mar 06 '17 at 20:13
  • @BaummitAugen I wonder how your test, like the one given in the question, can give such large time periods? The time needed to execute one of these functions should be on the order of 20 ns, definitely not hundreds of ns. – Oliv Mar 06 '17 at 20:38
  • @BaummitAugen I just reread your comment... You have performed randomly ordered calls, so to get a measure of the total time spent in each of the two functions you must have measured each call? – Oliv Mar 06 '17 at 20:52
  • You are using parameter set A 1M times, then parameter set B 1M time, etc. This yields a maximally optimistic pattern for the branch predictor, creating bias towards `isAllowed`. I did the exact same calls as you did, but in a random order. That's maximally pessimistic for the branch predictor, it will fail very often, creating bias against `isAllowed`. Note that I'm not saying one is faster than the other, I certainly imagine `isAllowed` being faster in the end. I'm just saying that a) we cannot just tell by looking at the code and b) we need to be careful when doing micro-benchmarks. – Baum mit Augen Mar 06 '17 at 21:01
  • To emphasize: My point is simply *"Don't just expect stuff, measure."* And before you get deeper into this benchmarking business in this specific case, you should make sure the functions you compare do the same thing. Currently, they don't. – Baum mit Augen Mar 06 '17 at 21:04
  • I'll not go deeper into this benchmarking; it is not intellectually stimulating. I just pointed out that Mike was not performing the benchmark he was expecting to do. He was benchmarking the execution of the function now(), which is neither precise nor reliable even though it has high resolution! – Oliv Mar 06 '17 at 22:00
  • Yes, I agree that the original benchmark was deeply flawed. In hindsight, my downvote feels somewhat premature; if you edit your answer, I'll remove it. – Baum mit Augen Mar 06 '17 at 23:52
  • I downvoted because I consider saying "Measure everything" and then giving a theoretical explanation for the observed difference in efficiency to be not really consistent. I am an experimental scientist, but Stack Overflow is not the appropriate field for fighting over such considerations. – Oliv Mar 07 '17 at 08:48
  • "Giving a theoretical explanation for the observed difference of efficiency" was not my goal. Instead, I tried to explain why one cannot predict performance easily by pointing out one hard to predict factor (the branch predictor) that is relevant here. I guess I should have made that clearer. – Baum mit Augen Mar 07 '17 at 13:16
  • My interpretation of your text is without any doubt not what you meant! – Oliv Mar 07 '17 at 15:08