7

This is the typical kind of question that gets downvoted, so I've hesitated (a lot) before posting it...

I know this question was marked as a duplicate, but my tests (if they are valid: are they? that is part of the question) tend to show this is not the case.

At the beginning, I did some tests comparing a for loop to a while loop.

Those tests showed that the for loop was faster.

But digging further, for versus while turned out not to be the point: the difference comes down to:

for (int l = 0; l < loops;l++) {

or

for (int l = 0; l != loops;l++) {

And if you run that (under Windows 10, Visual Studio 2017, Release), you see that the first one is more than twice as fast as the second one.

It is hard (for a novice like me) to understand whether the compiler is, for some reason, able to optimize one better than the other. But...

  • Short question

Why?

  • Longer question

The complete code is the following:

For the '<' loop:

int forloop_inf(int loops, int iterations)
{
    int n = 0;
    int x = n;

    for (int l = 0; l < loops;l++) {
        for (int i = 0; i < iterations;i++) {
            n++;
            x += n;
        }
    }

    return x;
}

For the '!=' loop:

int forloop_diff(int loops, int iterations)
{
    int n = 0;
    int x = n;

    for (int l = 0; l != loops;l++) {
        for (int i = 0; i != iterations;i++) {
            n++;
            x += n;
        }
    }

    return x;
}

In both cases, the inner calculation is only there to keep the compiler from optimizing the loops away entirely.
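
(As an aside, another common way to keep the optimizer from discarding the work entirely is to store the result into a volatile object. The sketch below is illustrative only, with a made-up sink variable; it is not the code that was benchmarked.)

/* Illustrative sketch only, not the benchmarked code: the store to the
   volatile `sink` is observable behaviour, so the value of x must really
   be produced, even though the loops themselves may still be simplified. */
static volatile int sink;

int forloop_inf_kept(int loops, int iterations)
{
    int n = 0;
    int x = 0;

    for (int l = 0; l < loops; l++) {
        for (int i = 0; i < iterations; i++) {
            n++;
            x += n;
        }
    }

    sink = x;
    return x;
}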

The two loop functions are called, respectively, by:

printf("for loop inf %f\n", monitor_int(loops, iterations, forloop_inf, &result));
printf("%d\n", result);

and

printf("for loop diff %f\n", monitor_int(loops, iterations, forloop_diff, &result));
printf("%d\n", result);

where loops = 10 * 1000 and iterations = 1000 * 1000 (so 10,000,000,000 executions of the inner body in total).

And where monitor_int is:

double monitor_int(int loops, int iterations, int(*func)(int, int), int *result)
{
    clock_t start = clock();

    *result = func(loops, iterations);

    clock_t stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC;
}

The result in seconds is:

for loop inf 2.227 seconds
for loop diff 4.558 seconds

So, even though the practical relevance of all this depends on how much work is done inside the loop compared to the loop overhead itself, why such a difference?

Edit:

You can find here the full source code, revised so that the functions are called in random order several times.

The corresponding disassembly is here (obtained with dumpbin /DISASM CPerf2.exe).

Running it, I now obtain:

  • '!=' 0.045231 (average over 493 runs)
  • '<' 0.031010 (average over 507 runs)

I do not know how to set -O3 in Visual Studio; the compile command line is the following:

/permissive- /Yu"stdafx.h" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64\Release\vc141.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oi /MD /FC /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Ot /Fp"x64\Release\CPerf2.pch" /diagnostics:classic

The code for the loops is shown above; here is the code that runs them in random order:

typedef int(loop_signature)(int, int);

void loops_compare()
{
    int loops = 1 * 100;
    int iterations = 1000 * 1000;
    int result;

    loop_signature *functions[2] = {
        forloop_diff,
        forloop_inf
    };

    int n_rand = 1000;

    int n[2] = { 0, 0 };
    double cum[2] = { 0.0, 0.0 };

    for (int i = 0; i < n_rand;i++) {
        int pick = rand() % 2;
        loop_signature *fun = functions[pick];

        double time = monitor_int(loops, iterations, fun, &result);
        n[pick]++;
        cum[pick] += time;
    }

    printf("'!=' %f (%d) / '<' %f (%d)\n", cum[0] / (double)n[0], n[0], cum[1] / (double)n[1], n[1]);
}

and the disassembly (the loop functions only; I am not sure it is the right extract from the link above):

?forloop_inf@@YAHHH@Z:
  0000000140001000: 48 83 EC 08        sub         rsp,8
  0000000140001004: 45 33 C0           xor         r8d,r8d
  0000000140001007: 45 33 D2           xor         r10d,r10d
  000000014000100A: 44 8B DA           mov         r11d,edx
  000000014000100D: 85 C9              test        ecx,ecx
  000000014000100F: 7E 6F              jle         0000000140001080
  0000000140001011: 48 89 1C 24        mov         qword ptr [rsp],rbx
  0000000140001015: 8B D9              mov         ebx,ecx
  0000000140001017: 66 0F 1F 84 00 00  nop         word ptr [rax+rax]
                    00 00 00
  0000000140001020: 45 33 C9           xor         r9d,r9d
  0000000140001023: 33 D2              xor         edx,edx
  0000000140001025: 33 C0              xor         eax,eax
  0000000140001027: 41 83 FB 02        cmp         r11d,2
  000000014000102B: 7C 29              jl          0000000140001056
  000000014000102D: 41 8D 43 FE        lea         eax,[r11-2]
  0000000140001031: D1 E8              shr         eax,1
  0000000140001033: FF C0              inc         eax
  0000000140001035: 8B C8              mov         ecx,eax
  0000000140001037: 03 C0              add         eax,eax
  0000000140001039: 0F 1F 80 00 00 00  nop         dword ptr [rax]
                    00
  0000000140001040: 41 FF C1           inc         r9d
  0000000140001043: 83 C2 02           add         edx,2
  0000000140001046: 45 03 C8           add         r9d,r8d
  0000000140001049: 41 03 D0           add         edx,r8d
  000000014000104C: 41 83 C0 02        add         r8d,2
  0000000140001050: 48 83 E9 01        sub         rcx,1
  0000000140001054: 75 EA              jne         0000000140001040
  0000000140001056: 41 3B C3           cmp         eax,r11d
  0000000140001059: 7D 06              jge         0000000140001061
  000000014000105B: 41 FF C2           inc         r10d
  000000014000105E: 45 03 D0           add         r10d,r8d
  0000000140001061: 42 8D 0C 0A        lea         ecx,[rdx+r9]
  0000000140001065: 44 03 D1           add         r10d,ecx
  0000000140001068: 41 8D 48 01        lea         ecx,[r8+1]
  000000014000106C: 41 3B C3           cmp         eax,r11d
  000000014000106F: 41 0F 4D C8        cmovge      ecx,r8d
  0000000140001073: 44 8B C1           mov         r8d,ecx
  0000000140001076: 48 83 EB 01        sub         rbx,1
  000000014000107A: 75 A4              jne         0000000140001020
  000000014000107C: 48 8B 1C 24        mov         rbx,qword ptr [rsp]
  0000000140001080: 41 8B C2           mov         eax,r10d
  0000000140001083: 48 83 C4 08        add         rsp,8
  0000000140001087: C3                 ret
  0000000140001088: CC CC CC CC CC CC CC CC
?forloop_diff@@YAHHH@Z:
  0000000140001090: 45 33 C0           xor         r8d,r8d
  0000000140001093: 41 8B C0           mov         eax,r8d
  0000000140001096: 85 C9              test        ecx,ecx
  0000000140001098: 74 28              je          00000001400010C2
  000000014000109A: 44 8B C9           mov         r9d,ecx
  000000014000109D: 0F 1F 00           nop         dword ptr [rax]
  00000001400010A0: 85 D2              test        edx,edx
  00000001400010A2: 74 18              je          00000001400010BC
  00000001400010A4: 8B CA              mov         ecx,edx
  00000001400010A6: 66 66 0F 1F 84 00  nop         word ptr [rax+rax]
                    00 00 00 00
  00000001400010B0: 41 FF C0           inc         r8d
  00000001400010B3: 41 03 C0           add         eax,r8d
  00000001400010B6: 48 83 E9 01        sub         rcx,1
  00000001400010BA: 75 F4              jne         00000001400010B0
  00000001400010BC: 49 83 E9 01        sub         r9,1
  00000001400010C0: 75 DE              jne         00000001400010A0
  00000001400010C2: C3                 ret
  00000001400010C3: CC CC CC CC CC CC CC CC CC CC CC CC CC

Edit again:

What I also find surprising is the following:

  • In debug mode the performance is the same (and so is the assembly code)
  • So how can you be confident about the code you write if such differences appear afterwards? (assuming I have not made a mistake somewhere)
lemon
  • If somebody told you to in most cases prefer comparison operators to non-identity or identity, it is a sound recommendation. If you understood or actually were told that the reason is speed, then that part is something between a misunderstanding and useless misinformation. – Yunnosch May 29 '18 at 18:59
  • 6
    I have somehow the impression that a look at generated code may clarify many things... – Adriano Repetti May 29 '18 at 19:03
  • 4
    Yes, please show us generated assembly language for both loops. This is really a question about one particular compiler's optimizer doing something unexpected, and we need to see inside the black box to understand why. – zwol May 29 '18 at 19:09
  • 2
    And to answer the question "is < really faster than != in C" the answer is *no*, because it is not faster on *my* computer using *my* compiler ;) – Antti Haapala -- Слава Україні May 29 '18 at 19:10
  • 3
    A full [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) would be better, if only so that we can run the code, and ensure there is nothing else there. Your tests seem to show that the use of `!=` is less efficient than `<`. I don't believe that – at the processor level it is the same test but on different CPU flags. However I am certainly in favour of using `<` over `!=` because it is a better catch, even if you don't expect it to be greater. – Weather Vane May 29 '18 at 19:11
  • I get 0.0 for the inf code and 4.49 for the diff (VS 2015 - Release). Stopwatch says 3.3 for the inf code, but clearly there is a flaw in the testing methodology (although I don't know what it is). I tried QueryPerformanceCounter but got the same results. Disassembly was useless to me (and shockingly long) – zzxyz May 29 '18 at 19:25
  • I get 6.83 sec for both using gcc v4.4.7 with -O2 on an old RH 6 server.. – stark May 29 '18 at 19:30
  • Moved the time test into main which cleared up the 0.0 on my system. Now similar results to OP. 4.5 and 3.2. – zzxyz May 29 '18 at 19:31
  • Are you saying that `loops = 10 * 1000` and `iterations = 1000 * 1000` as nested loops can be done in a few seconds? – Weather Vane May 29 '18 at 19:44
  • 1) Likely `for (int l = 0; l < loops;l++) { for (int i = 0; i < iterations;i++) {` optimizes differently than `for (int l = 0; l != loops;l++) { for (int i = 0; i != iterations;i++) {` 2) Try reversing the test order too. – chux - Reinstate Monica May 29 '18 at 19:46
  • I have implemented both styles of loops alternately, 3 times each with values that take about 2 seconds and can't find any significant difference. – Weather Vane May 29 '18 at 19:48
  • @chux why would it do that? The instructions are essentially the same. – Weather Vane May 29 '18 at 19:48
  • @WeatherVane We could reverse the [question](https://stackoverflow.com/questions/50591174/is-really-faster-than-in-c?noredirect=1#comment88193104_50591174): Why expect the same emitted code when the source code is different? I could see a compiler identifying an optimization of the double `for()` loops that is not seen in the other case. Sure would help if OP posted the emitted code. – chux - Reinstate Monica May 29 '18 at 20:49
  • "At the beginning, I did some tests comparing a for loop to a while loop. That shows that for loop was better." - that already means that you are running some nonsensical tests. – AnT stands with Russia May 29 '18 at 21:00
  • As you researched so hard whether to prefer `while` or `for`, yet another hint (though OT): I would prefer prefix inc./dec. always except there are actual reasons for postfix. [SE: Avoid Postfix Increment Operator](https://softwareengineering.stackexchange.com/a/60002) – Scheff's Cat May 30 '18 at 06:20
  • @AdrianoRepetti, question edited with access to full source + disassembly – lemon Jun 01 '18 at 12:47
  • @chux, question edited. At the end, you'll find links to the complete code and disassembly. Not found O3 mentioned in the answer. – lemon Jun 02 '18 at 08:16
  • @lemon Good to see an improved update (remove DV). Yet to make even better, post the code _here_ and a _minimal_ assembly of the 2 `forloop_xxx()` _here_. – chux - Reinstate Monica Jun 02 '18 at 14:45
  • @chux, done, at least I hope so; concerning the assembly code, I am not sure it is the (whole) right part... – lemon Jun 02 '18 at 16:43

3 Answers

11

For proper benchmarking, it's important to run the functions in random order and many times.

typedef int(signature)(int, int);

...

int main() {
    int loops, iterations, runs;

    fprintf(stderr, "Loops: ");
    scanf("%d", &loops);
    fprintf(stderr, "Iterations: ");
    scanf("%d", &iterations);
    fprintf(stderr, "Runs: ");
    scanf("%d", &runs);

    fprintf(stderr, "Running for %d loops and %d iterations %d times.\n", loops, iterations, runs);

    signature *functions[2] = {
        forloop_inf,
        forloop_diff
    };

    int result = functions[0](loops, iterations);
    for( int i = 0; i < runs; i++ ) {
        int pick = rand() % 2;
        signature *function = functions[pick];

        int new_result;
        printf("%d %f\n", pick, monitor_int(loops, iterations, function, &new_result));
        if( result != new_result ) {
            fprintf(stderr, "got %d expected %d\n", new_result, result);
        }
    }
}

Armed with this we can do 1000 runs in random order and find the average times.

It's also important to benchmark with optimizations turned on. Not much point in asking how fast unoptimized code will run. I'll try at -O2 and -O3.

My findings are that with Apple LLVM version 8.0.0 (clang-800.0.42.1) doing 10000 loops and 1000000 iterations at -O2 forloop_inf is indeed about 50% faster than forloop_diff.

forloop_inf: 0.000009
forloop_diff: 0.000014

Looking at the generated assembly code for -O2 with clang -O2 -S -mllvm --x86-asm-syntax=intel test.c I can see many differences between the two implementations. Maybe somebody who knows assembly can tell us why.

But at -O3 the performance difference is no longer discernible.

forloop_inf: 0.000002
forloop_diff: 0.000002

This is because at -O3 they are almost exactly the same. One is using je and one is using jle. That's it.


In conclusion, when benchmarking...

  • Do many runs.
  • Randomize the order.
  • Compile and run as close as you can to how you would in production.
    • In this case that means turning on compiler optimizations.
  • Look at the assembly code.

And most of all.

  • Pick the safest code, not the fastest.

i < max is safer than i != max because it will still terminate if i somehow jumps over max.

As demonstrated, with optimizations turned on they are both so fast that, even not fully optimized (at -O2), they whip through 10,000,000,000 iterations in 0.000009 seconds. i < max versus i != max is unlikely to be your performance bottleneck; whatever you're doing 10 billion times is.

But i != max could lead to a bug.
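
For instance (a hypothetical snippet, not from the question), if the step is ever anything other than 1, the != form can skip right over the bound:

#include <stdio.h>

int main(void)
{
    /* terminates: 0, 3, 6, 9, then 12 fails i < 10 */
    for (int i = 0; i < 10; i += 3)
        printf("%d\n", i);

    /* never terminates: i is never exactly 10 (0, 3, 6, 9, 12, ...),
       and running past INT_MAX would be undefined behaviour anyway */
    /* for (int i = 0; i != 10; i += 3)
           printf("%d\n", i); */

    return 0;
}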

Schwern
  • 1
    Agree with all aside from `i < max` is safer than `i != max`. Neither is safer as each handles a disjoint subset of problems. I'd advocate coding for clarity over perceived safe-ness. Otherwise good answer deserving of at least a +1. – chux - Reinstate Monica May 29 '18 at 22:48
  • I've read your answer, thanks. Will come back and edit the question when I find some time to make additional tests following your advice. – lemon May 31 '18 at 09:59
  • @lemon It looks like you're getting the same 50% spread as I am with -O2. Every compiler is different, and I don't know Visual Studio, but you could try /Ox. https://msdn.microsoft.com/en-us/library/59a3b321.aspx – Schwern Jun 01 '18 at 18:34
  • No difference at runtime with /Ox... Very strange: in debug the results are equal (and the assembly only differs by one instruction). But how can you be confident in your code if you cannot get equal results for things that should be equal? Anyway, I accept your answer. – lemon Jun 02 '18 at 08:00
2

"<" is not faster than '!='. What's happening is something entirely different.

A loop `for (i = 0; i < n; ++i)` is a pattern that the compiler recognises. If the loop body contains no instructions modifying `i` or `n`, then the compiler knows the loop executes exactly max(n - i, 0) times and can produce optimal code for it.

A loop `for (i = 0; i != n; ++i)` is used much less often in practice, so compiler writers are not too bothered with it, and the number of iterations is much harder to determine. If `i` starts out greater than `n`, the loop only stops when `i` overflows, which for signed integers is undefined behaviour unless some other statement exits the loop. For unsigned integers the iteration count in that case depends on the width of `i`'s type, because the counter has to wrap around before it can equal `n`. You simply get less optimised code.
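
To illustrate, here is a hand-written sketch (a made-up function name, not actual compiler output) of the kind of transformation a provable trip count enables. It mirrors the shape of the MSVC disassembly of the `<` version shown in the question, where the inner loop is unrolled by two and an odd leftover iteration is handled separately:

/* Hand-written sketch, not compiler output: a 2x-unrolled inner loop.
   The compiler can only do this because the '<' condition lets it prove
   how many iterations remain; ignoring signed overflow, this computes
   the same result as forloop_inf. */
int forloop_inf_unrolled(int loops, int iterations)
{
    int n = 0;
    int x = 0;

    for (int l = 0; l < loops; l++) {
        int i = 0;

        /* two original iterations per pass */
        for (; i + 1 < iterations; i += 2) {
            x += (n + 1) + (n + 2);
            n += 2;
        }

        /* leftover iteration when iterations is odd */
        if (i < iterations) {
            n++;
            x += n;
        }
    }

    return x;
}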

gnasher729
1

Always look at the generated code.

This used to be true many years ago, when some microprocessors lacked certain conditional branch instructions or had very few flags, so some conditions had to be compiled into a series of comparisons and jumps.

But it is no longer true: modern processors have a rich set of conditional branch instructions (some, such as ARM, have many conditionally executed "regular" instructions as well) and plenty of flags.

You can experiment with different conditions yourself here: https://godbolt.org/g/9DsqJm
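
For example, a minimal pair like the one below (a sketch with made-up names; the exact output depends on the compiler and options) is enough to compare the two conditions side by side. On x86-64 both normally end up as a single compare followed by one conditional branch:

/* Minimal functions to paste into a compiler explorer; names are made up.
   Each loop condition typically becomes one cmp plus one conditional jump. */
int sum_lt(int n)
{
    int x = 0;
    for (int i = 0; i < n; i++)    /* exit branch: jge (or similar) */
        x += i;
    return x;
}

int sum_ne(int n)
{
    int x = 0;
    for (int i = 0; i != n; i++)   /* exit branch: je (or similar) */
        x += i;
    return x;
}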

0___________
  • Thanks for the link. Unfortunately, the assembly generated by VS differs a lot depending on the compile options and does not match (as far as I've seen) what the link shows. – lemon Jun 02 '18 at 10:12
  • Giving it a closer look, the link is good at showing the compiler's behaviour (and it's easy to use). Thanks. Though the follow-up question would be: how can you be confident about your code if the compiler gives unexpected (irregular) results? – lemon Jun 02 '18 at 10:45