GCC optimization flag -O2 makes code much slower that -O0

Question

Here is my code.

foo() is simple. It compares the argument to some strings one by one.

main() is a little complex. It just call foo() with different strings and time it, 3 times.

#include <string.h>
#include <time.h>
#include <stdio.h>
int foo(const char *s)
{
    int r = 0;
    if (strcmp(s, "a11111111") == 0) {
        r = 1;
    } else if (strcmp(s, "b11111111") == 0) {
        r = 2;
    } else if (strcmp(s, "c11111111") == 0) {
        r = 3;
    } else if (strcmp(s, "d11111111") == 0) {
        r = 4;
    } else if (strcmp(s, "e11111111") == 0) {
        r = 5;
    } else if (strcmp(s, "f11111111") == 0) {
        r = 6;
    } else if (strcmp(s, "g11111111") == 0) {
        r = 7;
    }
    return r;
}

int main()
{
    char *ss[] = {
        "a11111111",
        "b11111111",
        "c11111111",
        "d11111111",
        "e11111111",
        "f11111111",
        "g11111111",
        "h11111111",
    };

    int r = 0;
    for (int i = 0; i < 3; i++) {
        clock_t begin = clock();
        for (int j = 0; j < 100000; j++) {
            r = foo(ss[r]);
        }
        printf("%lu\n", clock() - begin);
    }

    return r;
}

Here is the result:

$ gcc tmp.c
$ ./a.out
3979
4025
4272
$
$ gcc tmp.c -O2
$ ./a.out
20575
14572
13841
$

As you can see, -O2 makes the code much slower. Why?

GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)

Thanks in advance,

Wu

==== EDIT ====

Thanks @Ville Krumlinde. I make another code to test strcmp only. There are 2 files:

//// main.c
#include <time.h>
#include <stdio.h>

int x_strcmp(const char *a);

int main()
{
    int r = 0;
    for (int i = 0; i < 3; i++) {
        clock_t begin = clock();
        for (int j = 0; j < 100000; j++) {
            r += x_strcmp("a111111111");
        }
        printf("%lu\n", clock() - begin);
    }

    return r;
}

//// tmps.c
#include <string.h>
int x_strcmp(const char *a)
{
    return strcmp(a, "a11111111");
}

The result:

$ gcc main.c -O2 -c
$
$ gcc tmps.c -c && cc main.o tmps.o -o a.out
$ ./a.out
870
896
857
$
$ gcc tmps.c -c -O2 && cc main.o tmps.o -o a.out
$ ./a.out
6290
5191
6382
$

The only difference is with or without -O2 when compiling tmps.c. And the flag still makes the code much slower.

Here are the assembly code:

#without -O2
x_strcmp:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movq    %rdi, -8(%rbp)
        movq    -8(%rbp), %rax
        movl    $.LC0, %esi
        movq    %rax, %rdi
        call    strcmp
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc

# with -O2
x_strcmp:
.LFB24:
        .cfi_startproc
        movq    %rdi, %rsi
        movl    $10, %ecx
        movl    $.LC0, %edi
        repz cmpsb
        seta    %al
        setb    %dl
        subl    %edx, %eax
        movsbl  %al, %eax
        ret
        .cfi_endproc

[Output here](https://godbolt.org/z/1f64M6). `gcc -O2` inlines all the `strcmp`s as `repz cmpsb` whereas `gcc -O0` leaves them as library calls. Commenting out all the conditional jumps in `foo()` makes it 10 times faster, even though it never gets to skip any comparison. So I suppose it must be some hideous failure of branch prediction, but the `gcc -O0` has just as many conditional branches, so I don't see why it would be so much better. — Nate Eldredge, Mar 11 '21 at 06:47
@NateEldredge Changing to `gcc trunk` makes it no longer generate `repz cmps` and then the speed difference disappear. I'm guessing that calling `strcmp` is more efficient. — Ville Krumlinde, Mar 11 '21 at 10:03
@phuclv: The godbolt link in my comment above has just that. — Nate Eldredge, Mar 11 '21 at 15:42
@VilleKrumlinde: I'm not sure it's that simple. As far as I can tell, it's not the `repz cmps` that's taking most of the time. (On my machine, `strcmp` uses AVX, so it's a couple of 32-byte loads and comparison. Maybe faster than 9 iterations of `cmpsb`, but a factor of 3 would surprise me, especially since there's extra overhead in checking for page boundaries, etc.) — Nate Eldredge, Mar 11 '21 at 15:46
@NateEldredge, I make another code to test `strcmp` only. I append the code to the question. It seems that it's the `repz cmps` that makes the code slow. — Bingzheng Wu, Mar 12 '21 at 02:04
@NateEldredge: Basically a duplicate of [Why is this code 6.5x slower with optimizations enabled?](https://stackoverflow.com/q/55563598) - inline expansion of string functions using slow `repz` or `repnz` stuff. (But in that question it was strlen, and the string was growing to huge lengths, so the effect was even more dramatic. And the `-O2` inline expansion for known-aligned buffers was a scalar bithack instead of simple SSE2. The strings in this question are 10 bytes long (including terminator); just barely too long to do with a single qword `cmp` against a register :/) — Peter Cordes, Mar 12 '21 at 07:56
Others have pointed out - the horrible performance of `repz cmpsb`, but also the dependencies on the flags register from the previous operation in @NateEldredge's [link](https://godbolt.org/z/1f64M6); Apart from clobbering `rdi`, `rsi`, and `rcx`, the following sequence: `seta dl; sbb dl, 0; test dl, dl; je .L1` form a dependency chain. These strings only differ by the first character, yielding only 2 iterations of `repz` - it's clear how incredibly *bad* this 'optimized' code is, if an *function call* can beat it! — Brett Hale, Mar 12 '21 at 16:13
Even with a very simple (re-named) `strcmp` replacement, the difference is [amazing](https://godbolt.org/z/MWTxKx). — Brett Hale, Mar 12 '21 at 16:23

GCC optimization flag -O2 makes code much slower that -O0

0 Answers0

Linked