How do I force the GCC compiler to unroll loops?

Question

I'm writing a C program and want a loop unrolled. I need to force the compiler to unroll a loop where the number of iterations is known at compile time.

I've tried compiling using "-O0 -funroll-all-loops". I've tried using #pragma GCC unroll, I've tried using attribute((optimize("unroll-loops"))). None of it seems to unroll the following test function:


#define STR_LEN 12
int __attribute__((noinline)) func(char *input_str) {
    char* fixed_str = "fixed string";
    for (int i = 0; i < STR_LEN; i++) {
        if (fixed_str[i] != input_str[i]) {
            return -1;
        }
        if (i == STR_LEN - 1) {
            return 0;
        }
    }
    return 0;
}

I've compiled using "-S" to see the assembly code and the loop seems the same.

Edit: I used #pragma unroll and compiled with -O1 and it works.

No other optimization options work with `-O0`. Use at least `-Og` or `-O1` if you want asm that isn't total garbage. — Peter Cordes, May 20 '23 at 09:54
Welcome to SO. Would the unrolled loop actually be quicker? Did the compiler remove the second `if` inside the loop as it is useless? — Gerhardh, May 20 '23 at 11:26
Near duplicate: [What exactly is the difference between various optimization levels in gcc/g++?](https://stackoverflow.com/q/60386091) . And [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) re: making asm that's interesting to look at, including optimization options. — Peter Cordes, May 20 '23 at 11:29
The extra comparison I == STRLEN-1 probably interferes. Pull “return 0” out of the loop. Actually you can remove it completely. — gnasher729, May 20 '23 at 11:33
Are you experimenting with performance differences from various optimization such as loop unrolling, or are you *assuming* the unrolled loop will be faster? x86 tends to be a relatively register-starved architecture that doesn't often benefit from loop unrolling - without a lot of registers there's not many registers to put the values from unrolling the loop into. IME, let the compiler optimize the code as it sees fit, then look at the assembly it outputs for loops that are *known bottlenecks*. If there are unused registers, the loop is a candidate for unrolling. Spills and fills? Split it. — Andrew Henle, May 20 '23 at 11:56
(cont) And then, after making the change you need to profile it again to make sure you actually made it faster. Oh, and all of this has to be done on hardware that's identical to what you're actually going to run it on when it really matters. Even an identical CPU isn't really enough because things like memory bandwidth and memory latency matter. And if you have a NUMA system, actual memory placement matters a lot too. — Andrew Henle, May 20 '23 at 11:59
I'm experimenting with branch predictions. I used -O1 and the loop unrolled. Thank you very much. — alon, May 21 '23 at 07:08
@AndrewHenle: You don't need a lot of architectural registers to unroll this. A sequence of load/compare/branch can reuse the same register every time, and either increment a counter between blocks or have separate jump targets that add the different offsets this load used (or a mix of both). On CPUs with out-of-order exec and register renaming, that single architectural register for multiple load results gets renamed onto different architectural registers with no false dependency (as long as the load uses `movzx eax, byte [rdi + 2]` or whatever, not `mov al, [rdi + 2]`). — Peter Cordes, May 21 '23 at 07:28
Of course, the big performance gains possible for this loop come from checking multiple bytes at once, i.e. vectorizing if not near the end of a page. Like in [Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/a/57676035) . Or for just equality, you could even use scalar 8-byte and 4-byte registers, using `xor` and `bsf` / `shr reg,3` to find the position of the difference if not equal. — Peter Cordes, May 21 '23 at 07:31

Luis Colorado · Answer 1 · 2023-05-25T05:14:22.127

As you have been indicated in the comments, you are meshing with the compilation options. But you can never enforce loop unrolling (or any other optimization on some code) if the compiler decides that the loop is not possible to unroll. Your loop seems to be unrollable, but the requirement that the unrolled loop must behave externally (as the user sees it) as it does without the optimization applies, can enforce the code to be conserved as a loop.

BTW, you can try to unroll it yourself. Let's see what happens:

#define STR_LEN 12
int __attribute__((noinline)) func(char *input_str) {
    char* fixed_str = "fixed string";
    for (int i = 0; i < STR_LEN; i++) {
        if (fixed_str[i] != input_str[i]) {
            return -1;
        }
        if (i == STR_LEN - 1) {
            return 0;
        }
    }
    return 0;
}

in the first step, we unroll the loop, as many times as necessary, to eliminate the i loop control variable. I'll substitute also the value of STR_LEN - 1 constant expression so the ifs can be clearly eliminated in the next optimization pass. This is something we can do, because the i variable is not modified inside the loop (think that, just passing it by non-const * reference to a function would allow it to be modified inside the function and automatically blocked us to do this step, or even just declaring it as volatile)

int __attribute__((noinline)) func(char *input_str) {
    char* fixed_str = "fixed string";
    /* i = 0 */
        if (fixed_str[0] != input_str[0]) {
            return -1;
        }
        if (0 == 11) {
            return 0;
        }
    /* i = 1 */
        if (fixed_str[1] != input_str[1]) {
            return -1;
        }
        if (1 == 11) {
            return 0;
        }
    /* i = 2 */
        if (fixed_str[2] != input_str[2]) {
            return -1;
        }
        if (2 == 11) {
            return 0;
        }
    /* i = 3 */
        if (fixed_str[3] != input_str[3]) {
            return -1;
        }
        if (3 == 11) {
            return 0;
        }
    /* i = 4 */
        if (fixed_str[4] != input_str[4]) {
            return -1;
        }
        if (4 == 11) {
            return 0;
        }
    /* i = 5 */
        if (fixed_str[5] != input_str[5]) {
            return -1;
        }
        if (5 == 11) {
            return 0;
        }
    /* i = 6 */
        if (fixed_str[6] != input_str[6]) {
            return -1;
        }
        if (6 == 11) {
            return 0;
        }
    /* i = 7 */
        if (fixed_str[7] != input_str[7]) {
            return -1;
        }
        if (7 == 11) {
            return 0;
        }
    /* i = 8 */
        if (fixed_str[8] != input_str[8]) {
            return -1;
        }
        if (8 == 11) {
            return 0;
        }
    /* i = 9 */
        if (fixed_str[9] != input_str[9]) {
            return -1;
        }
        if (9 == 11) {
            return 0;
        }
    /* i = 10 */
        if (fixed_str[10] != input_str[10]) {
            return -1;
        }
        if (10 == 11) {
            return 0;
        }
    /* i = 11 */
        if (fixed_str[11] != input_str[11]) {
            return -1;
        }
        if (11 == 11) {
            return 0;
        }
    return 0;
}

now, we can eliminate all the if (i == STR_LEN -1) statements that result in a false result, as there's no else part in the if statement.

int __attribute__((noinline)) func(char *input_str) {
    char* fixed_str = "fixed string";
    /* i = 0 */
        if (fixed_str[0] != input_str[0]) {
            return -1;
        }
    /* i = 1 */
        if (fixed_str[1] != input_str[1]) {
            return -1;
        }
    /* i = 2 */
        if (fixed_str[2] != input_str[2]) {
            return -1;
        }
    /* i = 3 */
        if (fixed_str[3] != input_str[3]) {
            return -1;
        }
    /* i = 4 */
        if (fixed_str[4] != input_str[4]) {
            return -1;
        }
    /* i = 5 */
        if (fixed_str[5] != input_str[5]) {
            return -1;
        }
    /* i = 6 */
        if (fixed_str[6] != input_str[6]) {
            return -1;
        }
    /* i = 7 */
        if (fixed_str[7] != input_str[7]) {
            return -1;
        }
    /* i = 8 */
        if (fixed_str[8] != input_str[8]) {
            return -1;
        }
    /* i = 9 */
        if (fixed_str[9] != input_str[9]) {
            return -1;
        }
    /* i = 10 */
        if (fixed_str[10] != input_str[10]) {
            return -1;
        }
    /* i = 11 */
        if (fixed_str[11] != input_str[11]) {
            return -1;
        }
        if (11 == 11) {
            return 0;
        }
    return 0;
}

now, we can eliminate redundant code (the last if (11 == 11) test is not necessary, as it results in the same code executed (return 0;) as if it has not been there from the beginning.
we can organize all tests into a || expression, based on the fact that all of them result in the same code executed (return -1) when the expression avaluates to a true result.

int __attribute__((noinline)) func(char *input_str) {
    char* fixed_str = "fixed string";
    /* i = 0 */
        if (fixed_str[0] != input_str[0]
         || fixed_str[1] != input_str[1]
         || fixed_str[2] != input_str[2]
         || fixed_str[3] != input_str[3]
         || fixed_str[4] != input_str[4]
         || fixed_str[5] != input_str[5]
         || fixed_str[6] != input_str[6]
         || fixed_str[7] != input_str[7]
         || fixed_str[8] != input_str[8]
         || fixed_str[9] != input_str[9]
         || fixed_str[10] != input_str[10]
         || fixed_str[11] != input_str[11]) {
            return -1;
        }
    return 0;
}

there's still another level of optimization, based on the fact that fixed_str is a constant string.

int __attribute__((noinline)) func(char *input_str) {
        if ('f' != input_str[0]
         || 'i' != input_str[1]
         || 'x' != input_str[2]
         || 'e' != input_str[3]
         || 'd' != input_str[4]
         || ' ' != input_str[5]
         || 's' != input_str[6]
         || 't' != input_str[7]
         || 'r' != input_str[8]
         || 'i' != input_str[9]
         || 'n' != input_str[10]
         || 'g' != input_str[11]) {
            return -1;
        }
    return 0;
}

and probably you could reduce the test to just three tests with integers, if you realize that you can make the || expression into a && by negating it:

int __attribute__((noinline)) func(char *input_str) {
        if ('f' == input_str[0]
         && 'i' == input_str[1]
         && 'x' == input_str[2]
         && 'e' == input_str[3]
         && 'd' == input_str[4]
         && ' ' == input_str[5]
         && 's' == input_str[6]
         && 't' == input_str[7]
         && 'r' == input_str[8]
         && 'i' == input_str[9]
         && 'n' == input_str[10]
         && 'g' == input_str[11]) {
            return 0;
        }
    return -1;
}

a final note, I'd allow the compiler to inline this function, as it has become so simple that the call/return overhead becomes significative.

Even simpler would be `return memcmp`. You don't need to booleanize memcmp's int return value into 0 or 1 with a `!= 0` compare, you can just return an `int` that's either non-zero for non-match or zero for a match. (`memcmp` returns negative / zero / positive according to the mismatching byte, if any.) — Peter Cordes, May 24 '23 at 10:37
`memcmp` is actually not strictly equivalent. Unlike `strcmp`, `memcmp(a, b, len)` can access `a[len-1]` and `b[len-1]` even if there was a mismatch in an earlier byte. (It's C undefined behaviour in that case: https://en.cppreference.com/w/c/string/byte/memcmp) `memcmp` would fault in practice for a case where `input_str` pointed to a byte other than `'f` at the end of a page, with the next page no mapped. — Peter Cordes, May 24 '23 at 10:41
Yes, you are correct! even just calling `memcmp(3)` and using the prefix string (which is what I guess the function will be used as, as a string prefix detector for fixed strings) and its length will be far more readable and reusable. — Luis Colorado, May 24 '23 at 10:43
Hmmm, I think `memcmp` should rertun as soon as it knows both memory are different, at the first mismatch, so this discards the case you are considering. As soon as it detects a difference in the bytes, it should return, no need to look after. If the behaviour you comment is because a bug in `memcmp()`'s actual design I'm not aware. — Luis Colorado, May 24 '23 at 10:46
`memcmp` is allowed to use wider loads without checking, up to the limit of `len-1`. I double checked since the cppref wording isn't 100% explicit. I single-stepped the asm for a call to x86-64 glibc's `memcmp("fxxxxxxxx", "aaaaaaaaaa", 8)`, and you do end up with a register holding 8 bytes of `fxxxxxxxx` ASCII data, despite them differing on the first byte. In my hand-written asm caller, the `f` was at the very end of a page, but it only checks for page crossing beyond the specified `len`: https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S.html#402 — Peter Cordes, May 24 '23 at 10:56
IMHO, `memcmp` should never go after the lenght specified in it's third parameter in any case... and this is fixed by the constant `STR_LEN`. I'd check the link you post, but you need to convince me, to change the answer. — Luis Colorado, May 24 '23 at 11:08
Hahaha, looking at the code you showed has made me to think about unrolling a loop for later calling a function that has a loop to do the job.... I should start the **Note:** before the substitution of the code by a call to `memcmp()` You have convinced me, but by a different reason :) — Luis Colorado, May 24 '23 at 11:11
`memcmp` doesn't go past the specified length in the 3rd arg. But `STR_LEN` might be longer than `input_str` without having any UB in the original C function, so compilers need to respect that possibility. The original C function will stop shorter `STR_LEN` that in `input_str` on an earlier mismatch, but `memcmp` would fault. Optimizing it into a call to `strcmp` would be safe, since that function is required to work like a char-at-a-time loop without touching bytes past the match. (As least as-if it did that; asm of course knows that memory protection has page granularity.) — Peter Cordes, May 24 '23 at 11:12
Oh, you're right.... but this is the reason I have not used `strncmp()`, as the original code also doesn't check past it.... reverting to `strncmp()` should result in different code, I check all the bytes, even past of a shorter string, as `memcmp()` does... this is how the original loop was conceived. I think you and me are thinking in terms of strings.... and the original code is not using `'\0'` detection at any point. — Luis Colorado, May 24 '23 at 11:15
Yes, a human looking at how the function was designed can make the reasonable decision to change it to `memcmp` and make it unsafe to pass shorter buffers near the end of a page. That's good, although perhaps worth mentioning as a change in corner-case behaviour. And as a reason why a C compiler *can't* make that optimization for you, because it must not break any legal caller, no matter how weird or silly. — Peter Cordes, May 24 '23 at 11:18
If shorter strings are `0`-terminated, you'll have a mismatch of that zero not matching a byte in `"fixed string"`, so the original C function actually won't read past the ends of short C-strings either. Only if you had something like `char buf[3] = {'f', 'i', 'x'};` would you go past the end after matching all the bytes of the input object. — Peter Cordes, May 24 '23 at 11:19
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/253798/discussion-between-luis-colorado-and-peter-cordes). — Luis Colorado, May 24 '23 at 11:40
re: *`int *input_str = (int *)orig_str;`* Please don't suggest code with strict-aliasing undefined behaviour, at least not without mentioning it. Also, you're assuming `int` is `int32_t`? `uint32_t` would be a better choice... or just use `memcmp`; it will inline that way for small convenient sizes like 8+4 - https://godbolt.org/z/53Ebf7s8M shows clang for x86-64 using a 64-bit and 32-bit compare by doing two XORs and ORing those results together to see if there are any non-matching bits. (Also included an 11-byte memcmp: clang does overlapping 8 and 4, gcc overlapping 8 and 8 bytes :/) — Peter Cordes, May 24 '23 at 11:44
You removed the mention of `memcmp` entirely? That was the best part. People should definitely let the compiler inline `memcmp` instead of rolling their own, especially not using `int` with strict-aliasing UB. For correctness and also performance since the optimal strategy varies with different ISAs. 3 separate branches seems unlikely to be optimal. — Peter Cordes, May 24 '23 at 12:03
Yes, because unrolling a loop to use a general function that unrolls it again partially was not a good approach. And more optimizing can be achieved by the 64bit code you showed to me (well, this is architecture dependant, so not just the endiannes `#if` should be used, but also the integer size. — Luis Colorado, May 24 '23 at 12:11
Compilers fully inline `memcmp` for small fixed lengths. As I showed in https://godbolt.org/z/53Ebf7s8M, there's no `call memcmp` in the asm, so the glibc code in https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S.html#402 doesn't run. (Unless you compile with optimization disabled, in which case the hand-written asm will do a much better job that anti-optimized debug-build asm.) — Peter Cordes, May 24 '23 at 12:15
I agree on your last comment, but the sample I used was to show how the compiler steps by doing high level optimization (which is something that the programmer can do, but they never do because they have not been taught to) Probably the programmer was not aware of the pattern you suggest, but our discussion in the comments, and the fact that he/she can see the editions to the answer allows him to see all things together. — Luis Colorado, May 24 '23 at 12:16
But even in general, for longer or non-constant lengths, glibc memcmp is still good. IDK what your complaint is. If you follow the path of execution for small lengths, not many instructions run. It branches on the length a couple times, then loads, compares (or XORs if it had to use non-vector loads near the end of a page), and bit-scans for the position of the difference. So it's non-looping constant time for `n` less than 32, regardless of the mismatch position if any. I don't understand your objection to it, apparently based on the fact that for large `n` it uses an unrolled loop. — Peter Cordes, May 24 '23 at 12:17
my last version was three 32bit integer comparisons as the constant happened to be 12, and the characters can be built using constant expressions... three comparisons and everything ok. — Luis Colorado, May 24 '23 at 12:19
Yeah, good for you. `memcmp(input, fixed_string, 12)` inlines to 2 compares with both GCC and clang for x86-64. And your code has strict-aliasing UB, so it's potentially broken if it inlines into code that accesses the same array with another object type. Like old Linux kernel code that had an unsafe hand-rolled `memcpy`: [gcc, strict-aliasing, and horror stories](https://stackoverflow.com/a/2959468) . You might get away with it because other modifications to the array probably use `char*`, which is specifically allowed to alias anything else. — Peter Cordes, May 24 '23 at 12:23
`input_string` is also not guaranteed to be aligned, so your `int*` mis-use will likely fault on MIPS for example. And it can fail in more subtle ways even when compiling for x86 or other ISAs that allow unaligned loads in asm. See [Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?](https://stackoverflow.com/q/47510783) . You need `typedef unsigned long unaligned_word __attribute__((aligned(1),may_alias));` or similar. — Peter Cordes, May 24 '23 at 12:26
you're right on this last alignment issue. I'll take it off. — Luis Colorado, May 25 '23 at 05:13

How do I force the GCC compiler to unroll loops?

1 Answers1