How to prevent GCC from optimizing out a busy wait loop?

Question

I want to write firmware in C for Atmel AVR microcontrollers and compile it with GCC. I also want to enable compiler optimizations (-Os or -O2): I see no reason not to, and the compiler will probably produce better assembly, far faster, than I could write by hand.

But I want a small piece of code not optimized. I want to delay the execution of a function by some time, and thus I wanted to write a do-nothing loop just to waste some time. No need to be precise, just wait some time.

/* How to NOT optimize this, while optimizing other code? */
unsigned char i, j;
j = 0;
while(--j) {
    i = 0;
    while(--i);
}

Since memory access on AVR is a lot slower than register access, I want i and j to be kept in CPU registers.

Update: I just found util/delay.h and util/delay_basic.h from AVR Libc. Although most times it might be a better idea to use those functions, this question remains valid and interesting.

Maybe there's some sort of a "Sleep" syscall? Maybe you can just embed some assembly logic? — George, Aug 16 '11 at 18:57
Why not to insert something like `volatile asm ("rep; nop;")` to busy-pause by wasting CPU cycles that do nothing? — , Aug 16 '11 at 18:59
Why not putting this piece of code in a function and compiling it with '-O0' separately from the rest of your '-O2' code? And linking them together obviously. — stnr, Aug 16 '11 at 19:00
@George: Just checked. `avr-libc` has no sleep function that just waits some time. Instead, it maps to CPU `sleep` instruction, which starts one of the low-power modes (effectively stopping the CPU). Good idea, nevertheless. — Denilson Sá Maia, Aug 16 '11 at 19:04
People, you are giving solutions in the comments! Add them as answers! :) — Denilson Sá Maia, Aug 16 '11 at 19:05
@Denilson - mostly because they're not full fledged answers, only suggestions on what to try. — KevinDTimm, Aug 16 '11 at 19:07
@Denilson I asked a question; you answering it will maybe lead to a full length answer :) — stnr, Aug 16 '11 at 19:45
@stavnir Well... The only reason to not do that is the amount of extra work required (adding an extra file, changing the Makefile to compile that file with different flags). But, still, your question is already a good answer. — Denilson Sá Maia, Aug 16 '11 at 19:57
@Vlad: there is no `rep` in AVR instruction set. What was it supposed to be? — Denilson Sá Maia, Aug 16 '11 at 19:57
@Denilson: I don't know AVR assembly, that's why I said "something like". There should be something similar, no? — , Aug 16 '11 at 20:00
@Vlad: that's what I asked. What is `rep`? What assembly language were you thinking about when you wrote that? I can't say if there is anything similar if I don't know what you meant in the first place! ;) — Denilson Sá Maia, Aug 16 '11 at 20:09
@Denilson: REP - repeat, see http://en.wikipedia.org/wiki/X86_instruction_listings BTW, here is how to pause on ARM, see (PAUSE, NOP defines) - http://doc.bertos.org/2.6/attr_8h_source.html — , Aug 16 '11 at 21:03
@Vlad: Thanks, but even on x86, it seems `rep` can't be used together with `nop`. And, no, there is no such thing on AVR. — Denilson Sá Maia, Aug 16 '11 at 21:51
@Denilson: It not only __can__ be used, but is actually used all over the place. I've just posted you an example of alternative instructions for ARM. — , Aug 16 '11 at 22:14

score 91 · Accepted Answer · edited May 23 '17 at 12:17

I developed this answer after following a link from dmckee's answer, but it takes a different approach than his/her answer.

Function Attributes documentation from GCC mentions:

noinline This function attribute prevents a function from being considered for inlining. If the function does not have side-effects, there are optimizations other than inlining that causes function calls to be optimized away, although the function call is live. To keep such calls from being optimized away, put asm (""); in the called function, to serve as a special side-effect.

This gave me an interesting idea... Instead of adding a nop instruction at the inner loop, I tried adding an empty assembly code in there, like this:

unsigned char i, j;
j = 0;
while(--j) {
    i = 0;
    while(--i)
        asm("");
}

And it worked! That loop has not been optimized-out, and no extra nop instructions were inserted.

What's more, if you use volatile, gcc will store those variables in RAM and add a bunch of ldd and std to copy them to temporary registers. This approach, on the other hand, doesn't use volatile and generates no such overhead.

Update: If you are compiling with -ansi or a strict ISO -std option (for example -std=c99), you must replace the asm keyword with __asm__, as described in the GCC documentation.

In addition, you can use __asm__ __volatile__("") if your assembly statement must execute where you put it (i.e. it must not be moved out of a loop as an optimization).
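As a minimal sketch of the points above, the same loop can be wrapped in a function and written with the __asm__ __volatile__ spelling so that it also compiles under -std=c99. The function name delay_empty_asm and the returned value are assumptions added for illustration; they are not part of the original answer.

```c
/* Hypothetical helper: the nested busy-wait loop from above, using the
 * __asm__ spelling that is accepted under -ansi and -std=c99. */
unsigned char delay_empty_asm(void)
{
    unsigned char i, j;

    j = 0;
    while (--j) {                        /* j wraps 0 -> 255: 255 outer passes */
        i = 0;
        while (--i)                      /* 255 inner passes per outer pass */
            __asm__ __volatile__("");    /* emits no instruction; must stay in place */
    }
    return j;                            /* 0 once both loops have finished */
}
```

The return value exists only so that a caller can observe that the loops ran to completion; a plain void function works just as well for a delay.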

How would you do this in visual studio? `__asm{""}` will not work — woosah, Jul 15 '15 at 23:04
I'm a bit concerned with loop unrolling, I would recommend this version with explicit data dependencies: https://stackoverflow.com/a/58758133/895245 — Ciro Santilli OurBigBook.com, Nov 07 '19 at 23:53

score 34 · Answer 2 · answered Aug 16 '11 at 19:30

Declare the i and j variables as volatile. This will prevent the compiler from optimizing away code that involves these variables.

unsigned volatile char i, j;

ks1322


Although this works, it has a side-effect of forcing those variables into memory. Thus, GCC will read and write them on every loop iteration, adding quite a lot of overhead. (anyway, if I want extremely fine-grained control, I should write assembly directly) – Denilson Sá Maia Aug 30 '11 at 03:05
@DenilsonSá On the other hand, forcing a memory access will ensure the wait always takes the same time, independently if the value is 16bits encodable or not. – Oswin Apr 15 '16 at 11:46
@Oswin, can you please elaborate? What do you mean by "if the value is 16bits encodable or not". What is the "value" you are talking about? And encodable into what? – Denilson Sá Maia Apr 15 '16 at 13:03
Just curious, could you mark them as `volatile register` to avoid them getting put in memory? – Mark K Cowan Jun 10 '17 at 16:21
Not a good idea as writing those variables to memory every time adds extra delay and therefore is less fine grained than a loop with a register loop counter, not to mention that it puts stress on the cache / databus while all you want to do is "nothing". – Carlo Wood Sep 28 '19 at 20:38
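For comparison, here is a small sketch of the volatile variant discussed in this answer and its comments. The function name delay_volatile and its outer parameter are hypothetical, added only to make the fragment self-contained. Because both counters are volatile, every decrement is a real load and store, which is the overhead the comments describe.

```c
/* Hypothetical helper: busy-wait with volatile counters.
 * Each access to i and j must really happen, so the compiler keeps
 * the loops, at the cost of memory traffic on every pass. */
unsigned char delay_volatile(unsigned char outer)
{
    volatile unsigned char i, j;

    for (j = outer; j != 0; j--) {   /* `outer` outer passes */
        i = 0;
        while (--i)                  /* 255 inner passes: i wraps 0 -> 255 */
            ;
    }
    return j;                        /* 0 once the loops have finished */
}
```

Calling delay_volatile(3) would run 3 * 255 inner passes; the actual time per pass depends on the target and the optimization level.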

Ciro Santilli OurBigBook.com · Answer 3 · 2020-11-18T15:18:14.883

Empty __asm__ statements are not enough: better use data dependencies

Like this:

main.c

int main(void) {
    unsigned i;
    for (i = 0; i < 10; i++) {
        __asm__ volatile("" : "+g" (i) : :);

    }
}

Compile and disassemble:

gcc -O3 -ggdb3 -o main.out main.c
gdb -batch -ex 'disas main' main.out

Output:

   0x0000000000001040 <+0>:     xor    %eax,%eax
   0x0000000000001042 <+2>:     nopw   0x0(%rax,%rax,1)
   0x0000000000001048 <+8>:     add    $0x1,%eax
   0x000000000000104b <+11>:    cmp    $0x9,%eax
   0x000000000000104e <+14>:    jbe    0x1048 <main+8>
   0x0000000000001050 <+16>:    xor    %eax,%eax
   0x0000000000001052 <+18>:    retq

I believe that this is robust, because it places an explicit data dependency on the loop variable i, as suggested at: Enforcing statement order in C++. As the disassembly above shows, it also produces the desired loop.

This marks i as an input and output of inline assembly. Then, inline assembly is a black box for GCC, which cannot know how it modifies i, so I think that really can't be optimized away.

If I do the same with an empty __asm__ as in:

bad.c

int main(void) {
    unsigned i;
    for (i = 0; i < 10; i++) {
        __asm__ volatile("");
    }
}

it appears to completely remove the loop and outputs:

   0x0000000000001040 <+0>:     xor    %eax,%eax
   0x0000000000001042 <+2>:     retq

Also note that __asm__("") and __asm__ volatile("") should be the same since there are no output operands: The difference between asm, asm volatile and clobbering memory

What is happening becomes clearer if we replace it with:

__asm__ volatile("nop");

which produces:

   0x0000000000001040 <+0>:     nop
   0x0000000000001041 <+1>:     nop
   0x0000000000001042 <+2>:     nop
   0x0000000000001043 <+3>:     nop
   0x0000000000001044 <+4>:     nop
   0x0000000000001045 <+5>:     nop
   0x0000000000001046 <+6>:     nop
   0x0000000000001047 <+7>:     nop
   0x0000000000001048 <+8>:     nop
   0x0000000000001049 <+9>:     nop
   0x000000000000104a <+10>:    xor    %eax,%eax
   0x000000000000104c <+12>:    retq

So we see that GCC just loop unrolled the nop loop in this case because the loop was small enough.

So, if you rely on an empty __asm__, you would be relying on hard-to-predict GCC binary size/speed tradeoffs, which, if applied optimally, should always remove the loop for an empty __asm__ volatile(""); which has code size zero.

noinline busy loop function

If the loop size is not known at compile time, full unrolling is not possible, but GCC could still decide to unroll in chunks, which would make your delays inconsistent.

Putting that together with Denilson's answer, a busy loop function could be written as:

void __attribute__ ((noinline)) busy_loop(unsigned max) {
    for (unsigned i = 0; i < max; i++) {
        __asm__ volatile("" : "+g" (i) : :);
    }
}

int main(void) {
    busy_loop(10);
}

which disassembles at:

Dump of assembler code for function busy_loop:
   0x0000000000001140 <+0>:     test   %edi,%edi
   0x0000000000001142 <+2>:     je     0x1157 <busy_loop+23>
   0x0000000000001144 <+4>:     xor    %eax,%eax
   0x0000000000001146 <+6>:     nopw   %cs:0x0(%rax,%rax,1)
   0x0000000000001150 <+16>:    add    $0x1,%eax
   0x0000000000001153 <+19>:    cmp    %eax,%edi
   0x0000000000001155 <+21>:    ja     0x1150 <busy_loop+16>
   0x0000000000001157 <+23>:    retq   
End of assembler dump.
Dump of assembler code for function main:
   0x0000000000001040 <+0>:     mov    $0xa,%edi
   0x0000000000001045 <+5>:     callq  0x1140 <busy_loop>
   0x000000000000104a <+10>:    xor    %eax,%eax
   0x000000000000104c <+12>:    retq   
End of assembler dump.

Here the volatile is needed to mark the assembly as potentially having side effects, since in this case we have an output operand.

A double loop version could be:

void __attribute__ ((noinline)) busy_loop(unsigned max, unsigned max2) {
    for (unsigned i = 0; i < max2; i++) {
        for (unsigned j = 0; j < max; j++) {
            __asm__ volatile ("" : "+g" (i), "+g" (j) : :);
        }
    }
}

int main(void) {
    busy_loop(10, 10);
}

GitHub upstream.

Tested in Ubuntu 19.04, GCC 8.3.0.

score 8 · Answer 4 · answered Sep 01 '11 at 20:31

I'm not sure why it hasn't been mentioned yet that this approach is completely misguided and easily broken by compiler upgrades, etc. It would make a lot more sense to determine the time value you want to wait until and spin polling the current time until the desired value is exceeded. On x86 you could use rdtsc for this purpose, but the more portable way would be to call clock_gettime (or the variant for your non-POSIX OS) to get the time. Current x86_64 Linux will even avoid the syscall for clock_gettime and use rdtsc internally. Or, if you can handle the cost of a syscall, just use clock_nanosleep to begin with...
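The time-polling idea described here could look roughly like the following on a POSIX system. This is a sketch under assumptions: the function name spin_wait_ns is hypothetical, CLOCK_MONOTONIC is assumed to be available, and older glibc versions may need -lrt at link time. It does not apply to bare-metal AVR, where there is no clock_gettime.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* request clock_gettime() under strict -std=c99 */
#endif
#include <time.h>

/* Hypothetical helper: spin until at least `ns` nanoseconds have passed
 * on the monotonic clock, then return the elapsed time in nanoseconds. */
long long spin_wait_ns(long long ns)
{
    struct timespec start, now;
    long long elapsed;

    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
        elapsed = (long long)(now.tv_sec - start.tv_sec) * 1000000000LL
                + (long long)(now.tv_nsec - start.tv_nsec);
    } while (elapsed < ns);

    return elapsed;
}
```

Because the loop condition depends on the clock rather than on an iteration count, the compiler cannot remove it, and the delay does not change with optimization level or compiler version.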

score 3 · Answer 5 · edited Aug 19 '20 at 21:53

For me, on GCC 4.7.0, the empty asm was optimized away anyway with -O3 (I didn't try -O2), and using an i++ on a register or volatile variable resulted in a big performance penalty (in my case).

What I did was link against another, empty function that the compiler couldn't see when compiling the "main program".

Basically this:

I created "helper.c" with this empty function defined:

void donotoptimize(){}

Then I compiled it with gcc helper.c -c -o helper.o, called it from the loop:

while (...) { donotoptimize();}

and linked it via gcc my_benchmark.cc helper.o.

This gave me the best results (and, I believe, no overhead at all, but I can't test that because my program won't work without it :) )

I think it should work with icc too. Maybe not if you enable link-time optimization, but with plain gcc it does.
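A single-file variant of the same idea follows the GCC noinline documentation quoted in the accepted answer: mark the empty function noinline and put an empty asm inside it. This is only a sketch; the name delay_calls and the returned count are assumptions added for illustration, and it has not been checked against the GCC 4.7.0 setup described above.

```c
/* Empty function that the optimizer must still call: noinline keeps it
 * out of the caller, and the empty asm acts as a side effect. */
__attribute__((noinline)) void donotoptimize(void)
{
    __asm__("");
}

/* Hypothetical helper: call the empty function `count` times. */
unsigned delay_calls(unsigned count)
{
    unsigned i;

    for (i = 0; i < count; i++)
        donotoptimize();
    return i;               /* number of calls made */
}
```

Each pass costs a call and a return, so the delay per iteration is longer than with an inline empty asm.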

Could you please provide more details? I linked a "helper.o" with empty function declared to my `benchmark.c`, but gcc still optimize the loop in `benchmark.c`. — axiqia, Jul 29 '21 at 02:47

score 3 · Answer 6 · answered Aug 16 '11 at 19:30

I don't know off the top of my head if the avr version of the compiler supports the full set of #pragmas (the interesting ones in the link all date from gcc version 4.4), but that is where you would usually start.

dmckee --- ex-moderator kitten


Do you happen to know which GCC option enables/disables optimizing-out that do-nothing loop? I tried using `#pragma GCC optimize 0` (followed by `#pragma GCC reset_options` after the function), but it disabled ALL optimizations (as expected). It would have been better to disable just that one. – Denilson Sá Maia Aug 16 '11 at 20:04
pragma only works for subsequently defined functions (and works at the function level). – Foo Bah Aug 16 '11 at 20:21
I mean... `optimize 0` was too much, it didn't even store those variables in registers (they were kept in memory). So, if I knew which gcc `-f` option disables the removing of that do-nothing loop, I could disable only that option for that function. That would be great! – Denilson Sá Maia Aug 16 '11 at 22:13
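The per-function pragma approach discussed in this answer and its comments can be sketched as follows, assuming GCC 4.4 or later, where #pragma GCC push_options, #pragma GCC optimize and #pragma GCC pop_options are documented. The function name delay_unoptimized is hypothetical. As the comments point out, this turns off all optimizations for the function, so the counter may live in memory rather than in a register.

```c
#pragma GCC push_options
#pragma GCC optimize ("O0")          /* applies to functions defined below */

/* Hypothetical helper: an empty counting loop compiled at -O0,
 * so GCC leaves it in place even when the rest of the file uses -O2. */
unsigned delay_unoptimized(unsigned count)
{
    unsigned i;

    for (i = 0; i < count; i++) {
        /* intentionally empty */
    }
    return i;                        /* number of passes executed */
}

#pragma GCC pop_options              /* restore the command-line options */
```

Other compilers may ignore these pragmas, in which case the loop is subject to the normal optimization level again.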

score 1 · Answer 7 · answered Aug 17 '11 at 19:18

Using a volatile asm statement should help. You can read more about it here:

http://www.nongnu.org/avr-libc/user-manual/optimization.html

If you are working on Windows, you can also try putting the code under pragmas, as explained in detail here:

https://www.securecoding.cert.org/confluence/display/cplusplus/MSC06-CPP.+Be+aware+of+compiler+optimization+when+dealing+with+sensitive+data

Hope this helps.

Groovy

As mentioned in a comment on [this answer](http://stackoverflow.com/a/7083874/1593077), using `volatile` has the side effect of forcing these variables into memory, with LD and ST instructions. – einpoklum Jan 14 '16 at 20:20

score 0 · Answer 8 · answered Aug 16 '11 at 22:16

Put that loop in a separate .c file and do not optimize that one file. Even better, write that routine in assembler and call it from C. Either way, the optimizer won't get involved.

I sometimes do the volatile thing, but normally I create an asm function that simply returns and put a call to that function in the loop. The optimizer will make the for/while loop tight, but it won't optimize it out, because it has to make all the calls to the dummy function. The nop answer from Denilson Sá does the same thing but even tighter...

old_timer

Nice idea. Can you show example how to do it? I never understood these makefiles... – Kamil Jan 27 '13 at 01:42
@Kamil, a very short explanation, run `gcc -c [your flags here] -o foo.o foo.c` to compile the source into an object, then run `gcc [other flags here] -o foo.elf foo.o bar.o` to link all object files together. Feel free to check the `Makefile`s from my AVR projects: [atmega8-blinking-leds](http://bitbucket.org/denilsonsa/atmega8-blinking-leds/src/tip/Makefile.tpl), [atmega8-hidkeys-helloworld](http://bitbucket.org/denilsonsa/atmega8-hidkeys-helloworld/src/tip/Makefile), [atmega8-magnetometer-usb-mouse](http://bitbucket.org/denilsonsa/atmega8-magnetometer-usb-mouse/src/tip/firmware/Makefile) – Denilson Sá Maia Nov 07 '13 at 19:26
The problem with the compile-separately approach is that it introduces the overhead of a function call and return, which can be quite a bit for the very short wait loops that this sort of thing is useful for. – Donal Fellows Aug 20 '20 at 14:13
the problem with volatile is that it causes the overhead of volatile which can be quite a bit for the very short wait loops that this sort of thing is useful for. The right answer is to just do this in pure asm and not mess with volatile or inline, but since this kind of timing is not generally accurate, definitely not if you are using C, then a little longer or shorter is fine. If you want to be more accurate you have to use dedicated hardware to time things. – old_timer Aug 20 '20 at 15:40

score -1 · Answer 9 · answered Sep 06 '12 at 13:14

You can also use the register keyword. It asks the compiler to keep the declared variables in CPU registers, although it is only a hint.

In your case:

register unsigned char i, j;
j = 0;
while(--j) {
    i = 0;
    while(--i);
}

Michel Megens

Reserving 2 of 32 microcontroller registers for a silly loop is a bad idea. – Kamil Jan 27 '13 at 01:40
