will gcc optimization remove for loop if it's only one iteration?

Question

Im writing a real time DSP processing library. My intention is to give it a flexibility to define input samples blockSize, while also having best possible performance in case of sample-by-sample processing, that is - single sample block size

I think I have to use volatile keyword defining loop variable since data processing will be using pointers to Inputs/Outputs.

This leads me to a question:

Will gcc compiler optimize this code

int blockSize = 1;
for (volatile int i=0; i<blockSize; i++)
{
 foo()
}

or

//.h
#define BLOCKSIZE 1

//.c
for (volatile int i=0; i<BLOCKSIZE; i++)
{
 foo()
}

to be same as simply calling body of the loop:

foo()

?

Thx

Why did you defined `i` as `volatile`? It does not seems useful. — Jérôme Richard, May 12 '21 at 08:57
no, `volatile int i` prevents this optimization, remove it and it will be optimized, at least with gcc10: https://godbolt.org/z/YWsvGncY9 — mch, May 12 '21 at 09:00

Lundin · Accepted Answer · 2021-05-12T09:42:37.387

4

I think I have to use volatile keyword defining loop variable since data processing will be using pointers to Inputs/Outputs.

No, that doesn't make any sense. Only the input/output hardware registers themselves should be volatile. Pointers to them should be declared as pointer-to-volatile data, ie volatile uint8_t*. There is no need to make the pointer itself volatile, ie uint8_t* volatile //wrong.

As things stand now, you force the compiler to create a variable i and increase it, which will likely block loop unrolling optimizations.

Trying your code on gcc x86 with -O3 this is exactly what happens. No matter the size of BLOCKSIZE, it still generates the loop because of volatile. If I drop volatile it completely unrolls the loop up to BLOCKSIZE == 7 and replace it with a number of function calls. Beyond 8 it creates a loop (but keeps the iterator in a register instead of RAM).

x86 example:

for (int i=0; i<5; i++)
{
  foo();
}

gives

    call    foo
    call    foo
    call    foo
    call    foo
    call    foo

But

for (volatile int i=0; i<5; i++)
{
  foo();
}

gives way more inefficient

        mov     DWORD PTR [rsp+12], 0
        mov     eax, DWORD PTR [rsp+12]
        cmp     eax, 4
        jg      .L2
.L3:
        call    foo
        mov     eax, DWORD PTR [rsp+12]
        add     eax, 1
        mov     DWORD PTR [rsp+12], eax
        mov     eax, DWORD PTR [rsp+12]
        cmp     eax, 4
        jle     .L3
.L2:

For further study of the correct use of volatile in embedded systems, please see:

edited May 12 '21 at 09:42

answered May 12 '21 at 09:01

Lundin

195,001
40
254
396

2

@0___________ If the register is defined like for example `volatile uint8_t reg;` then C gives you no other option but do declare the pointer as `volatile uint8_t* ptr = ®`. Whereas `uint8_t* ptr = (uint8_t*)®` is explicitly undefined behavior and a bug. See C17 6.7.3 "If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined." – Lundin May 12 '21 at 09:31
1

And of course not only hardware registers should be volatile. To illustrate the problem see the code in this example: https://godbolt.org/z/q93e43rYs . Without volatile compiler assumes that `counter` is not modified by anything and optimizes it to the dead loop. This is a very common case in the uC programming – 0___________ May 12 '21 at 09:31
1

@0___________ Similarly if you don't define registers through the linker but through macros: `#define reg (*(volatile uint8_t*)0x1234)`. Again the pointed-at data must be set `volatile` or the register will not be treated as such. – Lundin May 12 '21 at 09:33
1

@0___________ If your remark was that we shouldn't volatile qualify the pointer itself `uint8_t* volatile ptr` then I completely agree with you, that's nonsense. I'll update the post so there's no misunderstanding, 1 sec. – Lundin May 12 '21 at 09:36
1

@0___________ Updated. Clearer now? – Lundin May 12 '21 at 09:39
Also **`Only the input/output hardware registers`** it is also wrong. Many objects changed by the interrupt routines or other tasks have to be volatile if there is no direct program execution path tho the code which changes them - as per godbolt.org/z/q93e43rYs – 0___________ May 12 '21 at 09:39
Deleted some comments. But there is one other issue as well – 0___________ May 12 '21 at 09:40
@0___________ I added some links for further study. – Lundin May 12 '21 at 09:43
It can optimize despite of volatile, as the variable is a local scope that is not used anywhere in the loop body. Clang eliminates the loop completely. – Luis Colorado May 14 '21 at 22:01
@LuisColorado No. It is used since it is accessed. It is a side effect and a compiler is not allowed to remove side effects from a program. If clang removed the loop it is not conforming. It would appear that clang in general has plenty of bugs related to loop optimizations, see for example this: https://stackoverflow.com/questions/59925618/how-do-i-make-an-infinite-empty-loop-that-wont-be-optimized-away. – Lundin May 16 '21 at 08:52

score 2 · Answer 2 · answered May 12 '21 at 08:59

2

Since the loop variable is volatile it shouldn't optimize it. The compiler can not know wether i will be 1 when the condition is evaluated, so it has to keep the loop. From the compiler point of view, the loop can run an indeterminite number of times until the condition is satisfied.

If you somehwere access hardware registers, then those should be declared volatile, which would make more sense, to the reader, and also allows the compiler to apply appropriate optimizations where possible.

answered May 12 '21 at 08:59

Devolus

21,661
13
66
113

I'm afraid the compiler doesn't optimize variables based in volatile or not volatile as you say.... just declaring a variable `volatile` by itself doesn't make the compiler to decide on optimizing or not, it's the use of the variable in the code that makes the compiler to optimize or not. See mi answer. – Luis Colorado May 14 '21 at 21:58

score 2 · Answer 3 · answered May 12 '21 at 09:22

volatile keyword says the compiler that the variable is side effects prone - ie it can be changed by something which is not visible for the compiler.

Because of that volatile variables have to read before every use and saved to their permanent storage location after every modification.

In your example the loop cannot be optimized as variable i can be changed during the loop (for example some interrupt routine will change it to zero so the loop will have to be executed again.

Luis Colorado · Answer 4 · 2021-05-14T22:09:22.747

The answer to your question is: If the compiler can determine that every time you enter the loop, it will execute only once, then it can eliminate the loop.

Normally, the optimization phase unrolls the loops, based on how the iterations relate to one another, this makes your (e.g. indefinite) loop to get several times bigger, in exchange to avoid the back loops (that normally result in a bubble in the pipeline, depending on the cpu type) but not too much to lose cache hits.... so it is a bit complicate... but the earnings are huge. But if your loop executes only once, and always, is normally because the test you wrote is always true (a tautology) or always false (impossible fact) and can be eliminated, this makes the jump back unnecessary, and so, there's no loop anymore.

int blockSize = 1;
for (volatile int i=0; i<blockSize; i++)
{
 foo(); // you missed a semicolon here.
}

In your case, the variable is assigned a value, that is never touched anymore, so the first thing the compiler is going to do is to replace all expressions of your variable by the literal you assigned to it. (lacking context I assume blocsize is a local automatic variable that is not changed anywhere else) Your code changes into:

for (volatile int i=0; i<1; i++)
{
    foo();
}

the next is that volatile is not necessary, as its scope is the block body of the loop, where it is not used, so it can be replaced by a sequence of code like the following:

do {
     foo();
} while (0);

hmmm.... this code can be replaced by this code:

    foo();

The compiler analyses each data set analising the graph of dependencies between data and variables.... when a variable is not needed anymore, assigning a value to it is not necessary (if it is not used later in the program or goes out of life), so that code is eliminated. If you make your compiler to compile a for loop frrom 1 to 2^64, and then stop. and you optimize the compilation of that,, you will see you loop being trashed up and will get the false idea that your processor is capable of counting from 1 to 2^64 in less than a second.... but that is not true, 2^64 is still very big number to be counted in less than a second. And that is not a one fixed pass loop like yours.... but the data calculations done in the program are of no use, so the compiler eliminates it.

Just test the following program (in this case it is not a test of a just one pass loop, but 2^64-1 executions):

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main()
{
    uint64_t low = 0UL;
    uint64_t high = ~0UL;

    uint64_t data = 0; // this data is updated in the loop body. 
    printf("counting from %lu to %lu\n", low, high);

    alarm(10); /* security break after 10 seconds */

    for (uint64_t i = low; i < high; i++) {
#if 0
        printf("data = $lu\n", data = i ); // either here... 
#else
        data = i; // or here...
#endif
    }
    return 0;
}

(You can change the #if 0 to #if 1 to see how the optimizer doesn't eliminate the loop when you need to print the results, but you see that the program is essentially the same, except for the call to printf with the result of the assignment)

Just compile it with/without optimization:

$ cc -O0 pru.c -o pru_noopt
$ cc -O2 pru.c -o pru_optim

and then run it under time:

$ time pru_noopt
counting from 0 to 18446744073709551615
Alarm clock

real    0m10,005s
user    0m9,848s
sys     0m0,000s

while running the optimized version gives:

$ time pru_optim
counting from 0 to 18446744073709551615

real    0m0,002s
user    0m0,002s
sys     0m0,002s

(impossible, neither the best computer can count one after the other, upto that number in less than 2 milliseconds) so the loop must have gone somewhere else. You can check from the assembler code. As the updated value of data is not used after assignment, the loop body can be eliminated, so the 2^64-1 executions of it can also be eliminated.

Now add the following line after the loop:

    printf("data = %lu\n", data);

You will see that then, even with the -O3 option, will get the loop untouched, because the value after all the assignments is used after the loop.

(I preferred not to show the assembler code, and remain in high level, but you can have a look at the assembler code and see the actual generated code)

will gcc optimization remove for loop if it's only one iteration?

4 Answers4