C++ constant folding a loop for prime numbers

Question

Having had a look at previous questions 1, 2 , I was wondering if I can force the compiler to perform a constant folding for the following code which prints prime numbers.

#include <iostream>

using namespace std;

inline bool is_prime(int n)
{
    if(n<2)
        return false;
    for(int i=2;i*i<=n;i++)
        if(n%i==0)
            return false;
    return true;
}

int main()
{
    for(int i=0;i<20;i++)
        if(is_prime(i))
            cout<<i<<endl;
    return 0;
}

And I build it via:

g++ -O3 -S main.cpp -o main.asm

As the result are a few:

2,3,5,7,11,13,17,19

I would like to force compiler look at the code similar to

for(int x:{2,3,5,7,11,13,17,19})
    cout<<x<<endl;

or

cout<<  2 <<endl;
cout<<  3 <<endl;
cout<<  5 <<endl;
cout<<  7 <<endl;
cout<< 11 <<endl;
cout<< 13 <<endl;
cout<< 17 <<endl;
cout<< 19 <<endl;

But reading the assembly shows that none happens.

I even used __builtin_expect but it didn't work.

Is there any way to force the compiler optimizer to read the for loop and use the advantage that the output data is known?

I would like to do it without using template meta-programming.

PS. My real aim is just testing the compiler rather than an efficient method to calculate prime numbers. I just want to show off to my friends how much C++ compiler is powerful.

If separation of is_prime is matter of concern, I put everything inside the main and no difference was observed:

#include <iostream>

using namespace std;

int main()
{
    for(int n=2;n<20;n++)
    {
        bool is_prime=true;
        for(int i=2;i*i<=n;i++)
            if(n%i==0)
            {
                is_prime=false;
                break;
            }
        if(is_prime)
            cout<<n<<endl;
    }
    return 0;
}

There is even a further example that remains no excuse for the compiler:

#include <iostream>
#include <vector>

using namespace std;

int prime_after6000()
{
    int n=6000;
    do
    {
        bool is_prime=true;
        for(int i=2;i*i<=n;i++)
            if(n%i==0)
            {
                is_prime=false;
                break;
            }
        if(is_prime)
            return n;
        n++;
    }while(true);
}

int main()
{
    cout<<prime_after6000()<<endl;
    return 0;
}

assembly:

...
main:
.LFB1907:
    .cfi_startproc
    subq    $8, %rsp
    .cfi_def_cfa_offset 16
    movl    $6000, %esi   ;;;;;;;;;;;;;;;;;;;; bad
.L18:
    testb   $1, %sil
    je  .L15
    movl    $2, %ecx
    jmp .L16
    .p2align 4,,10
    .p2align 3
.L17:
    movl    %esi, %eax
    cltd
    idivl   %ecx
    testl   %edx, %edx
    je  .L15
.L16:
    addl    $1, %ecx
    movl    %ecx, %eax
    imull   %ecx, %eax
    cmpl    %esi, %eax
    jle .L17
    movl    $_ZSt4cout, %edi
    call    _ZNSolsEi
    movq    %rax, %rdi
    call    _ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_
    xorl    %eax, %eax
    addq    $8, %rsp
    .cfi_remember_state
    .cfi_def_cfa_offset 8
    ret
    .p2align 4,,10
    .p2align 3
.L15:
    .cfi_restore_state
    addl    $1, %esi
    jmp .L18
    .cfi_endproc
.LFE1907:
    .size   main, .-main
    .p2align 4,,15
    .type   _GLOBAL__sub_I__Z15prime_after6000v, @function
_GLOBAL__sub_I__Z15prime_after6000v:
...

Note that you redo a significant amount of work every time you call `is_prime`. You could call `is_prime(19);`, cache all the primes found, and just look in the cache for subsequent calls. Search term: memoization. — user4581301, Feb 17 '18 at 08:34
I assume it's the loop inside `is_prime` you want to unroll? Unfortunately that's *impossible* (with the code you currently show). The C++ compiler only sees the current [*translation unit*](https://en.wikipedia.org/wiki/Translation_unit_(programming)). It doesn't know if any other *translation units* will be linked that calls the `is_prime` function. It might be possible to do some more optimizations if you make the function `inline` or `static`, or placed in an anonymous namespace, since then the function can't be called from other translation units. — Some programmer dude, Feb 17 '18 at 08:34
@user4581301, this is just an example to see if a compiler can do something smart rather than I do. — ar2015, Feb 17 '18 at 08:35
Also investigate prime number sieving algorithms. They are brutally fast, but generally limited in size. For<20, this is not an issue. — user4581301, Feb 17 '18 at 08:36
@Someprogrammerdude, I tested the inline keyword. It does not make any change. — ar2015, Feb 17 '18 at 08:37
The compiler may unroll the `main` loop, but it might also see no point if doing so blows the instruction cache or gains little compared to the weight of the prime number seeking. — user4581301, Feb 17 '18 at 08:39
@user4581301, do you mean I should test it on big numbers? I just tested `6000000` it did not work as well. — ar2015, Feb 17 '18 at 08:40
@Someprogrammerdude Sorry, I don't understand why other translation units linking `is_prime()` hinders unrolling. — JFMR, Feb 17 '18 at 08:44
@水飲み鳥 The currently show TU calls `is_prime` with at most `19` as argument. What if another TU calls it with `423`? How could the compiler unroll the loop *inside* `is_prime` to accommodate both calls (one of which the compiler knows nothing about)? — Some programmer dude, Feb 17 '18 at 08:47
@Someprogrammerdude Sorry, I was wrong and I get it now, you were referring to the **loop inside** `is_prime`. Thanks for the more detailed explanation. — JFMR, Feb 17 '18 at 08:51
No to the large numbers. What I mean is the cost of doing the work in the loop may dwarf the cost of the loop. At that point why bother optimizing out the loop overhead? It barely shows up. In addition, the cost of loading the instructions repeated by unrolling the loop into the instruction cache could exceed any savings from the unrolling. — user4581301, Feb 17 '18 at 09:15
@Someprogrammerdude the function is `inline`, so other TUs would get their own copy. It is also actually inlined in `main`, so this particular copy is not visible outside of `main`. — n. m. could be an AI, Feb 17 '18 at 09:21
@n.m. The OP have edited the question somewhat since my first comments. — Some programmer dude, Feb 17 '18 at 09:23
You may succeed in doing this optimization by changing the optimization parameters on GCC: --param max-unroll-times=100, ... See the [doc](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) (There is a long list of optimization parameters that limit unrolling, this one alone may not be sufficient). — Oliv, Feb 17 '18 at 12:11
@Oliv, I tried it. It didnt't work unfortunately. `--param max-unroll-times=10000` — ar2015, Feb 17 '18 at 12:14
When GCC will unrool the loop, it will fall on (many) other limits. Unfortunatly I don't know more on the subject. — Oliv, Feb 17 '18 at 12:17
Given enough information, which isn't present in this case due to separate completion as noted above, really good optimizers have been able to do similar things to what you describe for several decades: indeed I was taught about one in 1971; but this is not constant folding. — user207421, Feb 18 '18 at 23:48
@EJP, What is the correct term to replace `constant folding`? Also isn't `gcc` representing a modern compiler? — ar2015, Feb 19 '18 at 00:20

score 2 · Answer 1 · answered Feb 18 '18 at 18:23

There's a fundamental misunderstanding of compilers here. Let's examine the program you've written very carefully and think about what you're expecting from the compiler to do for you.

The main characteristic of the program is that it doesn't take any input, but it emits output by writing to cout. Keep in mind that the is_prime function is not a compiler intrinsic; the compiler treats it as just another function. This is important and I'll come back to it later.

Now how would the compiler transform the program the way you described? How can it do something like that? That is, how can the compiler transform those two nested loops to a simple sequence of statements that write integers to cout? The only way it can possibly do that is by executing the program to figure out all the values that needs to be written to cout.

That doesn't make any sense, does it? Let's look at the big picture here and consider all programs (or sequences of statements) that have the same characteristic; those that do not take any input but emit output. The question would become: Why doesn't the compiler execute the source code and just emit code that writes the output values? Because of the following reasons:

What if the program takes too much time to execute? What if there was a bug in it that makes it run for an unexpected amount of time? There is even no guarantee that the program will ever halt. What's the compiler supposed to do exactly?
Ultimately, the purpose of the compiler is not to execute the source code, but to emit a functionally equivalent native code that is potentially optimized. After all, if the program does not take any input (like your code), you could just, as easily, compile the code and run it once to see the output. The code has to be executed anyway, by the compiler or by running the executable binary, and it will take the same amount of time either way. In both cases, the code has to be compiled and executed. So such optimization adds no real value. However, this is in contrast to templates, which are intended to be reduced to regular code by the compiler at compile-time. In addition, interpretation would be a better execution model for such programs. You don't want to bother with compiling code? Go ahead and use a Python interpreter or whatever interpreter.
Implementing such optimization can be difficult or impossible in general. What if the mechanism used to emit output had side-effects that can change future output values? The compiler does not exactly know what happens when you write to cout. The standard output stream can be redirected to something weird. So programs that don't take any input aren't necessarily easier for the compiler to optimize.

That said, simple pieces of code that can be evaluated in a very limited amount of time are indeed evaluated by the compiler. This optimization is called constant folding. Pieces of code that don't have any impact on the state of the program can be eliminated without executing them. For example, if you removed cout<<i<<endl;, the compiler will just optimize away the rest of the code. This is called dead code elimination. Compilers do these optimizations because they can be done by the compiler in a systematic way and because they are very effective on real codebases.

But what happens if the is_prime function was a compiler intrinsic? In this case, the compiler would probably have a built-in table of commonly used prime numbers and a very fast implementation of primality testing. You can then expect from the compiler to unroll the loop in the main function several times, maybe even fully, containing only output statements, essentially performing the transformation you're looking for.

such a fixation on old, 80s (60s?) thinking. what's wrong with speculative execution? what's wrong with a "compiler" constantly running in the background trying to improve the executable? why limit the time allotted for the compiler, so that it can do its job super fast, and then what, our computer would just sit idle for hours before we run the exe? is that reasonable? it was, in the past. isn't it time to change this? why think in terms of executables at all and not the running living gradually self-recompiling computing systems? standard compilation is such an outdated paradigm! /ranting — Will Ness, Feb 18 '18 at 21:06
[cf](https://groups.google.com/forum/#!search/Anton$20van$20Straaten$20group$3Acomp.lang.scheme$20primes/comp.lang.scheme/RwNeCn17NKs/mGRXWfVeUn0J) — Will Ness, Feb 18 '18 at 21:10
@WillNess How are speculative execution and recompilation pertinent here exactly? These are indeed very useful, but in other scenarios, I don't see the connection here. Also, these techniques are at least two decades old. Being old doesn't mean they're bad. Fine, the C/C++ compilation model is from the 80s, so what? Moreover, the lazy prime stream implementation in Schism you referred to is really just an example of a compiler intrinsic, which I discussed in my answer already. — Hadi Brais, Feb 18 '18 at 21:51
I was under the impression, from that article, that this ["Schism"](http://web.archive.org/web/20010630015434/http://www.irisa.fr:80/lande/schism.html) was able to transform any given program that way, for *any* user function - just analyze the call site and replace it by pre-calculating as much of the result as possible. In terms of this Q&A, it didn't replace the function, but each call to it. a maximal "inlining". — Will Ness, Feb 18 '18 at 22:06
If the function is called many times, then the compile time constant folding is justified. Is there a list of what is exactly computed in compile time? For example I tried and know that a linear sum is constant folded: `for(int sum=0,i=0;i<100;i++) sum+=i;` — ar2015, Feb 18 '18 at 23:53
@ar2015 The compiler optimizes this loop because the compiler deemed that the number of iterations is too small and that its body is too small and simple as well, so a combination of loop unrolling and constant folding did the job. Go ahead and try `for(sum=0,i=0;i<1000000;i++) sum+=i;` and see what happens. The number of iterations is too large. The only way for the compiler to optimize this is to have an *intrinsic* understanding of the loop and use a mathematical formula for the sum instead of wasting time doing loop unrolling and constant folding. As another example, make n=1 in the code. — Hadi Brais, Feb 19 '18 at 00:28
@ar2015 I mean in the last version of the primality testing code. It becomes simple enough for the compiler to figure it out. As I said in my answer, ultimately, the compiler's job is not to execute your code, but to compile it. These are different objectives and the compiler draws a firm line between execution and compilation. — Hadi Brais, Feb 19 '18 at 00:32
@HadiBrais, Do you think prime numbers up to `18` are too much complicated for the compiler to pre-calculate? — ar2015, Feb 19 '18 at 00:43
@ar2015 To the compiler, yes. There are two nested loops, one conditional statement in the outer loop and another in the inner loop, the inner loop uses the remainder operator and a multiplication (in the loop header), and the outer loop has an unbounded number of iterations due to the `true` condition. That's a LOT of work. As long as the inner loop may iterate more than once, that's a lot of work of course. — Hadi Brais, Feb 19 '18 at 00:53

C++ constant folding a loop for prime numbers

1 Answers1

Linked