
I was playing around with optimizing C programs and used some code that (naively) calculates primes:

#include <stdio.h>

int is_prime(int x) {
    int divisor = 2;

    if(x <= 1)
        return(0);
    if(x == 2)
        return(1);
    while(divisor * divisor <= x) {      /* trial division: test divisors up to sqrt(x) */
        if(x % divisor ==  0)
            return(0);
        divisor++;
    }
    return(1);
}

int main(void) {
    for(int i = 0; i <= 10000000; i++)
        if(is_prime(i))
            printf("%d is a prime number.\n", i);
}

I thought I could boost the performance a little bit by using a do...while loop instead of the while loop, so that for an even x the loop condition is never evaluated (the body returns on the very first division test). So I changed the is_prime() function to this:

int is_prime(int x) {
    int divisor = 2;

    if(x <= 1)
        return(0);
    if(x == 2)
        return(1);
    do {                                 /* body runs before the condition, so for even x the condition is never evaluated */
        if(x % divisor ==  0)
            return(0);
        divisor++;
    } while(divisor * divisor <= x);
    return(1);
}

When I compile both versions without optimizations (gcc main.c -o main), time ./main > /dev/null takes about 5.5s for both, with the do...while version looking a tiny bit faster (maybe ~40ms).

When I optimize (gcc main.c -O3 -o main), I get a clear difference between the two: time ./main > /dev/null gives me ~5.4s for the do...while version and ~4.9s for the while version.
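
For what it's worth, here is a minimal sketch of how the loop could also be timed in-process with clock(), counting primes instead of printing them so that the printf/IO cost is excluded (the numbers above come from time, not from this):

#include <stdio.h>
#include <time.h>

int is_prime(int x);                     /* link against either version above */

int main(void) {
    int count = 0;
    clock_t start = clock();
    for(int i = 0; i <= 10000000; i++)
        if(is_prime(i))
            count++;                     /* count instead of printing, so IO doesn't dominate */
    clock_t end = clock();
    printf("%d primes, %.2f s of CPU time\n",
           count, (double)(end - start) / CLOCKS_PER_SEC);
}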

How can this be explained? Is gcc just not as good at optimizing do...while loops? Should I therefore always use while loops instead?


My CPU is an Intel i5-4300M.

My gcc version is gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0.

I have now tested the clang compiler (clang version 6.0.0-1ubuntu2 (tags/RELEASE_600/final); clang main.c -O3 -o main) as well. Here the results are as expected: time ./main > /dev/null takes about 6.5s for the do...while version and 6.8s for the while version.

J. Doe
  • Just checking, are these results repeatable? E.g. if you run each version of the program 10 times, do you find that the time for the `while` version is _consistently_ less than the time for the `do...while` version? – David Z Oct 27 '18 at 11:32
  • Yes, they are repeatable. The fluctuation in the results is <50ms. – J. Doe Oct 27 '18 at 11:35
  • OK thanks J. Doe. Incidentally I tried it myself and while I also find consistent results, they're the opposite of yours: for me the `do-while` version is about 7% faster. Of course this could be compiler-dependent (so it would help if you include your compiler version) but is there any chance you might have mixed up two versions of the compiled program? It's not likely, I know, but if you were accidentally running the unoptimized version to get the timing results for the optimized `do-while` version, then everything would make sense. Just trying to rule out the silly stuff. – David Z Oct 27 '18 at 11:37
  • I have double checked it again, but I did not switch them up. I'll update the question with my gcc version, thanks for the hint. – J. Doe Oct 27 '18 at 11:40
  • I [compiled both functions on godbolt](https://godbolt.org/z/a0-ned). Now interpreting the machine language is out of my league :) – pmg Oct 27 '18 at 11:46
  • Nice tool, @pmg, but unfortunately I'm not able to interpret the assembler instructions either. – J. Doe Oct 27 '18 at 11:57
  • you can make it faster with `divisor += 2` to skip even divisors, or even [`laststep = 6 - laststep; divisor += laststep;`](https://en.wikipedia.org/wiki/Primality_test#Pseudocode), since a prime divided by 6 leaves a remainder of 1 or 5 (see the first sketch after the comments) – phuclv Oct 27 '18 at 15:56
  • Related: [Checking if a number is prime in NASM Win64 Assembly](https://codereview.stackexchange.com/q/204902). Looking at the commented version in my answer might help you understand the compiler output here. Also, a loop-exit condition based on `n/c > c` is an alternative to actually doing a multiply (see the second sketch after the comments). The multiply is basically free in parallel with division on modern out-of-order execution x86 CPUs, though. – Peter Cordes Oct 28 '18 at 16:11
  • Possible duplicate [Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs](https://stackoverflow.com/q/26907523). – Peter Cordes Oct 28 '18 at 16:16
  • Yup, you have a Haswell CPU, so the differences are probably just due to chance of code alignment creating front-end bottlenecks. Marking as a duplicate; we can reopen if you find that adding padding to the asm before assembling the compiler-generated asm doesn't change performance in ways that match the results in that question. (Unfortunately compilers don't optimize for modern CPU architectures in enough detail to avoid minor bottlenecks like this.) – Peter Cordes Oct 28 '18 at 17:55
  • Did anybody notice that the do...while one executes 0.5% fewer multiplies and 0.04% fewer mods than the while() {} one, because the divisor is incremented before the loop condition rather than after? I rewrote it in "bc" and the "do...while" one was quicker there too :) – mevets Oct 30 '18 at 04:25
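
Following up on phuclv's comment about skipping even divisors, here is a rough sketch of what a 6k±1 variant of is_prime() could look like (just one way to realize that suggestion, not the code that was benchmarked above):

int is_prime(int x) {
    if(x <= 1)
        return(0);
    if(x <= 3)
        return(1);                       /* 2 and 3 are prime */
    if(x % 2 == 0 || x % 3 == 0)
        return(0);
    /* all remaining candidate divisors have the form 6k-1 or 6k+1 */
    for(int divisor = 5; divisor * divisor <= x; divisor += 6)
        if(x % divisor == 0 || x % (divisor + 2) == 0)
            return(0);
    return(1);
}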
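
And a sketch of the division-based loop exit that Peter Cordes mentions: the `divisor * divisor <= x` test is replaced by a comparison against `x / divisor`, and the compiler can typically reuse the same division it already performs for `x % divisor` (again just an illustration, not the code that was benchmarked):

int is_prime(int x) {
    if(x <= 1)
        return(0);
    if(x == 2)
        return(1);
    for(int divisor = 2; ; divisor++) {
        if(x / divisor < divisor)        /* same as divisor * divisor > x, but without the multiply */
            return(1);
        if(x % divisor == 0)
            return(0);
    }
}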

0 Answers