Why code with multiple nested loops can finish immediately on GCC but take forever on VS?

Question

long long r = 0;
long long k = 0;
for (; k < 9999999999999; k++) 
{
    for (long long i = 0; i < 9999999999999; i++) 
    {
        for (long long j = 0; j < 9999999999999; j++) 
        {
            r = (r + (i * j) % 100) % 47;
            if (r != 0) 
            {
                r++;
            }
        }
    }
 }

This code executes on i3Core in 0.000001 wall seconds, checked with boost::timer::auto_cpu_timer on i7Core.

But with visual studio 2010 it seems to run in infinite time.

What is wrong with GCC or VS ? Is GCC optimizing too much?

is `r` being ever used? It's probably discarded together with all loops. See assembly. If you want to ensure loop execution, print `r` — Anycorn, Mar 13 '12 at 06:48
Note, the term "super optimization" is typically used for compilers that generate all possible combinations of something and pick the most efficient one. — Lindydancer, Mar 13 '12 at 07:47

score 15 · Accepted Answer · edited May 23 '17 at 12:17

15

Yes, GCC is optimizing that code.

Specifically, it knows that you aren't using the result, so it's removing all of it.
(You're never using the variable r.)

This is called Dead Code Elimination.

To prevent the compiler from optimizing it out, you'll need to use the result somehow. Try printing r out at the end:

cout << r << endl;

However, I warn that you'll need to reduce the iteration counts, or it probably won't finish in your lifetime.

I just tested this in VS2010 x64. Looking at the assembly, it is clear that VS2010 is not able to optimize out the entire loop.

It goes to show that different compilers vary in their ability to optimize different things.

Related, and more in-depth: How does GCC optimize out an unused variable incremented inside a loop?

edited May 23 '17 at 12:17

Community

1
1

answered Mar 13 '12 at 06:47

Mysticial

464,885
45
335
332

3

Visual Studio not being able to optimize this completely is actually a bit scary... I'll test with VS11 Beta and report back. – Xeo Mar 13 '12 at 07:11
Assigning to a `volatile` is also an option if you don't wish to print anything. – edA-qa mort-ora-y Mar 13 '12 at 07:12
1

@edA-qamort-ora-y That would work, though it would also block the compiler from doing mem-to-register promotion - which would probably nuke performance. – Mysticial Mar 13 '12 at 07:14
1

@Mystical, I just mean the final result -- yes, if used in the calculation you'd kill performance. Declare a global `volatile long long junk;` and then assign r to it `junk = r;`. – edA-qa mort-ora-y Mar 13 '12 at 07:18
2

This is kinda depressing. Even after removing the outer two loops VS11 beta is not able to optimize the code out. – Xeo Mar 13 '12 at 07:19
@edA-qamort-ora-y Ah, ic. I didn't think about it that way. :-) – Mysticial Mar 13 '12 at 07:20
2

@Xeo Unless you're writing micro-benchmarks like this, it's very uncommon in actual code to have this much dead-code in such constructs. So I'd probably assume MS didn't find it worth it to implement the dependency analysis needed to prove that this loop is useless. – Mysticial Mar 13 '12 at 07:22
@Xeo: Clang 3.0 seems to only be able to remove one layer of loop, leaving the `i` and `j` loop it does not realizes that `r` is unused either :x – Matthieu M. Mar 13 '12 at 07:45
@Xeo: the problem for MSVC is the branch (the `if`), removing that leads to the loop being fully folded on my system (x64), for 32bit targets, the type of `i` & `j` needs to also be shortened to `long` for it to be folded (I assume this is cause of MSVC's `_all*` functions screwing things up) – Necrolis Mar 13 '12 at 08:03
Thanks for the answers, yes, with cout << r this code makes my computer cry =) – Denis Ermolin Mar 13 '12 at 08:19
An obvious question: you're invoking VS with optimization flags, yes? – Keith Thompson Apr 16 '14 at 17:42
@KeithThompson It's been a long time. But I usually do these things in the default `/O2`. – Mysticial Apr 16 '14 at 17:43
According to [Microsoft's documentation](http://msdn.microsoft.com/en-us/library/k1ack8f1.aspx), `/O2` "optimizes code for maximum speed", and there is no `/O3`. But there' also a `/Ox`, which "selects full optimization", whatever that means. I hesitate to assume that MSVS *can't* eliminate the dead code, but it's likely you're right. – Keith Thompson Apr 16 '14 at 17:48
@KeithThompson It can definitely do dead-code. But this case is probably too complicated for VS2010. Normal DCE will probably miss this case since `r` references itself. But ADCE will get it. ADCE is "aggressive DCE" which assumes that all variables are dead until proven alive. I can't say whether VS2010 does DCE, but not ADCE, but it's certainly a possible explanation. – Mysticial Apr 16 '14 at 17:56
Yes, by "the dead code" I was specifically referring to this example (I could have been clearer about that). – Keith Thompson Apr 16 '14 at 17:57

score 1 · Answer 2 · edited May 23 '17 at 12:02

You are running undefined behaviour

Your code has undefined behavior at i * j on most platforms because:

9999999999999 * 9999999999999 > 2^64

and long long is 2^64 on most platforms as there is simply too little hardware support for 128-bit integers.

Thanks to GCC -O3 for pointing this out in a warning: it can suppose that UB does not happen and run the loop less times. See also: Why does this loop produce "warning: iteration 3u invokes undefined behavior" and output more than 4 lines?

So anything can happen.

Let's decompile it to see what is really happening

GCC 4.8:

gcc -c -g -std=c99 -O0 a.c
objudmp -S a.o

Output:

a.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <main>:
int main() {
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
    long long r = 0;
   4:   48 c7 45 e0 00 00 00    movq   $0x0,-0x20(%rbp)
   b:   00 
    long long k = 0;
   c:   48 c7 45 e8 00 00 00    movq   $0x0,-0x18(%rbp)
  13:   00 
    for (; k < 9999999999999; k++) 
  14:   e9 f8 00 00 00          jmpq   111 <main+0x111>
    {
        for (long long i = 0; i < 9999999999999; i++) 
  19:   48 c7 45 f0 00 00 00    movq   $0x0,-0x10(%rbp)
  20:   00 
  21:   e9 d2 00 00 00          jmpq   f8 <main+0xf8>
        {
            for (long long j = 0; j < 9999999999999; j++) 
  26:   48 c7 45 f8 00 00 00    movq   $0x0,-0x8(%rbp)
  2d:   00 
  2e:   e9 ac 00 00 00          jmpq   df <main+0xdf>
            {
                r = (r + (i * j) % 100) % 47;
  33:   48 8b 45 f0             mov    -0x10(%rbp),%rax
  37:   48 0f af 45 f8          imul   -0x8(%rbp),%rax
  3c:   48 89 c1                mov    %rax,%rcx
  3f:   48 ba 0b d7 a3 70 3d    movabs $0xa3d70a3d70a3d70b,%rdx
  46:   0a d7 a3 
  49:   48 89 c8                mov    %rcx,%rax
  4c:   48 f7 ea                imul   %rdx
  4f:   48 8d 04 0a             lea    (%rdx,%rcx,1),%rax
  53:   48 c1 f8 06             sar    $0x6,%rax
  57:   48 89 c2                mov    %rax,%rdx
  5a:   48 89 c8                mov    %rcx,%rax
  5d:   48 c1 f8 3f             sar    $0x3f,%rax
  61:   48 29 c2                sub    %rax,%rdx
  64:   48 89 d0                mov    %rdx,%rax
  67:   48 c1 e0 02             shl    $0x2,%rax
  6b:   48 01 d0                add    %rdx,%rax
  6e:   48 8d 14 85 00 00 00    lea    0x0(,%rax,4),%rdx
  75:   00 
  76:   48 01 d0                add    %rdx,%rax
  79:   48 c1 e0 02             shl    $0x2,%rax
  7d:   48 29 c1                sub    %rax,%rcx
  80:   48 89 ca                mov    %rcx,%rdx
  83:   48 8b 45 e0             mov    -0x20(%rbp),%rax
  87:   48 8d 0c 02             lea    (%rdx,%rax,1),%rcx
  8b:   48 ba 99 5c 41 4c ae    movabs $0x572620ae4c415c99,%rdx
  92:   20 26 57 
  95:   48 89 c8                mov    %rcx,%rax
  98:   48 f7 ea                imul   %rdx
  9b:   48 c1 fa 04             sar    $0x4,%rdx
  9f:   48 89 c8                mov    %rcx,%rax
  a2:   48 c1 f8 3f             sar    $0x3f,%rax
  a6:   48 29 c2                sub    %rax,%rdx
  a9:   48 89 d0                mov    %rdx,%rax
  ac:   48 89 45 e0             mov    %rax,-0x20(%rbp)
  b0:   48 8b 55 e0             mov    -0x20(%rbp),%rdx
  b4:   48 89 d0                mov    %rdx,%rax
  b7:   48 01 c0                add    %rax,%rax
  ba:   48 01 d0                add    %rdx,%rax
  bd:   48 c1 e0 04             shl    $0x4,%rax
  c1:   48 29 d0                sub    %rdx,%rax
  c4:   48 29 c1                sub    %rax,%rcx
  c7:   48 89 c8                mov    %rcx,%rax
  ca:   48 89 45 e0             mov    %rax,-0x20(%rbp)
                if (r != 0) 
  ce:   48 83 7d e0 00          cmpq   $0x0,-0x20(%rbp)
  d3:   74 05                   je     da <main+0xda>
                {
                    r++;
  d5:   48 83 45 e0 01          addq   $0x1,-0x20(%rbp)
    long long k = 0;
    for (; k < 9999999999999; k++) 
    {
        for (long long i = 0; i < 9999999999999; i++) 
        {
            for (long long j = 0; j < 9999999999999; j++) 
  da:   48 83 45 f8 01          addq   $0x1,-0x8(%rbp)
  df:   48 b8 fe 9f 72 4e 18    movabs $0x9184e729ffe,%rax
  e6:   09 00 00 
  e9:   48 39 45 f8             cmp    %rax,-0x8(%rbp)
  ed:   0f 8e 40 ff ff ff       jle    33 <main+0x33>
int main() {
    long long r = 0;
    long long k = 0;
    for (; k < 9999999999999; k++) 
    {
        for (long long i = 0; i < 9999999999999; i++) 
  f3:   48 83 45 f0 01          addq   $0x1,-0x10(%rbp)
  f8:   48 b8 fe 9f 72 4e 18    movabs $0x9184e729ffe,%rax
  ff:   09 00 00 
 102:   48 39 45 f0             cmp    %rax,-0x10(%rbp)
 106:   0f 8e 1a ff ff ff       jle    26 <main+0x26>
int main() {
    long long r = 0;
    long long k = 0;
    for (; k < 9999999999999; k++) 
 10c:   48 83 45 e8 01          addq   $0x1,-0x18(%rbp)
 111:   48 b8 fe 9f 72 4e 18    movabs $0x9184e729ffe,%rax
 118:   09 00 00 
 11b:   48 39 45 e8             cmp    %rax,-0x18(%rbp)
 11f:   0f 8e f4 fe ff ff       jle    19 <main+0x19>
 125:   b8 00 00 00 00          mov    $0x0,%eax
                    r++;
                }
            }
        }
    }
}
 12a:   5d                      pop    %rbp
 12b:   c3                      retq

so all the loops seems present.

With -O3:

0000000000000000 <main>:
   0:   31 c0                   xor    %eax,%eax
   2:   c3                      retq

so it was clearly optimized away. Nothing in the C standard prevents that optimization, so GCC does it.

If we add a printf("%lld", r); at the end as mentioned by Mystical, it does take forever even with -O3, and the generated optimized code is almost the same as the non optimized one.

For infinite loops it gets more interesting: Are compilers allowed to eliminate infinite loops?

How to prevent if from happening: How to prevent GCC from optimizing out a busy wait loop?

Why code with multiple nested loops can finish immediately on GCC but take forever on VS?

2 Answers2