3

I've reduced my code down to the following, which is as simple as I could make it whilst retaining the compiler output that interests me.

void foo(const uint64_t used)
{
    uint64_t ar[100];
    for(int i = 0; i < 100; ++i)
    {
        ar[i] = some_global_array[i];
    }

    const uint64_t mask = ar[0];
    if((used & mask) != 0)
    {
        return;
    }

    bar(ar); // Not inlined
}

Using VC10 with /O2 and /Ob1, the generated assembly pretty much reflects the order of instructions in the above C++ code. Since the local array ar is only passed to bar() when the condition fails, and is otherwise unused, I would have expected the compiler to optimize to something like the following.

if((used & some_global_array[0]) != 0)
{
    return;
}

// Now do the copying to ar and call bar(ar)...

Is the compiler not doing this because it's simply too hard for it to identify such optimizations in the general case? Or is it following some strict rule that forbids it from doing so? If so, why, and is there some way I can give it a hint that doing so wouldn't change the semantics of my program?

Note: obviously it would be trivial to obtain the optimized output by just rearranging the code, but I'm interested in why the compiler won't optimize in such cases, not how to do so in this (intentionally simplified) case.

kamrann
  • 219
  • 4
  • 10

2 Answers2

3

Probably the reason why this is not getting optimized is the global array. The compiler can't know beforehand if, say, accessing some_global_array[99] will result in some kind of exception/signal being generated so it has to execute the whole loop. Things would be pretty different if the global array was statically defined in the same compilation unit.

For example, in LLVM, the following three definitions of the global array will yield wildly differing outputs of that function:

// this yields pretty much what you're seeing
uint64_t *some_global_array; 
// this calls memcpy and then performs the conditional check
uint64_t some_global_array[100] = {0};
// this calls memset (not memcpy!) on the ar array and then bar directly (no 
// conditional checks since the array is const and filled with 0s, so the if
// is always false) 
const uint64_t some_global_array[100] = {0};

The second is pretty puzzling, but it may simply be a missed optimization (or maybe I'm missing something else).

CAFxX
  • 28,060
  • 6
  • 41
  • 66
  • Right, I too thought the global was likely to be the problem, but still like you I was surprised that there was no optimization for the 2nd case, which is how I had the global defined. In fact VC10 still copies then checks the condition even with a const global. Anyway, I simplified further to a foo() accepting 3 extra const integer args a, b & c, putting them into a local array of size 3, entirely doing away with the global, then testing on used & a. It still assigned b & c before testing the condition. Any ideas? – kamrann Feb 25 '12 at 09:21
  • Jokes aside: http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-February/047746.html – CAFxX Feb 25 '12 at 11:18
  • Thanks, will keep tabs on that thread. A simple struct with 2 int members also yields the same results as an array. In such a case it's only potentially wasting a single asm instruction, but still I just fail to see why it wouldn't swap them over anyway. Regards the joke, yeah the list of reasons to do so has been growing quickly recently... – kamrann Feb 25 '12 at 11:55
1

There are no "strict rules" controlling what kind of assembly language the compiler is permitted to output. If the compiler can be certain that a block of code does not need to be executed (because it has no side effects) due to some precondition, then it is absolutely permitted to short-circuit the whole thing.

This sort of optimisation can be fairly complex in the general case, and your compiler might not go to all that effort. If this is performance critical code, then you can fine-tune your source code (as you suggest) to help the compiler generate the best assembly code. This is a trial-and-error process though, and you might have to do it again for the next version of the compiler.

Greg Hewgill
  • 951,095
  • 183
  • 1,149
  • 1,285
  • I think 'strict rule' was bad phrasing, I guess what I was really asking is, given that it is permitted to do whatever it wants so long as it can determine it isn't changing program logic, is there something specific in this case that is preventing it from assuming there are no side effects of assigning to a local array which one control path just discards without using? If the answer is just that it's difficult in the general case, fair enough, it would just be surprising to me since this seems like it would be one of the easiest things to detect. – kamrann Feb 25 '12 at 09:13
  • The only thing I can think of is the compiler cannot know if there will be possible interactions with other threads, especially with global arrays. – Greg Hewgill Feb 25 '12 at 09:25
  • So an even more simplified case still doesn't get optimized on any of the major compilers, and [according to the LLVM guys](http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-February/047750.html) it is simply too tough in the general case as you said. – kamrann Feb 27 '12 at 23:33