
This is a question that came to mind while reading the brilliant answer by Mysticial to the question: why is it faster to process a sorted array than an unsorted array?

Context for the types involved:

const unsigned arraySize = 32768;
int data[arraySize];
long long sum = 0;

In his answer he explains that the Intel Compiler (ICC) optimizes this:

for (int i = 0; i < 100000; ++i)
    for (int c = 0; c < arraySize; ++c)
        if (data[c] >= 128)
            sum += data[c];

...into something equivalent to this:

for (int c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        for (int i = 0; i < 100000; ++i)
            sum += data[c];

The optimizer is recognizing that these are equivalent and is therefore exchanging the loops, moving the branch outside the inner loop. Very clever!

But why doesn't it do this?

for (int c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        sum += 100000 * data[c];

Hopefully Mysticial (or anyone else) can give an equally brilliant answer. I've never learned about the optimizations discussed in that other question before, so I'm really grateful for this.

– jhabbott
  • Yeah, I was also wondering the same. Why didn't the compiler take it a step further? – Mysticial Jun 30 '12 at 17:56
  • That's something that probably only Intel knows. I don't know what order it runs its optimization passes. And apparently, it does not run a loop-collapsing pass after loop-interchange. – Mysticial Jun 30 '12 at 18:02
  • This optimization is only valid if the values contained in the data array are immutable. For instance, if they are [memory mapped](http://en.wikipedia.org/wiki/Memory-mapped_I/O) to an input/output device, each time you read data[0] it will produce a different value... – Thomas C. G. de Vilhena Jun 30 '12 at 18:02
  • What data type is this, integer or floating-point? Repeated addition in floating-point gives very different results from multiplication. – Ben Voigt Jun 30 '12 at 18:02
  • @BenVoigt They're integers. The OP took it straight out of the linked question. – Mysticial Jun 30 '12 at 18:03
  • Perhaps their optimization passes are in a (for this situation) unfortunate order. – harold Jun 30 '12 at 18:04
  • What happens if you access the array through a `restrict` pointer (C99)? – orlp Jun 30 '12 at 18:09
  • Maybe this is related to signed integer overflow being undefined. Could you retry with unsigned integer types? – ouah Jun 30 '12 at 18:14
  • Clang's (v3.1 from XCode) behaviour is funny: it checks if `arraySize == 0` and if so it runs `100000` empty iterations decrementing `%ecx`, all this at optimisation level `-O3` :) Go figure... – Hristo Iliev Jun 30 '12 at 18:19
  • @Thomas: If the data were `volatile`, then the loop interchange would be an invalid optimization as well. – Ben Voigt Jun 30 '12 at 18:26
  • @Mysticial - my apologies for spelling your handle incorrectly in the question and comments - thanks for correcting. – jhabbott Jun 30 '12 at 18:40
  • Funny, even when you feed ICC the interchanged code, it still doesn't collapse the inner loop into a multiply... Yeah, go figure... – Mysticial Jun 30 '12 at 18:50
  • I think `100000 * data[c]` should read `100000LL * data[c]`, otherwise you can get an int overflow when computing the intermediate result. – Paul Hankin Jun 30 '12 at 19:09
  • GNAT (Ada compiler with GCC 4.6) won't switch the loops at O3, but if the loops are switched, it will convert it into a multiplication. – prosfilaes May 21 '13 at 20:06

7 Answers


The compiler can't generally transform

for (int c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        for (int i = 0; i < 100000; ++i)
            sum += data[c];

into

for (int c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        sum += 100000 * data[c];

because the latter could lead to overflow of signed integers where the former doesn't. Even with guaranteed wrap-around behaviour for overflow of signed two's complement integers, it would change the result: if data[c] is 30000, the product becomes -1294967296 for typical 32-bit ints with wrap-around, while adding 30000 to sum 100000 times would, provided that doesn't overflow, increase sum by 3000000000. Note that the same holds for unsigned quantities, with different numbers: overflow of 100000 * data[c] would typically introduce a reduction modulo 2^32 that must not appear in the final result.
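To see the difference concretely, here is a minimal standalone sketch (my own illustration, not from the original answer; the value 30000 is taken from the example above). The multiplication is evaluated in int before any widening to long long, so it overflows where the repeated addition does not:

#include <iostream>

int main()
{
    const int value = 30000;
    long long sum = 0;

    // Repeated addition: each addend fits comfortably in int, and the
    // accumulation happens in the 64-bit sum, so nothing overflows.
    for (int i = 0; i < 100000; ++i)
        sum += value;
    std::cout << sum << '\n';            // 3000000000

    // Single multiplication: 100000 * value is computed in int first,
    // which overflows (undefined behaviour; on typical two's complement
    // hardware it wraps to -1294967296) before the widening to long long.
    long long product = 100000 * value;
    std::cout << product << '\n';
}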

It could transform it into

for (int c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        sum += 100000LL * data[c];  // resp. 100000ull

though, provided that long long is, as is usual, sufficiently larger than int.

Why it doesn't do that, I can't tell; I guess it's what Mysticial said: "apparently, it does not run a loop-collapsing pass after loop-interchange".

Note that the loop-interchange itself is not generally valid (for signed integers), since

for (int c = 0; c < arraySize; ++c)
    if (condition(data[c]))
        for (int i = 0; i < 100000; ++i)
            sum += data[c];

can lead to overflow where

for (int i = 0; i < 100000; ++i)
    for (int c = 0; c < arraySize; ++c)
        if (condition(data[c]))
            sum += data[c];

wouldn't. It's kosher here, since the condition ensures all data[c] that are added have the same sign, so if one overflows, both do.

I wouldn't be too sure that the compiler took that into account, though (@Mysticial, could you try with a condition like data[c] & 0x80 or so that can be true for positive and negative values?). I have had compilers make invalid optimisations (for example, a couple of years ago I had ICC (11.0, iirc) use a signed-32-bit-int-to-double conversion in 1.0/n where n was an unsigned int; it was about twice as fast as gcc's output, but wrong, since a lot of values were larger than 2^31, oops).
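To make the sign argument concrete, here is a hypothetical sketch (the function names are mine; the 0x80 condition is the one suggested above): with a sign-mixing condition, the interchanged c-outer order adds all 100000 copies of one element before moving on, so its partial sums can drift much further from zero than the i-outer order's ever would:

#include <cstdint>

// i-outer order: positive and negative elements alternate within each
// pass over the array, so the partial sums stay comparatively small.
int64_t sum_interleaved(const int32_t* data, int n)
{
    int64_t sum = 0;
    for (int i = 0; i < 100000; ++i)
        for (int c = 0; c < n; ++c)
            if (data[c] & 0x80)   // can be true for positive AND negative values
                sum += data[c];
    return sum;
}

// c-outer (interchanged) order: all 100000 copies of data[0] are added
// before data[1] is seen, so a long run of same-signed elements can
// overflow the partial sum even when the final result is small.
int64_t sum_interchanged(const int32_t* data, int n)
{
    int64_t sum = 0;
    for (int c = 0; c < n; ++c)
        if (data[c] & 0x80)
            for (int i = 0; i < 100000; ++i)
                sum += data[c];
    return sum;
}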

– Daniel Fischer
  • I remember a version of MPW compiler which added an option to allow stack frames larger than 32K [earlier versions were limited to using @A7+int16 addressing for local variables]. It got everything right for stack frames below 32K or over 64K, but for a 40K stack frame it would use `ADD.W A6,$A000`, forgetting that word operations with address registers sign-extend the word to 32 bits before the add. Took a while to troubleshoot, since the only thing the code did between that `ADD` and the next time it popped A6 off the stack was to restore the caller's registers it had saved to that frame... – supercat Apr 03 '13 at 16:46
  • ...and the only register the caller happened to care about was the [load-time constant] address of a static array. The compiler knew that the address of the array was saved in a register so it could optimize based upon that, but the debugger simply knew the address of a constant. Thus, before a statement `MyArray[0] = 4;` I could check the address of `MyArray`, and look at that location before and after the statement executed; it wouldn't change. Code was something like `move.B @A3,#4` and A3 was supposed to always point to `MyArray` any time that instruction executed, but it didn't. Fun. – supercat Apr 03 '13 at 16:51
  • Then why does clang perform this kind of optimization? – Jason S Dec 31 '19 at 16:14
  • The compiler could perform that rewrite in its internal intermediate representations, because it's allowed to have less undefined behaviour in its internal intermediate representations. – user253751 Jan 06 '20 at 17:35

This answer does not apply to the specific case linked, but it does apply to the question title and may be interesting to future readers:

Due to finite precision, repeated floating-point addition is not equivalent to multiplication. Consider:

#include <iostream>

int main()
{
    float const step = 1e-15f;
    float const init = 1;
    long int const count = 1000000000;

    // One billion additions: each step is far below float's precision
    // near 1, so result1 never changes.
    float result1 = init;
    for( int i = 0; i < count; ++i ) result1 += step;

    // One multiplication: step * count (about 1e-6) is large enough to
    // register, so result2 does change.
    float result2 = init;
    result2 += step * count;

    std::cout << (result1 - result2) << '\n';
}

Demo

– Ben Voigt
  • This is no answer to the asked question. Despite interesting information (and a must-know for any C/C++ programmer), this is no forum, and doesn't belong here. – orlp Jun 30 '12 at 18:11
  • @nightcracker: The stated goal of StackOverflow is to build a searchable library of answers useful to future users. And this is an answer to the question asked... it just so happens that there is some unstated information that makes this answer not apply for the original poster. It may still apply for others with the same question. – Ben Voigt Jun 30 '12 at 18:14
  • It _could_ be an answer to the question __title__, but not the question, no. – orlp Jun 30 '12 at 18:15
  • @nightcracker: The question didn't say whether `data` was an integral or floating-point type. A third party edited that in. – Ben Voigt Jun 30 '12 at 18:16
  • I think this answer is useful - I hadn't considered the more general cases to which the same type of optimization might apply. – jhabbott Jun 30 '12 at 18:17
  • @Ben Voigt: sure, it was edited in. But even then it was clear that the code came from the answer linked, and even then it was clear it consisted of integers (I checked that when I first read the question). – orlp Jun 30 '12 at 18:19
  • @nightcracker: There's nothing in the question, other than the code snippet, which specifies that the question is solely about integers in a certain range. This is a perfectly valid answer to the question. – Nathan Fellman Jun 30 '12 at 18:19
  • @nightcracker: And now jhabbott has said he did want to know about both cases (integral and float). – Ben Voigt Jun 30 '12 at 18:19
  • As I said, it is __interesting__ information. Yet it still seems wrong to me that, nota bene, the __top answer__ to the question __does not answer the question as it stands now__. This simply is not the reason the Intel Compiler decided not to optimize, basta. – orlp Jun 30 '12 at 18:21
  • @nightcracker: It seems wrong to me too that this is the top answer. I'm hoping someone posts a really good answer for the integer case that surpasses this one in score. Unfortunately, I don't think there is an answer for "can't" for the integer case, because the transformation would be legal, so we're left with "why it doesn't", which actually falls afoul of the "too localized" close reason, because it's peculiar to a particular compiler version. The question I answered is the more important one, IMO. – Ben Voigt Jun 30 '12 at 18:25
  • Perhaps the comments by Mysticial and harold are the answer - it doesn't repeat optimization passes to see if they now apply due to the results of other passes? I actually now think my question may be based on a false assumption (due to me not knowing much about optimization) that there might be a good reason for not doing this... but other comments suggest 'loop collapsing' *is* something that optimizers sometimes do. – jhabbott Jun 30 '12 at 18:33

The compiler contains various passes that perform the optimizations. Usually, each pass does either optimizations on statements or loop optimizations. At present there is no pass that optimizes the loop body based on the loop header; such a case is hard to detect and is less common.

The optimization that was performed here is loop-invariant code motion, which can be done using a set of techniques.
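As a minimal sketch of loop-invariant code motion (example mine, not from the answer): the product a * b does not depend on the loop variable, so the compiler hoists it out of the loop:

// Before: a * b is recomputed on every iteration even though it never changes.
void add_scaled(const int* in, int* out, int n, int a, int b)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] + a * b;
}

// After: what the optimizer effectively produces, with the invariant
// expression computed once, above the loop.
void add_scaled_hoisted(const int* in, int* out, int n, int a, int b)
{
    const int t = a * b;
    for (int i = 0; i < n; ++i)
        out[i] = in[i] + t;
}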

– knightrider

It does now -- at least, clang does:

long long add_100k_signed(int *data, int arraySize)
{
    long long sum = 0;

    for (int c = 0; c < arraySize; ++c)
        if (data[c] >= 128)
            for (int i = 0; i < 100000; ++i)
                sum += data[c];
    return sum;
}

compiles with -O1 to

add_100k_signed:                        # @add_100k_signed
        test    esi, esi
        jle     .LBB0_1
        mov     r9d, esi
        xor     r8d, r8d
        xor     esi, esi
        xor     eax, eax
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
        movsxd  rdx, dword ptr [rdi + 4*rsi]
        imul    rcx, rdx, 100000
        cmp     rdx, 127
        cmovle  rcx, r8
        add     rax, rcx
        add     rsi, 1
        cmp     r9, rsi
        jne     .LBB0_4
        ret
.LBB0_1:
        xor     eax, eax
        ret

Integer overflow has nothing to do with it; if there is integer overflow that causes undefined behavior, it can happen in either case. Here's the same kind of function using int instead of long long:

int add_100k_signed(int *data, int arraySize)
{
    int sum = 0;

    for (int c = 0; c < arraySize; ++c)
        if (data[c] >= 128)
            for (int i = 0; i < 100000; ++i)
                sum += data[c];
    return sum;
}

compiles with -O1 to

add_100k_signed:                        # @add_100k_signed
        test    esi, esi
        jle     .LBB0_1
        mov     r9d, esi
        xor     r8d, r8d
        xor     esi, esi
        xor     eax, eax
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
        mov     edx, dword ptr [rdi + 4*rsi]
        imul    ecx, edx, 100000
        cmp     edx, 127
        cmovle  ecx, r8d
        add     eax, ecx
        add     rsi, 1
        cmp     r9, rsi
        jne     .LBB0_4
        ret
.LBB0_1:
        xor     eax, eax
        ret
– Jason S

Well, I'd guess that some compilers might do this sort of optimization, assuming that we are talking about integer arithmetic.

At the same time, some compilers might refuse to do it because replacing repetitive addition with multiplication might change the overflow behavior of the code. For unsigned integer types, it shouldn't make a difference, since their overflow behavior is fully specified by the language. But for signed ones it might (probably not on a 2's complement platform, though). It is true that signed overflow actually leads to undefined behavior in C, meaning that it should be perfectly OK to ignore that overflow semantics altogether, but not all compilers are brave enough to do that. It often draws a lot of criticism from the "C is just a higher-level assembly language" crowd. (Remember what happened when GCC introduced optimizations based on strict-aliasing semantics?)
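A minimal sketch of that difference (my own illustration, not part of this answer): unsigned arithmetic wraps modulo 2^32 by definition, so rewriting repeated addition as multiplication is always sound there, while a signed product must be widened first to stay defined:

#include <cstdint>
#include <iostream>

int main()
{
    // Unsigned overflow is fully specified: the result is reduced
    // modulo 2^32, so repeated addition and multiplication must agree.
    uint32_t u = 3000000000u;
    std::cout << u * 2u << '\n';             // well defined: 1705032704

    // Signed overflow is undefined behaviour; widening to 64 bits
    // before multiplying keeps the computation defined.
    int32_t s = 2000000000;
    std::cout << int64_t{s} * 2 << '\n';     // 4000000000, no overflow
}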

Historically, GCC has shown itself as a compiler that has what it takes to take such drastic steps, but other compilers might prefer to stick with the perceived "user-intended" behavior even if it is undefined by the language.

– AnT stands with Russia
  • I'd prefer to know if I'm accidentally depending on undefined behaviour, but I guess the compiler has no way to know, as the overflow would be a run-time issue :/ – jhabbott Jun 30 '12 at 18:14
  • @jhabbott: __iff__ the overflow occurs, then there is undefined behaviour. Whether the behavior is defined is unknown until runtime (assuming the numbers are input at runtime) :P. – orlp Jun 30 '12 at 18:25

There's a conceptual barrier to this kind of optimization. Compiler authors spend a lot of effort on strength reduction -- for instance, replacing multiplications with adds and shifts. They get used to thinking that multiplies are bad, so a case where one ought to go the other way is surprising and counterintuitive, and nobody thinks to implement it.
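For instance, here is a minimal sketch of the strength reduction compiler authors are used to (example mine): a multiply by a small constant becomes a shift and an add, and the surprise in this question is that the profitable rewrite runs in the opposite direction:

// What the programmer writes: a multiplication by a small constant.
int times9(int x)
{
    return x * 9;
}

// What strength reduction may turn it into: 8*x + x. (Shown for
// non-negative x; left-shifting negative values is undefined behaviour
// before C++20, though compilers do this safely at the machine level.)
int times9_reduced(int x)
{
    return (x << 3) + x;
}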

– zwol
  • Replacing a loop with a closed-form calculation is also strength reduction, isn't it? – Ben Voigt Jun 30 '12 at 18:21
  • Formally, yes, I suppose, but I've never heard anyone talk about it that way. (I'm a bit out of date on the literature, though.) – zwol Jun 30 '12 at 18:22

The people who develop and maintain compilers have a limited amount of time and energy to spend on their work, so they generally want to focus on what their users care about the most: turning well-written code into fast code. They don't want to spend their time trying to find ways to turn silly code into fast code—that's what code review is for. In a high-level language, there may be "silly" code that expresses an important idea, making it worth the developers' time to make that fast—for example, short cut deforestation and stream fusion allow Haskell programs structured around certain kinds of lazily produced data structures to be compiled into tight loops that don't allocate memory. But that sort of incentive simply does not apply to turning looped addition into multiplication. If you want it to be fast, just write it with multiplication.

– dfeuer