
The C Standard explicitly specifies signed integer overflow as having undefined behavior. Yet most CPUs implement signed arithmetic with defined semantics for overflow (except perhaps for division overflow: x / 0 and INT_MIN / -1).
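For illustration (a sketch of the hardware behavior, not part of the original argument): on x86, signed addition simply wraps at run time, while the idiv instruction raises a hardware fault for both division cases, usually delivered as SIGFPE, so the program below crashes at the division:

#include <climits>

int main()
{
    // volatile forces the operations to actually happen at run time on the CPU
    volatile int big = INT_MAX, min = INT_MIN, minus1 = -1;
    int wraps = big + 1;      // UB per the Standard, but the x86 add just wraps
    int traps = min / minus1; // x86 faults here: the quotient 2^31 is unrepresentable
    return wraps == traps;
}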

Compiler writers have been taking advantage of the undefinedness of such overflows to add more aggressive optimisations that tend to break legacy code in very subtle ways. For example, this code may have worked on older compilers but no longer does on current versions of gcc and clang:

/* Increment a by a value in 0..255, clamp a to positive integers.
   The code relies on 32-bit wrap-around, but the C Standard makes
   signed integer overflow undefined behavior, so sum_max can now 
   return values less than a. There are Standard-compliant ways to
   implement this, but legacy code is what it is... */
int sum_max(int a, unsigned char b) {
    int res = a + b;
    return (res >= a) ? res : INT_MAX;
}

Is there hard evidence that these optimisations are worthwhile? Are there comparative studies documenting the actual improvements on real-life examples or even on classical benchmarks?

I came up with this question as I was watching this: C++Now 2018: John Regehr “Closing Keynote: Undefined Behavior and Compiler Optimizations”

I am tagging c and c++ as the problem is similar in both languages but the answers might be different.

chqrlie
  • The reason C says signed integer overflow is undefined is that some CPUs use "2's complement", some use "1's complement", some use "sign and magnitude"; and in all the cases overflow could cause anything (e.g. CPUs like MIPS have "trap on overflow"). In other words it's about portability and not optimisation. – Brendan May 08 '19 at 23:53
  • Exactly. The only 'meaningful statistic' anybody needs is that ones-complement and sign-magnitude computers exist. – user207421 May 09 '19 at 01:50
  • @user207421: Yes, that's a good question, to which the answer seems to be *no longer*. Hence the current proposal to remove support for non two's complement representations: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2330.pdf – chqrlie May 09 '19 at 05:16
  • @Brendan: ones' complement architectures are a thing of the past. MIPS' trap on overflow is selectable. – chqrlie May 09 '19 at 05:19
  • @chqrlie: C is also a thing from the past (47 years old now). There are a lot of design decisions that made sense back then, don't make sense anymore, but continue to exist because change would break too much existing software. – Brendan May 09 '19 at 20:41

4 Answers


I don't know about studies and statistics, but yes, there are definitely optimizations that take this into account and that compilers actually perform. And yes, they are very important (TL;DR: loop vectorization, for example).

Besides the compiler optimizations, there is another aspect to take into account. With signed overflow being UB, C/C++ signed integers behave arithmetically as you would expect mathematically. For instance, x + 10 > x holds true now (for valid code of course), but would not under wrap-around semantics.

I've found an excellent article, How undefined signed overflow enables optimizations in GCC from Krister Walfridsson’s blog, listing some optimizations that take signed overflow UB into account. The following examples are from it; I am adding C++ and assembly examples to them.

If the optimizations look too simple, uninteresting or unimpactful, remember that these optimizations are just steps in a much, much larger chain of optimizations. And the butterfly effect does happen, as a seemingly unimportant optimization at an earlier step can trigger a much more impactful optimization at a later step.

If the examples look nonsensical (who would write x * 10 > 0?) keep in mind that you can very easily arrive at this kind of example in C and C++ with constants, macros and templates. Besides, the compiler can arrive at this kind of example when applying transformations and optimizations in its IR.

Signed integer expression simplification

  • Eliminate multiplication in comparison with 0

    (x * c) cmp 0   ->   x cmp 0 
    
    bool foo(int x) { return x * 10 > 0; }
    
    foo(int):
            test    edi, edi
            setg    al
            ret
    
  • Eliminate division after multiplication

    (x * c1) / c2 -> x * (c1 / c2) if c1 is divisible by c2

    int foo(int x) { return (x * 20) / 10; }
    
    foo(int):
            lea     eax, [rdi+rdi]
            ret
    
  • Eliminate negation

    (-x) / (-y) -> x / y

    int foo(int x, int y) { return (-x) / (-y); }
    
    foo(int, int):
            mov     eax, edi
            cdq
            idiv    esi
            ret
    
  • Simplify comparisons that are always true or false

    x + c < x       ->   false
    x + c <= x      ->   false
    x + c > x       ->   true
    x + c >= x      ->   true
    
    bool foo(int x) { return x + 10 >= x; }
    
    foo(int):
            mov     eax, 1
            ret
    
  • Eliminate negation in comparisons

    (-x) cmp (-y)   ->   y cmp x
    
    bool foo(int x, int y) { return -x < -y; }
    
    foo(int, int):
            cmp     edi, esi
            setg    al
            ret
    
  • Reduce magnitude of constants

    x + c > y       ->   x + (c - 1) >= y
    x + c <= y      ->   x + (c - 1) < y
    
    bool foo(int x, int y) { return x + 10 <= y; }
    
    foo(int, int):
            add     edi, 9
            cmp     edi, esi
            setl    al
            ret
    
  • Eliminate constants in comparisons

    (x + c1) cmp c2         ->   x cmp (c2 - c1)
    (x + c1) cmp (y + c2)   ->   x cmp (y + (c2 - c1)) if c1 <= c2
    

    The second transformation is only valid if c1 <= c2, as it would otherwise introduce an overflow when y has the value INT_MIN.

    bool foo(int x) { return x + 42 <= 11; }
    
    foo(int):
            cmp     edi, -30
            setl    al
            ret
    
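A sketch of the two-variable form (my example, not from the blog): here c1 = 5 and c2 = 10, so the condition c1 <= c2 holds and the constants can be folded onto one side:

    // May be compiled as if written `x <= y + 5`: one add and one compare,
    // since x + 5 cannot overflow in any execution where the original was valid.
    bool foo2(int x, int y) { return x + 5 <= y + 10; }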

Pointer arithmetic and type promotion

If an operation does not overflow, then we will get the same result if we do the operation in a wider type. This is often useful when doing things like array indexing on 64-bit architectures — the index calculations are typically done using 32-bit int, but the pointers are 64-bit, and the compiler may generate more efficient code when signed overflow is undefined by promoting the 32-bit integers to 64-bit operations instead of generating type extensions.

One other aspect of this is that undefined overflow ensures that a[i] and a[i+1] are adjacent. This improves analysis of memory accesses for vectorization etc.
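A small sketch of both points (pair_sum is my example, not from the blog): with a signed 32-bit index the compiler can sign-extend i once and knows the two loads are adjacent, because i + 1 cannot wrap:

// With int i, &v[i] and &v[i + 1] are provably 4 bytes apart: one sign
// extension of i serves both loads, and the adjacency helps vectorization.
// With unsigned i, i + 1 may wrap to 0, so the accesses need not be adjacent
// and the index arithmetic must be redone modulo 2^32.
int pair_sum(const int* v, int i)
{
    return v[i] + v[i + 1];
}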

This is a very important optimization, as loop vectorization is one of the most efficient and effective optimizations.

Here is an example where changing the index from unsigned to signed improves the generated assembly:

Unsigned version

#include <cstddef>

auto foo(int* v, std::size_t start)
{
    int sum = 0;

    for (std::size_t i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}

With an unsigned index, the case where start + 4 wraps around must be taken into account, and a branch is generated to deal with it (branches are bad for performance):

; gcc on x64 with -march=skylake

foo1(int*, unsigned long):
        cmp     rsi, -5
        ja      .L3
        vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
        vpsrldq xmm1, xmm0, 8
        vpaddd  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret
.L3:
        xor     eax, eax
        ret
; clang on x64 with -march=skylake

foo1(int*, unsigned long):                             # @foo1(int*, unsigned long)
        xor     eax, eax
        cmp     rsi, -4
        jae     .LBB0_2
        vpbroadcastq    xmm0, qword ptr [rdi + 4*rsi + 8]
        vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rsi]
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
.LBB0_2:
        ret

As a side note, using a narrower unsigned type results in even worse assembly, inhibiting the use of SSE vectorized instructions:

#include <cstddef>

auto foo(int* v, unsigned start)
{
    int sum = 0;

    for (unsigned i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}
; gcc on x64 with -march=skylake

foo(int*, unsigned int):
        cmp     esi, -5
        ja      .L3
        mov     eax, esi
        mov     eax, DWORD PTR [rdi+rax*4]
        lea     edx, [rsi+1]
        add     eax, DWORD PTR [rdi+rdx*4]
        lea     edx, [rsi+2]
        add     eax, DWORD PTR [rdi+rdx*4]
        lea     edx, [rsi+3]
        add     eax, DWORD PTR [rdi+rdx*4]
        ret
.L3:
        xor     eax, eax
        ret
; clang on x64 with -march=skylake

foo(int*, unsigned int):                              # @foo(int*, unsigned int)
        xor     eax, eax
        cmp     esi, -5
        ja      .LBB0_3
        mov     ecx, esi
        add     esi, 4
        mov     eax, dword ptr [rdi + 4*rcx]
        lea     rdx, [rcx + 1]
        cmp     rdx, rsi
        jae     .LBB0_3
        add     eax, dword ptr [rdi + 4*rcx + 4]
        add     eax, dword ptr [rdi + 4*rcx + 8]
        add     eax, dword ptr [rdi + 4*rcx + 12]
.LBB0_3:
        ret

Signed version

Using a signed index however results in nice vectorized branchless code:

#include <cstddef>

auto foo(int* v, std::ptrdiff_t start)
{
    int sum = 0;

    for (std::ptrdiff_t i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}
; gcc on x64 with -march=skylake

foo(int*, long):
        vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
        vpsrldq xmm1, xmm0, 8
        vpaddd  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret
; clang on x64 with -march=skylake

foo(int*, long):                              # @foo(int*, long)
        vpbroadcastq    xmm0, qword ptr [rdi + 4*rsi + 8]
        vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rsi]
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret

Vectorized instructions are still used when using a narrower signed type:

#include <cstddef>

auto foo(int* v, int start)
{
    int sum = 0;

    for (int i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}
; gcc on x64 with -march=skylake

foo(int*, int):
        movsx   rsi, esi
        vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
        vpsrldq xmm1, xmm0, 8
        vpaddd  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret
; clang on x64 with -march=skylake

foo(int*, int):                              # @foo(int*, int)
        movsxd  rax, esi
        vpbroadcastq    xmm0, qword ptr [rdi + 4*rax + 8]
        vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rax]
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret

Value range calculations

The compiler keeps track of the variables' range of possible values at each point in the program, i.e. for code such as

int x = foo();
if (x > 0) {
  int y = x + 5;
  int z = y / 4;
}

it determines that x has the range [1, INT_MAX] inside the if statement, and can thus determine that y has the range [6, INT_MAX], as overflow is not allowed. The next line can then be optimized to int z = y >> 2;, as the compiler knows that y is non-negative.

auto foo(int x)
{
    if (x <= 0)
        __builtin_unreachable(); // GCC/Clang hint: x > 0 from here on

    return (x + 5) / 4; // x + 5 cannot wrap, so this compiles to (x + 5) >> 2
}
foo(int):
        lea     eax, [rdi+5]
        sar     eax, 2
        ret

The undefined overflow helps optimizations that need to compare two values (as the wrapping case would give possible values of the form [INT_MIN, (INT_MIN+4)] or [6, INT_MAX], which prevents all useful comparisons with < or >), such as the following (a small sketch follows the list):

  • Changing comparisons x<y to true or false if the ranges for x and y do not overlap
  • Changing min(x,y) or max(x,y) to x or y if the ranges do not overlap
  • Changing abs(x) to x or -x if the range does not cross 0
  • Changing x/c to x>>log2(c) if x>0 and the constant c is a power of 2
  • Changing x%c to x&(c-1) if x>0 and the constant c is a power of 2
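A minimal sketch of the abs rule above (my example; __builtin_unreachable is the same GCC/Clang hint used earlier to establish a range):

#include <cstdlib>

int positive_abs(int x)
{
    if (x <= 0)
        __builtin_unreachable(); // x is in [1, INT_MAX] from here on

    // x + 1 cannot wrap (that would be UB), so its range is [2, INT_MAX];
    // the range never crosses 0, so std::abs(x + 1) should fold to x + 1.
    return std::abs(x + 1);
}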

Loop analysis and optimization

The canonical example of why undefined signed overflow helps loop optimizations is that loops like

for (int i = 0; i <= m; i++)

are guaranteed to terminate when signed overflow is undefined. This helps architectures that have specific loop instructions, as they generally do not handle infinite loops.
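A sketch of the termination argument (my example): because ++i on a signed int cannot wrap, the compiler can also derive an exact trip count, which feeds the vectorizer:

// If m were INT_MAX, i++ would have to overflow, which is UB, so the
// compiler can ignore that case and prove the loop runs exactly m + 1
// times (for m >= 0). Under -fwrapv the m == INT_MAX case would make
// the loop infinite and the trip count is no longer provable.
long sum_first(const int* v, int m)
{
    long s = 0;
    for (int i = 0; i <= m; i++)
        s += v[i];
    return s;
}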

But undefined signed overflow helps many more loop optimizations. All analyses, such as determining the number of iterations, transforming induction variables, and keeping track of memory accesses, use everything from the previous sections in order to do their work. In particular, the set of loops that can be vectorized is severely reduced when signed overflow is allowed.

bolov
  • "x + 10 > x holds true now" - no it doesn't. It could format your hard disk if x happens to equal INT_MAX-5. – user253751 May 09 '19 at 02:33
  • Very informative answer (the blog is a good read indeed). Since all these optimisations are only possible for signed operands, the conclusion is counter-intuitive: using `size_t` index values for many computations should lead to less efficient code because of modulo arithmetic semantics. This is a good argument to include `ssize_t` in the C Standard, or some unsigned type without modulo arithmetic. – chqrlie May 09 '19 at 05:50
  • fascinating. nice find. – Richard Hodges May 09 '19 at 06:18
  • @immibis I meant it holds true for all valid code. UB is not valid code. – bolov May 09 '19 at 09:41
  • @chqrlie I've heard at cppcon prominent members of the standard committee acknowledging that unsigned size was a mistake in hindsight. The same goes for the wraparound behavior of unsigned (it would have been better for normal types to have UB on overflow and to have separate types with wraparound behavior for when you explicitly need that). – bolov May 09 '19 at 09:45
  • @bolov The compiler is allowed to assume it's true, but it's not required to, so you can't. – user253751 May 09 '19 at 21:46
  • @immibis no, it is required by the standard. Just like it is required by the standard to assume a pointer that is dereferenced is not null. – bolov May 09 '19 at 22:48
  • @bolov Really? The compiler is *required* to assume that `i + 10 > i` is true? So `int i = INT_MAX; if(i + 10 > i) printf("hello!\n");` is *required* to print "hello!\n" and is not allowed to format my hard disk? – user253751 May 09 '19 at 23:31
  • @immibis you are not getting it. Yes!! The compiler is required to assume that `i + 10 > i`, so it will generate code that is valid only when `i <= INT_MAX - 10`. For `i > INT_MAX - 10` the code is not valid, the standard does not mandate any behavior, and the compiler can generate code under the assumption `i <= INT_MAX - 10`. For invalid code (undefined behavior) anything can happen, because the compiler isn't required to generate any kind of behavior. – bolov May 10 '19 at 00:07
  • @bolov It is also allowed to generate code where `i + 10 > i` evaluates to false in this case. – user253751 May 10 '19 at 00:11
  • @immibis honestly it's like talking to a wall – bolov May 10 '19 at 00:12
  • I know right? "With UB you get C/C++ signed integers to behave arithmetically as you would expect mathematically." is pretty clearly defining UB. – user253751 May 10 '19 at 00:13
  • @immibis do you see how the compiler transforms `x + 10 >= x;` into `mov eax, 1`? That's precisely because it can assume `x + 10 >= x` will always be true (and I say it again for the last time: **for valid code**) – bolov May 10 '19 at 00:26
  • The compiler can assume that. The programmer can't. – user253751 May 10 '19 at 00:31
  • @immibis ... are you trolling? 1st: when were we talking about the programmer? All the discussion, the entire post and comments are about what the compiler can assume (and what can optimize in consequence). 2nd also the programmer can assume that on valid code `x + 10 >= x`. ......... – bolov May 10 '19 at 00:36
  • @bolov The programmer has to know whether `x + 10 >= x` would be true on a compiler that had defined overflow behaviour, in order to know whether the code is valid. They can't reason "this code is valid therefore I know `x <= INT_MAX - 10`". They have to reason "`x <= INT_MAX - 10` because of something else, therefore this code is valid." – user253751 May 10 '19 at 00:40
  • @immibis you are doing this on purpose at this point. You are moving the discussion away from the original problem. My statement is "the compiler will assume that, for valid code, `x + 10 >= x` will always be true - this is mandated by the standard - so the compiler can optimize the expression to `true`". That was always my statement. That's what I am debating (and what you contested initially). You want to talk now about how the programmer must make sure he doesn't write code that has UB and therefore is invalid, that's another discussion. – bolov May 10 '19 at 00:46
  • You did not say "the compiler will assume that ..." - you simply said "..." which is a statement of fact, not a statement of an assumption that if not true will break things. – user253751 May 10 '19 at 01:01
  • I think I finally see your point. Yes, I did not say "the compiler" and initially didn't say "for valid code". I thought however that it was understood, as we are talking exclusively about what the compiler can do with given code. It looks to me now that you took "x + 10 >= x will always be true" as a statement meaning "whenever you see `x + 10 >= x` in C code you, the programmer, can rest assured that it will always be true". That was never my intention and I think that from the context of the post it is clear what I am talking about. I apologise for assuming malicious intent. – bolov May 10 '19 at 01:14
  • "Trolling" and talking past each other often look horribly alike. – curiousguy May 11 '19 at 14:54
  • There are a variety of useful transforms that can be facilitated by declaring that integer overflow is UB, but could also be facilitated by specifying that some aspects of behavior are Unspecified. There is a difference between allowing a compiler to process `x+y > z` as arbitrarily yielding 0 or 1 without side effects in case of overflow, versus allowing arbitrarily disruptive side effects. If the only situations where it would matter whether the function returns 0 or 1 are those where no overflow occurs, treating an overflow in that expression as arbitrarily yielding 0 or 1... – supercat Feb 16 '22 at 20:31
  • ...may be sufficient to allow optimizations that would not be possible if a programmer included explicit code to prevent overflow at all costs. – supercat Feb 16 '22 at 20:32

Not quite an example of optimization, but one useful consequence of undefined behaviour is the -ftrapv command-line switch of GCC/Clang. It inserts code which crashes your program on integer overflow.

It won't work on unsigned integers, in accordance with the idea that unsigned overflow is intentional.

The Standard's wording on signed integer overflow ensures that people won't write overflowing code on purpose, so -ftrapv is a useful tool to discover unintentional overflow.
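A minimal sketch (my example): compiled without optimization, e.g. g++ -O0 -ftrapv trapv.cpp, the overflow below aborts at run time instead of silently wrapping:

#include <climits>
#include <cstdio>

int main()
{
    volatile int x = INT_MAX; // volatile keeps the overflow out of constant folding
    int y = x + 1;            // with -ftrapv this signed overflow calls abort()
    std::printf("%d\n", y);   // not reached when the trap fires
}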

anatolyg
  • Yes. Producing an incorrect value is a serious issue, and a correct-looking value is even more problematic. In many cases there is no expectation on the result, so any printed value might be taken seriously. That could even cause security vulnerabilities. – curiousguy May 13 '19 at 07:30
  • The Standard is agnostic as to whether a construct that invokes UB should be regarded as erroneous or non-portable. If a program never deliberately accesses e.g. a `char[10][10]` with an inner subscript greater than nine, having an implementation trap if it does so may be useful. On the other hand, some tasks could be accomplished more efficiently on an implementation that would allow the inner subscript to access data on multiple rows than one which would require that a programmer replace a loop which would iterate over the items in multiple rows with a double loop that... – supercat Feb 16 '22 at 21:05
  • ...handles each row separately. While the Standard would allow an implementation to assume it will never be fed anything other than Strictly Conforming programs, that does not mean such an assumption would be reasonable for most implementations. – supercat Feb 16 '22 at 21:06

Here's an actual little benchmark: bubble sort. I've compared timings without and with -fwrapv (i.e. with signed overflow as UB versus defined wrapping). Here are the results (seconds):

                   -O3     -O3 -fwrapv    -O1     -O1 -fwrapv
Machine1, clang    5.2     6.3            6.8     7.7
Machine2, clang-8  4.2     7.8            6.4     6.7
Machine2, gcc-8    6.6     7.4            6.5     6.5

As you can see, the not-UB (-fwrapv) version is almost always slower; the largest difference is pretty big, at 1.85x.

Here's the code. Note that I intentionally chose an implementation which should produce a larger difference for this test.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

void bubbleSort(int *a, long n) {
        bool swapped;
        for (int i = 0; i < n-1; i++) {
                swapped = false;
                for (int j = 0; j < n-i-1; j++) {
                        if (a[j] > a[j+1]) {
                                int t = a[j];
                                a[j] = a[j+1];
                                a[j+1] = t;
                                swapped = true;
                        }
                }

                if (!swapped) break;
        }
}

int main() {
        int a[8192];

        for (int j=0; j<100; j++) {
                for (int i=0; i<8192; i++) {
                        a[i] = rand();
                }

                bubbleSort(a, 8192);
        }
}
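To reproduce (the exact command lines are my guess, the answer does not give them), compile the same file both ways and time it, e.g. `g++ -O3 bubble.cpp -o fast && time ./fast` versus `g++ -O3 -fwrapv bubble.cpp -o wrapv && time ./wrapv`.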
geza

The answer is actually in your question:

Yet most CPUs implement signed arithmetics with defined semantics

I can't think of a CPU that you can buy today that does not use two's complement arithmetic for signed integers, but that wasn't always the case.

The C language was invented in 1972. Back then, IBM 7090 mainframes still existed. Not all computers were two's complement.

To have defined the language (and overflow behaviour) around two's complement would have been prejudicial to code generation on machines that weren't.

Furthermore, as has already been said, specifying that signed overflow is UB allows the compiler to produce better code, because it can discount code paths that result from signed overflow, assuming that this will never happen.

If I understand correctly that the intent is to clamp the sum of a and b to 0...INT_MAX without wraparound, I can think of two ways to write this function in a compliant way.

First, the inefficient general case that will work on all CPUs:

#include <limits>

int sum_max(int a, unsigned char b) {
    if (a > std::numeric_limits<int>::max() - b) // b promotes to int; no overflow here
        return std::numeric_limits<int>::max();
    else
        return a + b;
}

Second, the surprisingly efficient two's complement specific way:

#include <cstring>
#include <limits>

int sum_max2(int a, unsigned char b) {
    unsigned int buffer;
    std::memcpy(&buffer, &a, sizeof(a)); // reinterpret the bits of a as unsigned
    buffer += b;                         // unsigned arithmetic wraps, no UB
    if (buffer > std::numeric_limits<int>::max())
        buffer = std::numeric_limits<int>::max();
    std::memcpy(&a, &buffer, sizeof(a));
    return a;
}

The resulting assembler can be seen here: https://godbolt.org/z/F42IXV

Richard Hodges
  • Your 2's complement specific way has different semantics: it will return `INT_MAX` for negative values of `a` less than `-b`. – chqrlie May 09 '19 at 05:24
  • @chqrlie I must have misunderstood the function's requirements. I assumed, as we were clamping to positive numbers, that the input would be guaranteed to be positive. – Richard Hodges May 09 '19 at 06:03
  • The silly example function only tests for positive overflow because `b` is intrinsically positive, but `a` can have a negative value. It is just an example, and your `sum_max` version is a correct way to achieve the goal in a portable manner, but I have seen legacy code use the test I posted, which was never correct but *just worked* until about 10 years ago. – chqrlie May 09 '19 at 06:08
  • As a side note, C++ just recently removed the alternative representations of signed integers. Two's complement is now the only supported representation. – chqrlie May 09 '19 at 06:10
  • @chqrlie That is true as of C++20, but I didn't mention it because overflowing signed integers remain UB for all the previously mentioned reasons around optimisation. I didn't want to give the wrong impression that the legacy code will become OK. It won't. – Richard Hodges May 09 '19 at 06:20