
There is a talk, CppCon 2016: Chandler Carruth “Garbage In, Garbage Out: Arguing about Undefined Behavior...”, where Mr. Carruth shows an example from the bzip2 code. It uses uint32_t i1 as an index. On a 64-bit system the array access block[i1] will then do *(block + i1). The issue is that block is a 64-bit pointer whereas i1 is a 32-bit number. The index arithmetic might overflow, and since unsigned integers have defined (wraparound) overflow behavior, the compiler has to emit extra instructions to ensure that wraparound actually happens even on a 64-bit system.
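To make the pattern concrete, here is a minimal sketch of my own (not the actual bzip2 source; the names block and i1 merely follow the talk):

#include <cstdint>

// A 32-bit unsigned index into a 64-bit address space: since ++i1
// must wrap to 0 after 0xFFFFFFFF, the compiler cannot simply keep
// i1 in a 64-bit register and advance it together with the pointer.
uint8_t sum_bytes(const uint8_t *block, uint32_t i1, uint32_t n) {
    uint8_t s = 0;
    for (uint32_t k = 0; k < n; ++k) {
        s += block[i1];  // *(block + i1): 64-bit pointer, 32-bit index
        ++i1;            // unsigned: defined to wrap at 2^32
    }
    return s;
}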

I would also like to show this with a simple example, so I have tried the ++i code with various signed and unsigned integer types. The following is my test code:

#include <cstdint>

void test_int8() { int8_t i = 0; ++i; }
void test_uint8() { uint8_t i = 0; ++i; }

void test_int16() { int16_t i = 0; ++i; }
void test_uint16() { uint16_t i = 0; ++i; }

void test_int32() { int32_t i = 0; ++i; }
void test_uint32() { uint32_t i = 0; ++i; }

void test_int64() { int64_t i = 0; ++i; }
void test_uint64() { uint64_t i = 0; ++i; } 

With g++ -c test.cpp and objdump -d test.o I get assembly listings like this:

000000000000004e <_Z10test_int32v>:
  4e:   55                      push   %rbp
  4f:   48 89 e5                mov    %rsp,%rbp
  52:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
  59:   83 45 fc 01             addl   $0x1,-0x4(%rbp)
  5d:   90                      nop
  5e:   5d                      pop    %rbp
  5f:   c3                      retq   

To be honest, my knowledge of x86 assembly is rather limited, so the following conclusions and questions may be very naive.

The first two instructions seem to be just the function prologue, and the last three the epilogue and return. Removing these lines, the following kernels are left for the various data types:

  • int8_t:

       4:   c6 45 ff 00             movb   $0x0,-0x1(%rbp)
       8:   0f b6 45 ff             movzbl -0x1(%rbp),%eax
       c:   83 c0 01                add    $0x1,%eax
       f:   88 45 ff                mov    %al,-0x1(%rbp)
    
  • uint8_t:

      19:   c6 45 ff 00             movb   $0x0,-0x1(%rbp)
      1d:   80 45 ff 01             addb   $0x1,-0x1(%rbp)
    
  • int16_t:

      28:   66 c7 45 fe 00 00       movw   $0x0,-0x2(%rbp)
      2e:   0f b7 45 fe             movzwl -0x2(%rbp),%eax
      32:   83 c0 01                add    $0x1,%eax
      35:   66 89 45 fe             mov    %ax,-0x2(%rbp)
    
  • uint16_t:

      40:   66 c7 45 fe 00 00       movw   $0x0,-0x2(%rbp)
      46:   66 83 45 fe 01          addw   $0x1,-0x2(%rbp)
    
  • int32_t:

      52:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
      59:   83 45 fc 01             addl   $0x1,-0x4(%rbp)
    
  • uint32_t:

      64:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
      6b:   83 45 fc 01             addl   $0x1,-0x4(%rbp)
    
  • int64_t:

      76:   48 c7 45 f8 00 00 00    movq   $0x0,-0x8(%rbp)
      7d:   00 
      7e:   48 83 45 f8 01          addq   $0x1,-0x8(%rbp)
    
  • uint64_t:

      8a:   48 c7 45 f8 00 00 00    movq   $0x0,-0x8(%rbp)
      91:   00 
      92:   48 83 45 f8 01          addq   $0x1,-0x8(%rbp)
    

Comparing the signed with the unsigned versions I would have expected from Mr. Carruth's talk that extra masking instructions are generated.

But for int8_t we store a zero byte (movb) to the stack slot at -0x1(%rbp), then load it zero-extended to 32 bits (movzbl) into the accumulator %eax. The addition (add) is performed without any size suffix, which I assumed is because the overflow is not defined anyway. The unsigned version directly uses byte-sized instructions.

Either both add and addb/addw/addl/addq take the same number of cycles (latency) because the Intel Sandy Bridge CPU has hardware adders for all sizes, or the 32-bit unit does the masking internally and therefore has a longer latency.

I have looked for a table of latencies and found the instruction tables at agner.org. There, for each CPU (Sandy Bridge in this case), there is only one entry for ADD, and I do not see entries for the other width variants. The Intel 64 and IA-32 Architectures Optimization Reference Manual also seems to list only a single add instruction.

Does this mean that on x86 the ++i of non-native-width integers is actually faster for unsigned types because fewer instructions are generated?

Martin Ueding
  • One thing to note is that those functions have no side effects and return nothing, so the compiler could easily optimize them away entirely ([gcc does at `-Og`](https://godbolt.org/g/taqP9A)). You should probably take the integer as an argument and return the result. – Justin Apr 11 '18 at 19:01
  • Are you building without optimization? If you enable it, are the results different? Performance measuring and benchmarking should always be done on optimized code. – Some programmer dude Apr 11 '18 at 19:02
  • For such tests it is important to use at least the -O2 optimization level, together with being explicit about the -march and -mtune switches. – Severin Pappadeux Apr 11 '18 at 19:03
  • And counting instructions on architectures like x86 or ARM is pointless. More instructions could result in faster execution. Instead, profile your code **iff** you have a performance problem and optimise hotspots. – too honest for this site Apr 11 '18 at 19:35
  • The lack of a size suffix on the `add` instruction just means that the size is implied by the operands. It doesn't mean that overflow is undefined. Overflow is always defined at the processor level. – Myria Apr 11 '18 at 19:35
  • @Olaf: It's not about *hand-written* asm, but it's very much about the assembly output of compilers. People who know C++ and assembly are much better prepared to answer this question than people who know just C++. Unfortunately the question is based on such an incorrect experiment that it's just totally bogus. There's no mixing of integers with pointers here to even give the compiler room to optimize anything, and it's all about the anti-optimized `-O0` debug-mode code. – Peter Cordes Apr 11 '18 at 20:01
  • @PeterCordes: That's what the machine-tags are for. There is no requirement the compiler outputs Assembly code at all. Nor does the CPU execute Assembly language. – too honest for this site Apr 11 '18 at 20:07
  • @Olaf: The answer here is basically the same for other architectures, especially ones that have 32 and 64-bit operand sizes. x86-64 is just one specific case. For compilers that can output machine code directly, humans will still look at disassembly, or ask the compiler to produce asm instead of machine code (like with `clang -O3 -S` or whatever option MSVC has). Other questions like this are correctly tagged `[assembly]`. Having gold badges in [assembly], [c++], and [x86], I think I'm qualified to judge when one or both are appropriate. – Peter Cordes Apr 11 '18 at 20:20
  • @PeterCordes: "The answer here is basically the same for other architectures, especially ones that have 32 and 64-bit operand sizes." - That's not really correct. THere are some architectures which do have different performance for multi-word arithmetic. Namely for ARM there are different variants possible. I agree, though the question is very poorly researched and production code is rarely that straight. That's why the question is in fact off-topic. I can't&won't judge about your expertise, but I know about mine well enough. – too honest for this site Apr 11 '18 at 20:29
  • The first paragraph isn't an accurate summary of Chandler's explanation, so the question seems predicated on a misconception. – Adrian McCarthy Apr 11 '18 at 23:43

2 Answers


There are two parts to this question: Chandler's point about optimizations based on overflow being undefined, and the differences you found in the assembly output.

Chandler's point is that if overflow is undefined behavior, then the compiler can assume that it cannot happen. Consider the following code:

typedef int T;  // signed index type: overflow is undefined, so the compiler may assume it never happens

void CopyInts(int *dest, const int *src) {
    T x = 0;
    for (; src[x]; ++x) {  // copy until a zero element is found
        dest[x] = src[x];
    }
}

Here, the compiler can safely change the for loop to the following:

    while (*src) {
        *dest++ = *src++;
    }

That's because the compiler does not have to worry about the case that x overflows. If the compiler had to handle x wrapping from 0xFFFFFFFF back to 0, the effective source and destination addresses would suddenly jump back by 16 GB (0x100000000 elements of 4 bytes each), so the simple pointer-increment transformation above would not work.
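To make the wraparound hazard concrete, here is a small sketch of my own (not from the talk or the code above), showing what a 32-bit unsigned index is required to do:

#include <cassert>
#include <cstdint>

int main() {
    uint32_t x = 0xFFFFFFFFu;
    ++x;  // defined behavior for unsigned types: wraps to 0
    assert(x == 0);
    // Used as an index into an array of 4-byte ints, src[x] would now
    // address src[0], which is 16 GB below src + 0xFFFFFFFF.
}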

At the assembly level, the original CopyInts above compiles to (with GCC 7.3.0 for x86-64, -O2):

_Z8CopyIntsPiPKii:
  movl (%rsi), %edx
  testl %edx, %edx
  je .L1
  xorl %eax, %eax
.L3:
  movl %edx, (%rdi,%rax)
  addq $4, %rax
  movl (%rsi,%rax), %edx
  testl %edx, %edx
  jne .L3
.L1:
  rep ret

If we change T to be unsigned int, we get this slower code instead:

_Z8CopyIntsPiPKij:
  movl (%rsi), %eax
  testl %eax, %eax
  je .L1
  xorl %edx, %edx
  xorl %ecx, %ecx
.L3:
  movl %eax, (%rdi,%rcx)
  leal 1(%rdx), %eax
  movq %rax, %rdx
  leaq 0(,%rax,4), %rcx
  movl (%rsi,%rax,4), %eax
  testl %eax, %eax
  jne .L3
.L1:
  rep ret

Here, the compiler keeps x as a separate variable so that its wraparound at 2^32 is handled properly.

Instead of relying on signed overflow being undefined for performance, you can use an index type that is the same size as a pointer. Such a variable could only overflow at the same point where the pointer arithmetic itself would, which is also undefined. Hence, at least for x86-64, size_t would also work as T to get the better code.
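For illustration, that suggestion is a one-line change to the example above (a sketch; I have not verified the exact GCC output for this variant):

#include <cstddef>

typedef std::size_t T;  // pointer-sized unsigned index type

void CopyInts(int *dest, const int *src) {
    T x = 0;
    for (; src[x]; ++x) {
        dest[x] = src[x];
    }
}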

Now for the second part of your question: the add instruction. The suffixes on the add instruction come from the so-called "AT&T" style of x86 assembly language. In AT&T assembly language, the operands are in the opposite order from the way Intel writes them, and instruction sizes are disambiguated by adding a suffix to the mnemonic instead of annotating an operand with something like dword ptr as in the Intel case.

Example:

Intel: add dword ptr [eax], 1

AT&T: addl $1, (%eax)

These are the same instruction, just written differently. The l takes the place of dword ptr.

Where the suffix is missing from an AT&T instruction, it's because it's not required: the size is implied by the operands.

add $1, %eax

The l suffix is unnecessary here because the instruction is obviously 32-bit: %eax is a 32-bit register.

In short, it has nothing to do with overflow. Overflow is always defined at the processor level. On some architectures, such as when using the non-u instructions on MIPS, overflow throws an exception, but it's still defined. C and C++ are the only major languages that make signed integer overflow undefined behavior.

Myria
  • They made it undefined because [they allow other signed formats](https://stackoverflow.com/q/18195715/995714), not only 2's complement. – phuclv Apr 12 '18 at 02:19
  • Note that in this case, the compiler is just doing a bad job. It could have used scaled-index addressing modes for the load and store, and `inc %eax` instead of `inc %rax`, to correctly implement the exact behaviour of the C++ source. `leaq 0(,%rax,4), %rcx` to set up for the next iterations is particularly braindead. But yes, the extra requirement to wrap the index constrains the compiler in a way that makes it shoot itself in the foot here. Notice that the `int` version is still using indexed addressing modes, but the index is 64-bit, and scaled up by 4 to a byte offset. – Peter Cordes Apr 12 '18 at 03:02
  • Suggestion: you could show the asm output for `gcc -O2 -fwrapv`, which makes signed integer overflow well-defined as wrapping. – Peter Cordes Apr 13 '18 at 10:22

> Either both `add` and `addb`/`addw`/`addl`/`addq` take the same number of cycles (latency) because the Intel Sandy Bridge CPU has hardware adders for all sizes, or the 32-bit unit does the masking internally and therefore has a longer latency.

First of all, it's a 64-bit adder because it supports qword add with the same performance.

In hardware, masking bits doesn't take a whole extra clock cycle; one clock cycle is many gate-delays long. An enable/disable control signal can zero the results from the high half (for 32-bit operand size), or stop carry-propagation at 16 or 8 bits (for smaller operand-sizes that leave the upper bits unmodified instead of zero-extending).

So each execution port with an integer ALU execution unit probably uses the same adder transistors for all operand sizes, using control signals to modify its behaviour. Maybe even using it for XOR as well (by blocking all the carry signals).
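As a toy software model of that idea (purely illustrative, and not how the silicon is actually organized):

#include <cstdint>

// Ripple-carry adder sketch: with the carry chain enabled it computes
// a + b (mod 2^32); with the carry chain suppressed, every bit position
// degenerates to ai ^ bi, so the whole unit computes XOR instead.
uint32_t ripple(uint32_t a, uint32_t b, bool carry_enable) {
    uint32_t sum = 0, carry = 0;
    for (int bit = 0; bit < 32; ++bit) {
        uint32_t ai = (a >> bit) & 1;
        uint32_t bi = (b >> bit) & 1;
        sum |= (ai ^ bi ^ carry) << bit;
        carry = carry_enable ? ((ai & bi) | (carry & (ai ^ bi))) : 0;
    }
    return sum;
}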


I was going to write more about your misunderstanding of the optimization issue, but Myria already covered it.

See also What Every C Programmer Should Know About Undefined Behavior, an LLVM blog post that explains some of the ways UB allows optimization, including specifically promoting a counter to 64-bit or optimizing it away into pointer increments, instead of implementing signed wrap-around like you'd get if signed integer overflow was strictly defined as wrapping. (e.g. if you compile with gcc -fwrapv, the opposite of -fstrict-overflow)
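A minimal sketch of the kind of loop that post discusses (my own example; the function name is made up):

// With a signed 32-bit counter, the compiler may promote i to a 64-bit
// register or rewrite the loop with pointer increments, because signed
// overflow is UB and can be assumed never to happen. With 'unsigned i'
// (or with -fwrapv) it would instead have to implement wraparound at 2^32.
void zero_array(float *array, int n) {
    for (int i = 0; i != n; ++i)
        array[i] = 0.0f;
}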


Your un-optimized compiler output is pointless and doesn't tell us anything. The x86 add instruction implements both unsigned and signed-2's-complement addition, because those are both the same binary operation. The different code-gen at -O0 is merely an artefact of compiler internals, not anything fundamental that would happen in real code (with -O2 or -O3).
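To see that last point in isolation, consider this pair (hypothetical function names; at -O2 both should compile to the same single instruction, something like leal (%rdi,%rsi), %eax):

#include <cstdint>

// Signed and unsigned 32-bit addition are the same binary operation,
// so both functions produce identical machine code once optimized.
int32_t add_signed(int32_t a, int32_t b) { return a + b; }
uint32_t add_unsigned(uint32_t a, uint32_t b) { return a + b; }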

Peter Cordes