There is a talk, CppCon 2016: Chandler Carruth, “Garbage In, Garbage Out: Arguing about Undefined Behavior...”, where Mr. Carruth shows an example from the bzip code. It uses a uint32_t i1 as an index. On a 64-bit system the array access block[i1] then does *(block + i1). The issue is that block is a 64-bit pointer whereas i1 is a 32-bit number. The addition might overflow, and since unsigned integers have defined overflow (wraparound) behavior, the compiler needs to emit extra instructions to ensure that this wraparound is indeed honored even on a 64-bit system.
I would like to demonstrate this with a simple example, so I have tried the ++i operation with various signed and unsigned integer types. The following is my test code:
#include <cstdint>
void test_int8() { int8_t i = 0; ++i; }
void test_uint8() { uint8_t i = 0; ++i; }
void test_int16() { int16_t i = 0; ++i; }
void test_uint16() { uint16_t i = 0; ++i; }
void test_int32() { int32_t i = 0; ++i; }
void test_uint32() { uint32_t i = 0; ++i; }
void test_int64() { int64_t i = 0; ++i; }
void test_uint64() { uint64_t i = 0; ++i; }
With g++ -c test.cpp
and objdump -d test.o
I get assembly listings like
this:
000000000000004e <_Z10test_int32v>:
4e: 55 push %rbp
4f: 48 89 e5 mov %rsp,%rbp
52: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
59: 83 45 fc 01 addl $0x1,-0x4(%rbp)
5d: 90 nop
5e: 5d pop %rbp
5f: c3 retq
To be honest, my knowledge of x86 assembly is rather limited, so the following conclusions and questions may be naive.
The first two instructions appear to be the function prologue and the last three the epilogue. Removing these lines leaves the following kernels for the various data types:
int8_t:

4: c6 45 ff 00 movb $0x0,-0x1(%rbp)
8: 0f b6 45 ff movzbl -0x1(%rbp),%eax
c: 83 c0 01 add $0x1,%eax
f: 88 45 ff mov %al,-0x1(%rbp)

uint8_t:

19: c6 45 ff 00 movb $0x0,-0x1(%rbp)
1d: 80 45 ff 01 addb $0x1,-0x1(%rbp)

int16_t:

28: 66 c7 45 fe 00 00 movw $0x0,-0x2(%rbp)
2e: 0f b7 45 fe movzwl -0x2(%rbp),%eax
32: 83 c0 01 add $0x1,%eax
35: 66 89 45 fe mov %ax,-0x2(%rbp)

uint16_t:

40: 66 c7 45 fe 00 00 movw $0x0,-0x2(%rbp)
46: 66 83 45 fe 01 addw $0x1,-0x2(%rbp)

int32_t:

52: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
59: 83 45 fc 01 addl $0x1,-0x4(%rbp)

uint32_t:

64: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
6b: 83 45 fc 01 addl $0x1,-0x4(%rbp)

int64_t:

76: 48 c7 45 f8 00 00 00 movq $0x0,-0x8(%rbp)
7d: 00
7e: 48 83 45 f8 01 addq $0x1,-0x8(%rbp)

uint64_t:

8a: 48 c7 45 f8 00 00 00 movq $0x0,-0x8(%rbp)
91: 00
92: 48 83 45 f8 01 addq $0x1,-0x8(%rbp)
Comparing the signed with the unsigned versions I would have expected from Mr. Carruth's talk that extra masking instructions are generated.
But for int8_t we first store a zero byte into the stack slot -0x1(%rbp) (movb), then load and zero-extend that byte into the 32-bit register %eax (movzbl). The addition (add) is then performed on the full register, presumably because the overflow is not defined anyway, and only the low byte %al is stored back. The unsigned version instead uses a single byte-sized addb directly on memory.
Either add and the addb/addw/addl/addq variants all take the same number of cycles (latency) because the Intel Sandy Bridge CPU has hardware adders for all operand sizes, or the 32-bit unit does the masking internally and therefore has a longer latency.
I have looked for a table of latencies and found the instruction tables at agner.org. There, for each CPU (Sandy Bridge in my case), there is only a single entry for ADD; I do not see separate entries for the different width variants. The Intel 64 and IA-32 Architectures Optimization Reference Manual likewise seems to list only a single add instruction.
Does this mean that on x86, ++i on integers of non-native width is actually faster for unsigned types because fewer instructions are generated?