Suppose I have a bitwise expression `f: int -> int` that depends only on the two lowest bytes of the input. In general, would it be any faster to compute `(uint32)f'((uint16)x)`, where `f': short -> short`? Or is there overhead involved in the conversion? Assume `f` is too complicated for the compiler to figure this out on its own.

3 Answers
Shorter integers are not faster. Generally speaking, the fastest types have the same size as a native CPU word (so use 32-bit integers on x86, and 64-bit integers on AMD64/x64). Additionally, unsigned integers are faster than signed integers for certain operations (see: performance of unsigned vs signed integers).
Non-word-sized integers are slower than word-sized integers because the processor has to sign- or zero-extend the value when loading it, and then truncate it when storing it; there can also be alignment issues (mainly when the ISA allows unaligned stores - though even word-sized integers can be unaligned).
Recent versions of C come with typedefs for known-fast types, with their names indicating the minimum width they can accommodate, for example `int_fast8_t` for the fastest type to perform octet operations with (even though it might be a 128-bit type when compiled).
In your question, you wrote you're only performing operations on 16-bit values ("the two lowest bytes of the input"), which means you'll want to use `uint_fast16_t`. You will want the `u` prefix (for `unsigned`) because the lower 16 bits of any integer (even `signed` integers, like `int`) represent an unsigned value.
These "fast integer" types are defined in the `stdint.h` header which comes with your compiler; it can be included directly with `#include <stdint.h>` or indirectly via `#include <inttypes.h>`. `stdint.h` will also include `limits.h`.
More information can be seen in the Wikibook on C:
https://en.wikibooks.org/wiki/C_Programming/stdint.h

> `stdint.h` is a header file in the C standard library, introduced in section 7.18 of the C99 standard, that allows programmers to write more portable code by providing a set of `typedef`s that specify exact-width integer types, together with macros defining the minimum and maximum allowable values for each type.

https://en.wikibooks.org/wiki/C_Programming/inttypes.h
Fast integer types | signed        | unsigned
-------------------+---------------+---------------
 8 bit             | int_fast8_t   | uint_fast8_t
16 bit             | int_fast16_t  | uint_fast16_t
32 bit             | int_fast32_t  | uint_fast32_t
64 bit             | int_fast64_t  | uint_fast64_t
C++
As you can see, these definitions are only mandated by C99 (not C90, nor C++03); however, C++11 improves C99 compatibility by incorporating `stdint.h` as `<cstdint>` (i.e. `#include <cstdint>`).
Microsoft Visual Studio's C++ compiler and toolchain (as of Visual Studio 2017) is not a conforming C99 compiler (this is by design); however, as Visual C++ 2012 and higher is a conforming C++11 compiler, you can use `<cstdint>` as intended - it is only Visual Studio 2010 and older (2008, 2005, etc.) that lack the header.
I note that Microsoft endorses the use of the Clang toolchain with Visual Studio if you wish to compile C99 code: https://blogs.msdn.microsoft.com/vcblog/2015/12/04/clang-with-microsoft-codegen-in-vs-2015-update-1/
Java:
If you're using Java then the size of most primitive types is strictly defined (as opposed to C and C++, where types only need to be capable of storing values within a certain minimum range - it's valid for a C compiler to use a 128-bit native integer for `unsigned short`): https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html - I don't believe there is any way to achieve "fast" native integer arithmetic in pure Java.
C# / .NET:
In C# / .NET the story is similar to Java (`System.Byte`, `Int16`, etc. are strictly defined), and in C# the `short`, `int` and `long` keywords are always aliases for `Int16`, `Int32` and `Int64`, unlike in C where their definition is vendor-defined.
.NET does have the platform-sized `System.IntPtr` type, but you cannot perform bitwise or arithmetic operations on it besides addition and subtraction (and the overhead of using the type in the first place would wipe out any performance gains). Note that `IntPtr` is not necessarily word-sized either: `sizeof(void*)` does not have to equal `sizeof(uint_fast32_t)`, because pointers must be large enough to store every possible valid address, yet the native word size could be smaller - consider a CPU with a 32-bit word size but a 64-bit address bus, in which case `sizeof(void*) == 8` but `sizeof(uint_fast32_t) == 4`. The reverse is also possible: a 64-bit-word machine with only a 32-bit address bus (`sizeof(void*) == 4`, `sizeof(uint_fast32_t) == 8`).
Python, JavaScript, Ruby, PHP
These are the other 4 top languages on GitHub - they all share a weakly-typed design and so do not directly expose any "fast int" functionality or typing system (JavaScript doesn't even let you differentiate between `float` and `int` with its `Number` type)... but if you want performance you wouldn't be using these languages :)
Further reading:
- What is uint_fast32_t and why should it be used instead of the regular int and uint32_t?
- Exotic architectures the standards committees care about
- What's the difference between "int" and "int_fast16_t"?
In practice:
In typical desktop software development targeting x86 and x64 you can take many things for granted, but in more exotic environments things tend to be different: consider the NEC VR4300 (based on the MIPS R4300i, and used in the Nintendo 64) - it's capable of native 64-bit operations (i.e. it has a native 64-bit word size), yet 32-bit operations are actually faster than 64-bit operations, and it supports both 32-bit and 40-bit address spaces - which means that had `stdint.h` existed at the time (this was 1995, predating C99) the definitions for "least", "fast", and pointer integer types would be very different to x86.
- I meant actually the opposite: the value is dependent only on the two lowest bytes. I edited the question to fix this. – Elliot Gorokhovsky Jun 09 '17 at 00:14
- Same answer would apply; you are still filling one register at a time for computational purposes. – David C. Rankin Jun 09 '17 at 00:16
- Unfortunately mingw defines `int_fast8_t` as `char`, so the final result might still be not as good as `int` – phuclv Jun 09 '17 at 00:42
- 16-bit integer types are *particularly* slow on x86 (when running in 32-bit or 64-bit mode, obviously) because an operand-size override prefix must be prepended to every CPU instruction that works with 16-bit values. This creates a significant cost, since it bloats the code, puts pressure on the cache, increases decode time, and so forth. In fact, the effect is so significant that most of the time, compilers won't even emit code that works with 16-bit values: they'll promote them all to 32-bit values and work with them there, only truncating to store to memory. – Cody Gray - on strike Jun 09 '17 at 09:06
- (But not all compilers perform this optimization, so you might end up writing slow code.) The only time you should use 16-bit integers is when you are declaring an array or a struct or something else where you are concerned about memory space. The same, however, is not true for 8-bit values (`unsigned char`, etc.). Using those is completely free as far as performance goes on x86. However, there is no performance *advantage* to be had by using 8-bit values instead of 32-bit values, so as your answer says, you might as well just use 32-bit types, unless you are saving space in memory. – Cody Gray - on strike Jun 09 '17 at 09:09
As a rule of thumb, never use smaller-than-`int` types except for storage of large volumes of data where you only need a smaller range, and the larger type would waste lots of memory or hurt cache utilization or such.
On most cpu architectures, including x86, narrow types are at best no faster, and at worst slower, than 32-bit or larger types.

- Thanks. Why are they slower? Do they get padded internally? – Elliot Gorokhovsky Jun 09 '17 at 00:13
- @JohnZwinck: They may be marginally slower even on modern x86 because the code to operate on them is larger (size override prefix) and will exhaust L1 instruction cache sooner. On some older generations they were significantly slower. For example, on the original Pentium they couldn't pipeline. – R.. GitHub STOP HELPING ICE Jun 09 '17 at 13:09
int ifun ( int x )
{
return(x-3);
}
short sfun ( short x )
{
return(x-3);
}
unsigned short ufun ( unsigned short x )
{
return(x-3);
}
this is what we are talking about (ARM):
00000000 <ifun>:
0: e2400003 sub r0, r0, #3
4: e12fff1e bx lr
00000008 <sfun>:
8: e2400003 sub r0, r0, #3
c: e1a00800 lsl r0, r0, #16
10: e1a00840 asr r0, r0, #16
14: e12fff1e bx lr
00000018 <ufun>:
18: e2400003 sub r0, r0, #3
1c: e1a00800 lsl r0, r0, #16
20: e1a00820 lsr r0, r0, #16
24: e12fff1e bx lr
In order to properly honor the high-level code - which in the latter cases is a 16-bit value on a 32-bit target - something has to be done to clip the result. In the int case, values come in as 32 bits and go out as 32 bits, so nothing extra is needed: even though the input to the function may be a short, the caller has already clipped and sign-extended the value to 32 bits, so it can just be used by the next function.
another target (MIPS)
Disassembly of section .text:
00000000 <ifun>:
0: 03e00008 jr $31
4: 2482fffd addiu $2,$4,-3
00000008 <sfun>:
8: 2482fffd addiu $2,$4,-3
c: 00021400 sll $2,$2,0x10
10: 03e00008 jr $31
14: 00021403 sra $2,$2,0x10
00000018 <ufun>:
18: 2482fffd addiu $2,$4,-3
1c: 03e00008 jr $31
20: 3042ffff andi $2,$2,0xffff
the shorter signed value cost more

another target (x86-64)

Disassembly of section .text:
0000000000000000 <ifun>:
0: 8d 47 fd lea -0x3(%rdi),%eax
3: c3 retq
0000000000000010 <sfun>:
10: 8d 47 fd lea -0x3(%rdi),%eax
13: c3 retq
0000000000000020 <ufun>:
20: 8d 47 fd lea -0x3(%rdi),%eax
23: c3 retq
same instructions, the problem has to be solved elsewhere outside the function.
another target (AArch64)
Disassembly of section .text:
0000000000000000 <ifun>:
0: 51000c00 sub w0, w0, #0x3
4: d65f03c0 ret
0000000000000008 <sfun>:
8: 13003c00 sxth w0, w0
c: 51000c00 sub w0, w0, #0x3
10: d65f03c0 ret
0000000000000018 <ufun>:
18: 53003c00 uxth w0, w0
1c: 51000c00 sub w0, w0, #0x3
20: d65f03c0 ret
shorter costs more
a 16-bit target though (PDP-11)
00000000 <_ifun>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 1d40 0004 mov 4(r5), r0
8: 65c0 fffd add $-3, r0
c: 1585 mov (sp)+, r5
e: 0087 rts pc
00000010 <_sfun>:
10: 1166 mov r5, -(sp)
12: 1185 mov sp, r5
14: 1d40 0004 mov 4(r5), r0
18: 65c0 fffd add $-3, r0
1c: 1585 mov (sp)+, r5
1e: 0087 rts pc
00000020 <_ufun>:
20: 1166 mov r5, -(sp)
22: 1185 mov sp, r5
24: 1d40 0004 mov 4(r5), r0
28: 65c0 fffd add $-3, r0
2c: 1585 mov (sp)+, r5
2e: 0087 rts pc
the compiler considers int and short the same size so we don't incur a cost; as we hoped and expected, getting close to or matching the register size avoids the penalty...
same for this 16-bit-int target (AVR)
Disassembly of section .text:
00000000 <ifun>:
0: 03 97 sbiw r24, 0x03 ; 3
2: 08 95 ret
00000004 <sfun>:
4: 03 97 sbiw r24, 0x03 ; 3
6: 08 95 ret
00000008 <ufun>:
8: 03 97 sbiw r24, 0x03 ; 3
a: 08 95 ret
but don't match the target size and
long int ifun ( int x )
{
return(x-3);
}
short sfun ( short x )
{
return(x-3);
}
as expected you incur a penalty.
00000000 <ifun>:
0: 03 97 sbiw r24, 0x03 ; 3
2: 79 2f mov r23, r25
4: 68 2f mov r22, r24
6: 99 0f add r25, r25
8: 88 0b sbc r24, r24
a: 99 0b sbc r25, r25
c: 08 95 ret
0000000e <sfun>:
e: 03 97 sbiw r24, 0x03 ; 3
10: 08 95 ret
although it is up to the author of the backend, there is a reason why the native `int` type is there and why the spec talks about it the way it does. folks like to avoid that these days and use stdint types for some reason, but at a cost. yes, you can lose portability by using native types. performance, portability, maintenance: pick one or two, you can't have all three.
Just examples of what Dai was saying - upvote that answer. You were asking about x86 specifically, and in that case there was a sign penalty vs a size penalty. Except for examples like this, I personally use unsigned unless there is some absolute reason I have to use signed...
EDIT
Bitwise operations help greatly, as does x86 to some extent (reducing or erasing the performance difference). I am just demonstrating the general concept (above) that smaller is not better if "smaller" is relative to the size of the primary operations/registers for that instruction set. For an 8- or 16-bit machine, absolutely, more work is required if you use 32-bit variables, even for bitwise operations. For purely bitwise operations, I am trying to think of cases where the compiler would generate something different between the native size and smaller - something that would matter.
When in doubt just try it and see (compile then disassemble is sometimes all you need).
x86, which you tagged, has some benefits over others; I would need to do more research, because with these simple examples the problem is being pushed off to code outside the function. It can still make a difference though:
unsigned int ufun ( unsigned int x, unsigned int y )
{
return(x*y)+3;
}
unsigned char ucfun ( unsigned char x, unsigned char y )
{
return(x*y)+3;
}
0000000000000000 <ufun>:
0: 0f af fe imul %esi,%edi
3: 8d 47 03 lea 0x3(%rdi),%eax
6: c3 retq
0000000000000010 <ucfun>:
10: 89 f0 mov %esi,%eax
12: 0f af c7 imul %edi,%eax
15: 83 c0 03 add $0x3,%eax
18: c3 retq
so you can't make the generalization that it is neither slower nor faster on x86.
EDIT2
I can't decide if this is a fair bitwise 32/16-bit comparison
typedef struct
{
unsigned hello:5;
} MYSTRUCT;
unsigned int fun1 ( unsigned int x, MYSTRUCT m )
{
return(x<<m.hello);
}
unsigned short fun2 ( unsigned short x, MYSTRUCT m )
{
return(x<<m.hello);
}
00000000 <fun1>:
0: e201101f and r1, r1, #31
4: e1a00110 lsl r0, r0, r1
8: e12fff1e bx lr
0000000c <fun2>:
c: e201101f and r1, r1, #31
10: e1a00110 lsl r0, r0, r1
14: e1a00800 lsl r0, r0, #16
18: e1a00820 lsr r0, r0, #16
1c: e12fff1e bx lr
hmm, yeah okay that was fair.
typedef struct
{
unsigned hello:3;
} MYSTRUCT;
unsigned int fun1 ( unsigned int x, MYSTRUCT m )
{
return(x<<m.hello);
}
unsigned short fun2 ( unsigned short x, MYSTRUCT m )
{
return(x<<m.hello);
}
Disassembly of section .text:
00000000 <fun1>:
0: e2011007 and r1, r1, #7
4: e1a00110 lsl r0, r0, r1
8: e12fff1e bx lr
0000000c <fun2>:
c: e2011007 and r1, r1, #7
10: e1a00110 lsl r0, r0, r1
14: e1a00800 lsl r0, r0, #16
18: e1a00820 lsr r0, r0, #16
1c: e12fff1e bx lr
Interestingly, x86 incurs a penalty...
0000000000000000 <fun1>:
0: 89 f1 mov %esi,%ecx
2: 89 f8 mov %edi,%eax
4: 83 e1 07 and $0x7,%ecx
7: d3 e0 shl %cl,%eax
9: c3 retq
0000000000000010 <fun2>:
10: 89 f1 mov %esi,%ecx
12: 0f b7 c7 movzwl %di,%eax
15: 83 e1 07 and $0x7,%ecx
18: d3 e0 shl %cl,%eax
1a: c3 retq
okay, never mind: ARM masked the bits off as well. the shorter variable took an extra byte of machine code, which gets into that "why use xor rax,rax to zero a register" discussion... less machine code is cheaper and measurable, but you may have to work pretty hard to get it (or not, depending on how tricky you are)

- For multiple cases (for multiple targets) you've included irrelevant padding that isn't executed after the function's end. One example of this is the 64-bit 80x86 code, where all 3 functions are identical and use 2 instructions. – Brendan Jun 09 '17 at 02:32