Suppose I have a bitwise expression `f: int -> int` that depends only on the two lowest bytes of the input. In general, would it be any faster to compute `(uint32)f'((uint16)x)`, where `f': short -> short`? Or is there overhead involved in the conversion? Assume `f` is too complicated for the compiler to figure this out on its own.

3 Answers
Shorter integers are not faster. Generally speaking, the fastest types have the same size as a native CPU word (so use 32-bit integers on x86, and 64-bit integers on AMD64/x64). Additionally, unsigned integers are faster than signed integers for certain operations (see: performance of unsigned vs signed integers).
Non-word-sized integers are slower than word-sized integers because the processor has to sign- or zero-extend the value when loading it, and then truncate it when storing it; there can also be alignment issues (mainly when the ISA allows unaligned stores - though even word-sized integers can be unaligned).
Recent versions of C come with typedefs for known-fast types, with their names indicating the minimum width they can accommodate, for example `int_fast8_t` for the fastest type to perform octet operations with (even though it might be a 128-bit type when compiled).
In your question, you wrote you're only performing operations on 16-bit values ("the two lowest bytes of the input"), which means you'll want to use `uint_fast16_t`. You will want the `u` prefix (for `unsigned`) because the lower 16 bits of any integer (even `signed` integers, like `int`) represent an unsigned value.
These "fast integer" types are defined in the `stdint.h` header which comes with your compiler; it can be included directly with `#include <stdint.h>` or indirectly via `#include <inttypes.h>`. `stdint.h` will also include `limits.h`.
More information can be seen in the Wikibook on C:
https://en.wikibooks.org/wiki/C_Programming/stdint.h

> `stdint.h` is a header file in the C standard library, introduced in section 7.18 of the C99 standard, that allows programmers to write more portable code by providing a set of `typedef`s that specify exact-width integer types, together with macros defining the minimum and maximum allowable values for each type.

https://en.wikibooks.org/wiki/C_Programming/inttypes.h
Fast integer types | signed        | unsigned
-------------------+---------------+---------------
 8 bit             | int_fast8_t   | uint_fast8_t
16 bit             | int_fast16_t  | uint_fast16_t
32 bit             | int_fast32_t  | uint_fast32_t
64 bit             | int_fast64_t  | uint_fast64_t
C++
As you can see, these definitions are only mandated by C99 (not C90, nor C++03); however, C++11 improves C99 compatibility by incorporating `stdint.h` as `<cstdint>` (i.e. `#include <cstdint>`).
Microsoft Visual Studio's C++ compiler and toolchain (as of Visual Studio 2017) is not a conforming C99 compiler (this is by design); however, as Visual C++ 2012 and higher is a conforming C++11 compiler, you can use `<cstdint>` as intended - it is only Visual Studio 2010 and older (2008, 2005, etc.) that lack the header.
I note that Microsoft endorses the use of the Clang toolchain with Visual Studio if you wish to compile C99 code: https://blogs.msdn.microsoft.com/vcblog/2015/12/04/clang-with-microsoft-codegen-in-vs-2015-update-1/
Java:
If you're using Java then the size of most primitive types is strictly defined (as opposed to C and C++, where types only need to be capable of storing values within a certain minimum range - it's valid for a C compiler to use a 128-bit native integer for `unsigned short`): https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html - I don't believe there is any way to achieve "fast" native integer arithmetic in pure Java.
C# / .NET:
In C# / .NET the story is similar to Java (`System.Byte`, `Int16`, etc. are strictly defined), and in C# the `short`, `int` and `long` keywords are always aliases for `Int16`, `Int32` and `Int64`, unlike in C where their definition is vendor-defined.
.NET does have the platform-sized `System.IntPtr` type, but you cannot perform bitwise or arithmetic operations on it besides addition and subtraction (and the overhead of using the type in the first place would wipe out any performance gains). Note that `IntPtr` is not necessarily word-sized either: `sizeof(void*)` does not have to equal `sizeof(uint_fast32_t)`, because pointers must be large enough to store every possible valid address, yet the native word size could be smaller - consider a CPU with a 32-bit word size but a 64-bit address bus, in which case `sizeof(void*) == 8` but `sizeof(uint_fast32_t) == 4`. The reverse is also possible: a 64-bit-word machine with only a 32-bit address bus (`sizeof(void*) == 4`, `sizeof(uint_fast32_t) == 8`).
Python, JavaScript, Ruby, PHP
These are the other 4 top languages on GitHub - they all share a weakly-typed design and so do not directly expose any "fast int" functionality or typing system (JavaScript doesn't even let you differentiate between `float` and `int` with its `Number` type)... but if you want performance you wouldn't be using these languages :)
Further reading:
- What is uint_fast32_t and why should it be used instead of the regular int and uint32_t?
- Exotic architectures the standards committees care about
- What's the difference between "int" and "int_fast16_t"?
In practice:
In typical desktop software development targeting x86 and x64 you can take many things for granted, but in more exotic environments things tend to be different: consider the NEC VR4300 (based on the MIPS R4300i, and used in the Nintendo 64) - it's capable of native 64-bit operations (i.e. it has a native 64-bit word size), yet 32-bit operations are actually faster than 64-bit operations, and it supports both 32-bit and 40-bit address spaces - which means that had `stdint.h` existed at the time (this was 1995, predating C99) the definitions for "least", "fast", and pointer integer types would be very different to x86.
- I meant actually the opposite: the value is dependent only on the two lowest bytes. I edited the question to fix this. – Elliot Gorokhovsky Jun 09 '17 at 00:14
- Same answer would apply; you are still filling one register at a time for computational purposes. – David C. Rankin Jun 09 '17 at 00:16
- Unfortunately mingw defines `int_fast8_t` as `char`, so the final result might still be not as good as `int` – phuclv Jun 09 '17 at 00:42
- 16-bit integer types are *particularly* slow on x86 (when running in 32-bit or 64-bit mode, obviously) because an operand-size override prefix must be prepended to every CPU instruction that works with 16-bit values. This creates a significant cost, since it bloats the code, puts pressure on the cache, increases decode time, and so forth. In fact, the effect is so significant that most of the time, compilers won't even emit code that works with 16-bit values: they'll promote them all to 32-bit values and work with them there, only truncating to store to memory. – Cody Gray - on strike Jun 09 '17 at 09:06
- (But not all compilers perform this optimization, so you might end up writing slow code.) The only time you should use 16-bit integers is when you are declaring an array or a struct or something else where you are concerned about memory space. The same, however, is not true for 8-bit values (`unsigned char`, etc.). Using those is completely free as far as performance goes on x86. However, there is no performance *advantage* to be had by using 8-bit values instead of 32-bit values, so as your answer says, you might as well just use 32-bit types, unless you are saving space in memory. – Cody Gray - on strike Jun 09 '17 at 09:09
As a rule of thumb, never use smaller-than-`int` types except for storage of large volumes of data where you only need a smaller range, and the larger type would waste lots of memory or hurt cache utilization or such.
On most cpu architectures, including x86, narrow types are at best no faster, and at worst slower, than 32-bit or larger types.

- Thanks. Why are they slower? Do they get padded internally? – Elliot Gorokhovsky Jun 09 '17 at 00:13
- @JohnZwinck: They may be marginally slower even on modern x86 because the code to operate on them is larger (size override prefix) and will exhaust L1 instruction cache sooner. On some older generations they were significantly slower. For example, on the original Pentium they couldn't pipeline. – R.. GitHub STOP HELPING ICE Jun 09 '17 at 13:09
int ifun ( int x )
{
return(x-3);
}
short sfun ( short x )
{
return(x-3);
}
unsigned short ufun ( unsigned short x )
{
return(x-3);
}
this is what we are talking about (ARM):
00000000 <ifun>:
0: e2400003 sub r0, r0, #3
4: e12fff1e bx lr
00000008 <sfun>:
8: e2400003 sub r0, r0, #3
c: e1a00800 lsl r0, r0, #16
10: e1a00840 asr r0, r0, #16
14: e12fff1e bx lr
00000018 <ufun>:
18: e2400003 sub r0, r0, #3
1c: e1a00800 lsl r0, r0, #16
20: e1a00820 lsr r0, r0, #16
24: e12fff1e bx lr
In order to properly honor the high-level code - which in the latter cases is a 16-bit value on a 32-bit target - something has to be done to clip the result. In the int case, values come in as 32 bits and go out as 32 bits, so nothing extra is needed: even though the input to the function may be a short, the caller has already clipped and sign-extended the value to 32 bits, so it can just be used by the next function.
another target (MIPS)
Disassembly of section .text:
00000000 <ifun>:
0: 03e00008 jr $31
4: 2482fffd addiu $2,$4,-3
00000008 <sfun>:
8: 2482fffd addiu $2,$4,-3
c: 00021400 sll $2,$2,0x10
10: 03e00008 jr $31
14: 00021403 sra $2,$2,0x10
00000018 <ufun>:
18: 2482fffd addiu $2,$4,-3
1c: 03e00008 jr $31
20: 3042ffff andi $2,$2,0xffff
the shorter signed value cost more

another target (x86-64)

Disassembly of section .text:
0000000000000000 <ifun>:
0: 8d 47 fd lea -0x3(%rdi),%eax
3: c3 retq
0000000000000010 <sfun>:
10: 8d 47 fd lea -0x3(%rdi),%eax
13: c3 retq
0000000000000020 <ufun>:
20: 8d 47 fd lea -0x3(%rdi),%eax
23: c3 retq
same instructions, the problem has to be solved elsewhere outside the function.
another target (AArch64)
Disassembly of section .text:
0000000000000000 <ifun>:
0: 51000c00 sub w0, w0, #0x3
4: d65f03c0 ret
0000000000000008 <sfun>:
8: 13003c00 sxth w0, w0
c: 51000c00 sub w0, w0, #0x3
10: d65f03c0 ret
0000000000000018 <ufun>:
18: 53003c00 uxth w0, w0
1c: 51000c00 sub w0, w0, #0x3
20: d65f03c0 ret
shorter costs more
a 16-bit target though (PDP-11)
00000000 <_ifun>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 1d40 0004 mov 4(r5), r0
8: 65c0 fffd add $-3, r0
c: 1585 mov (sp)+, r5
e: 0087 rts pc
00000010 <_sfun>:
10: 1166 mov r5, -(sp)
12: 1185 mov sp, r5
14: 1d40 0004 mov 4(r5), r0
18: 65c0 fffd add $-3, r0
1c: 1585 mov (sp)+, r5
1e: 0087 rts pc
00000020 <_ufun>:
20: 1166 mov r5, -(sp)
22: 1185 mov sp, r5
24: 1d40 0004 mov 4(r5), r0
28: 65c0 fffd add $-3, r0
2c: 1585 mov (sp)+, r5
2e: 0087 rts pc
the compiler considers int and short the same size so we don't incur a cost; as we hoped and expected, getting close to or matching the register size avoids the penalty...
same for this 16-bit-int target (AVR)
Disassembly of section .text:
00000000 <ifun>:
0: 03 97 sbiw r24, 0x03 ; 3
2: 08 95 ret
00000004 <sfun>:
4: 03 97 sbiw r24, 0x03 ; 3
6: 08 95 ret
00000008 <ufun>:
8: 03 97 sbiw r24, 0x03 ; 3
a: 08 95 ret
but don't match the target size and
long int ifun ( int x )
{
return(x-3);
}
short sfun ( short x )
{
return(x-3);
}
as expected you incur a penalty.
00000000 <ifun>:
0: 03 97 sbiw r24, 0x03 ; 3
2: 79 2f mov r23, r25
4: 68 2f mov r22, r24
6: 99 0f add r25, r25
8: 88 0b sbc r24, r24
a: 99 0b sbc r25, r25
c: 08 95 ret
0000000e <sfun>:
e: 03 97 sbiw r24, 0x03 ; 3
10: 08 95 ret
although it is up to the author of the backend, there is a reason why the native `int` type is there and why the spec talks about it the way it does. folks like to avoid that these days and use stdint types for some reason, but at a cost. yes, you can lose portability by using native types. performance, portability, maintenance: pick one or two, you can't have all three.
Just examples of what Dai was saying - upvote that answer. You were asking about x86 specifically, and in that case there was a sign penalty vs a size penalty. Except for examples like this, I personally use unsigned unless there is some absolute reason I have to use signed...
EDIT
Bitwise operations help greatly, as does x86 to some extent (reducing or erasing the performance difference). I am just demonstrating the general concept (above) that smaller is not better if "smaller" is relative to the size of the primary operations/registers for that instruction set. For an 8- or 16-bit machine, absolutely, more work is required if you use 32-bit variables, even for bitwise operations. For purely bitwise operations, I am trying to think of cases where the compiler would generate something different between the native size and smaller - something that would matter.
When in doubt just try it and see (compile then disassemble is sometimes all you need).
x86, which you tagged, has some benefits over others; I would need to do more research, because with these simple examples the problem is being pushed off to code outside the function. It can still make a difference though:
unsigned int ufun ( unsigned int x, unsigned int y )
{
return(x*y)+3;
}
unsigned char ucfun ( unsigned char x, unsigned char y )
{
return(x*y)+3;
}
0000000000000000 <ufun>:
0: 0f af fe imul %esi,%edi
3: 8d 47 03 lea 0x3(%rdi),%eax
6: c3 retq
0000000000000010 <ucfun>:
10: 89 f0 mov %esi,%eax
12: 0f af c7 imul %edi,%eax
15: 83 c0 03 add $0x3,%eax
18: c3 retq
so you can't make the generalization that it is neither slower nor faster on x86.
EDIT2
I can't decide if this is a fair bitwise 32/16-bit comparison
typedef struct
{
unsigned hello:5;
} MYSTRUCT;
unsigned int fun1 ( unsigned int x, MYSTRUCT m )
{
return(x<<m.hello);
}
unsigned short fun2 ( unsigned short x, MYSTRUCT m )
{
return(x<<m.hello);
}
00000000 <fun1>:
0: e201101f and r1, r1, #31
4: e1a00110 lsl r0, r0, r1
8: e12fff1e bx lr
0000000c <fun2>:
c: e201101f and r1, r1, #31
10: e1a00110 lsl r0, r0, r1
14: e1a00800 lsl r0, r0, #16
18: e1a00820 lsr r0, r0, #16
1c: e12fff1e bx lr
hmm, yeah okay that was fair.
typedef struct
{
unsigned hello:3;
} MYSTRUCT;
unsigned int fun1 ( unsigned int x, MYSTRUCT m )
{
return(x<<m.hello);
}
unsigned short fun2 ( unsigned short x, MYSTRUCT m )
{
return(x<<m.hello);
}
Disassembly of section .text:
00000000 <fun1>:
0: e2011007 and r1, r1, #7
4: e1a00110 lsl r0, r0, r1
8: e12fff1e bx lr
0000000c <fun2>:
c: e2011007 and r1, r1, #7
10: e1a00110 lsl r0, r0, r1
14: e1a00800 lsl r0, r0, #16
18: e1a00820 lsr r0, r0, #16
1c: e12fff1e bx lr
Interestingly, x86 incurs a penalty...
0000000000000000 <fun1>:
0: 89 f1 mov %esi,%ecx
2: 89 f8 mov %edi,%eax
4: 83 e1 07 and $0x7,%ecx
7: d3 e0 shl %cl,%eax
9: c3 retq
0000000000000010 <fun2>:
10: 89 f1 mov %esi,%ecx
12: 0f b7 c7 movzwl %di,%eax
15: 83 e1 07 and $0x7,%ecx
18: d3 e0 shl %cl,%eax
1a: c3 retq
okay, never mind: ARM masked the bits off as well. the shorter variable took an extra byte of machine code, which gets into that "why use xor rax,rax to zero a register" discussion... less machine code is cheaper and measurable, but you may have to work pretty hard to get it (or not, depending on how tricky you are)

- For multiple cases (for multiple targets) you've included irrelevant padding that isn't executed after the function's end. One example of this is the 64-bit 80x86 code, where all 3 functions are identical and use 2 instructions. – Brendan Jun 09 '17 at 02:32