
I want to load 16-bit unsigned integers from an array and use these values in 32-bit unsigned calculations in C++. I have the choice between storing the values as a 16-bit array (less memory) or as a 32-bit array (more memory consumption).

My code should compile with common C++ compilers and run on as many architectures as possible. Doing performance measurements and reading the generated assembly for many of these combinations will be difficult, so I am asking for a theoretical examination.

In other words: Under which conditions does a 16-bit to 32-bit unsigned integer conversion usually consume CPU cycles? When can I expect to use the memory-reduced 16-bit array without losing CPU cycles?
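
For concreteness, a minimal sketch of the two options I am weighing (function and variable names are made up for illustration):

```cpp
#include <cstddef>
#include <cstdint>

// Option A: store 16-bit values, widen at the point of use.
std::uint32_t sum16(const std::uint16_t* values, std::size_t n)
{
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += values[i];   // value is zero-extended to 32 bits before the addition
    return sum;
}

// Option B: store 32-bit values, no conversion but twice the memory.
std::uint32_t sum32(const std::uint32_t* values, std::size_t n)
{
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += values[i];
    return sum;
}
```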

Silicomancer
  • Generally it is best to rely on compiler vendors to do this research for you across runtime targets. Have a look at the `uint_fastN_t` and `uint_leastN_t` types at https://en.cppreference.com/w/cpp/types/integer – Richard Mar 12 '19 at 00:03
  • Conversion between short and int requires *no* CPU cycles. Just load it from memory with the proper type and it will be in your register with the register size. It may even be faster to use shorts because memory bandwidth is reduced. – Alain Merigot Mar 12 '19 at 00:47
  • @Richard: The `fast` types aren't good; there's little agreement on what `uint_fast16_t` is fast *for*. E.g. **on x86-64 System V `uint_fast16_t` is a 64-bit type**, perhaps to avoid extra zero- / sign-extension instructions when indexing an array with it, if it was a function arg or return value (already in a register). But making an array of `uint_fast16_t` would waste lots of cache footprint and have nearly zero benefit, because loads can already zero-extend for free. `unsigned short` is your best bet; a hypothetical machine with slow 16-bit access might have a 32-bit `short` but a 16-bit `uint_least16_t`. (A quick `sizeof` check is sketched after these comments.) – Peter Cordes Mar 12 '19 at 01:27
  • @AlainMerigot that'll be different if the values are in registers, where some sign/zero extension or truncation would be needed – phuclv Mar 12 '19 at 02:11
  • @Peter: I agree. Using fast types doesn't help. My question is about saving memory without losing CPU cycles. The fast types probably will not save memory; they may even waste some. – Silicomancer Mar 12 '19 at 08:21
  • @Silicomancer You were asking for code that is compatible with common C++ compilers and runs on as many architectures as possible. Portability is often the antithesis of performance and requires a decision. Sometimes performance features conflict with each other as well. If you truly want to write portable code, you should be aware of the contents of this question and its answer: https://stackoverflow.com/questions/8500677/what-is-uint-fast32-t-and-why-should-it-be-used-instead-of-the-regular-int-and-u – Richard Mar 13 '19 at 21:31
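
To see how the `fast`/`least` widths differ across targets (the x86-64 System V result is the one mentioned in the comments above), a quick check one can compile on the platforms of interest:

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // On x86-64 System V (glibc), uint_fast16_t is typically 8 bytes,
    // while uint16_t, uint_least16_t and unsigned short are 2 bytes.
    std::printf("uint16_t:       %zu\n", sizeof(std::uint16_t));
    std::printf("uint_least16_t: %zu\n", sizeof(std::uint_least16_t));
    std::printf("uint_fast16_t:  %zu\n", sizeof(std::uint_fast16_t));
    std::printf("unsigned short: %zu\n", sizeof(unsigned short));
}
```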

1 Answer


I think all major architectures support loads from memory with sign extension and zero extension. x86, ARM and MIPS definitely have such load instructions. Old architectures and primitive microcontrollers, especially 8-bit and 16-bit ones, may not have them and may therefore require multiple instructions to achieve the same result. Since you aren't mentioning those, you probably don't really care about them. So, just write portable C/C++ code and be done with it.
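
As a concrete illustration of this point (loop and names invented for illustration), a typical compiler turns the 16-bit load below into a single zero-extending load, e.g. `movzx` on x86, `ldrh` on ARM or `lhu` on MIPS, so no separate conversion instruction is spent (vectorization aside):

```cpp
#include <cstddef>
#include <cstdint>

// The widening happens as part of the load itself on the architectures
// named above; there is no extra conversion instruction in the loop body.
std::uint32_t weighted_sum(const std::uint16_t* src, std::size_t n, std::uint32_t weight)
{
    std::uint32_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += weight * std::uint32_t{src[i]};  // zero-extend 16 -> 32 bits, then 32-bit math
    return total;
}
```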

Alexey Frunze
  • 8-bit and 16-bit CPUs need multiple instructions anyway to work with 32-bit integers at all. Working with 16-bit source data will *definitely* help there with a good compiler, e.g. using an immediate `0` for the upper half instead of loading it from memory and using up an extra register. The only issue might be 8->16 zero extension on a 16-bit ISA. – Peter Cordes Mar 12 '19 at 03:59
  • @PeterCordes Yep, 8-bit and 16-bit CPUs are screwed for anything but simple addition/subtraction and and/or/xor/not. – Alexey Frunze Mar 12 '19 at 04:01
  • @Alexey: Well, yes, I do care about 8- and 16-bit controllers. But as Peter already mentioned, such controllers will have to do multi-instruction operations anyway. – Silicomancer Mar 12 '19 at 08:09
  • @Peter: Do you think common embedded compilers are able to do that optimization? – Silicomancer Mar 12 '19 at 08:10
  • @Silicomancer: For GCC `-O3` targeting AVR, yes, try it on Godbolt. I've heard bad things about vendor compilers for some other architectures, but this is pretty basic. In any case, an array of `uint16_t` is at least as good for 8- or 16-bit micros as an array of `uint32_t`; if they do choose to zero a register instead of using an immediate, that's still cheaper than loading. Anyway, I wouldn't worry too much about it; most code that runs on microcontrollers is written specifically for them, so you shouldn't worry too much about someone just blindly compiling your code on a microcontroller. – Peter Cordes Mar 12 '19 at 08:17