8

I've been benchmarking an algorithm; it's not necessary to know the details. The main components are a buffer (a raw array of integers) and an indexer (an integer used to access the elements in the buffer).

The fastest types for the buffer seem to be unsigned char and both the signed and unsigned versions of short, int and long. However, char/signed char was slower, by a factor of 1.07x.

For the indexer there was no difference between signed and unsigned types. However, int and long were 1.21x faster than char and short.

Is there a type that should be used by default when considering performance and not memory consumption?

NOTE: The operations used on the elements of the buffer and the indexer were assignment, increment, decrement and comparison.
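For reference, here is a minimal sketch of the kind of timing loop described above. The `BufferType`/`IndexType` aliases, the buffer size and the pass count are my own placeholders, not the actual benchmark; swapping the aliases is how the different type combinations would be compared.

```cpp
#include <cstddef>
#include <cstdio>
#include <ctime>

// Hypothetical aliases -- swap these to compare type combinations.
// (When testing a narrow IndexType such as char, kSize must still be
// representable in it, or the loop condition overflows.)
using BufferType = unsigned char;  // element type of the raw array
using IndexType  = int;            // type used to index into the buffer

int main() {
    constexpr std::size_t kSize = 1 << 20;
    static BufferType buffer[kSize] = {};

    const std::clock_t start = std::clock();

    // Only the operations mentioned in the question: assignment,
    // increment, decrement and comparison.
    for (int pass = 0; pass < 100; ++pass) {
        for (IndexType i = 0; i < static_cast<IndexType>(kSize); ++i) {
            buffer[i] = static_cast<BufferType>(i);
            ++buffer[i];
            --buffer[i];
            if (buffer[i] == static_cast<BufferType>(pass))
                buffer[i] = 0;
        }
    }

    std::printf("clocks: %ld\n", static_cast<long>(std::clock() - start));

    // Print a value so the compiler can't discard the whole loop.
    std::printf("check: %d\n", static_cast<int>(buffer[0]));
}
```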

Mysticial
NFRCR
  • 1
    How are you measuring this? Is it on a system with next to no other processes running? Are you counting it using timers, or are you using a JTAG connection to a dev board and counting CPU cycles? – Cloud Apr 04 '12 at 16:52
  • 2
    Yes, it's important to know the details, because you're probably actually measuring memory bandwidth and type conversion at some point. – Jem Apr 04 '12 at 16:53
  • 8
    Have a look at `stdint.h`. You might be interested in the `int_fast32_t` type. (or whatever size you prefer) – Mysticial Apr 04 '12 at 16:53
  • I'm using std::clock() to measure the clocks taken to run the algorithm. The algorithm runs long enough for the results obtained with std::clock() to be valid. – NFRCR Apr 04 '12 at 16:55
  • Is your algorithm multi-threaded? – Branko Dimitrijevic Apr 04 '12 at 17:36

5 Answers

9

Generally the biggest win comes from caching.

If your data values are small enough to fit in 8 bits, then you can fit more of the data in the CPU cache than if you used ints and wasted 3 bytes per value. If you are processing a block of data, you get a huge speed advantage from cache hits.

The type of the index is less important; as long as it fits in a CPU register (i.e. don't try using a long long on an 8-bit CPU), it will have the same speed.

Edit: it's also worth mentioning that measuring speed is tricky. You need to run the algorithm several times to allow for caching, watch what else is running on the CPU, and even consider what other hardware might be interrupting. Speed differences of 10% might be considered noise unless you are very careful.
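To illustrate the cache-footprint point, here is a rough sketch (the `time_walk` helper and the sizes are made up for illustration) that times the same walk over buffers of 1-byte and 4-byte elements; the byte version touches a quarter of the memory:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <vector>

// Walk a buffer using only increment and comparison, and return the
// clocks taken. The element type is a template parameter so the same
// code can be timed with 1-, 2-, 4- and 8-byte elements.
template <typename Elem>
long time_walk(std::size_t n, int passes) {
    std::vector<Elem> buf(n, Elem{0});
    const std::clock_t start = std::clock();
    for (int p = 0; p < passes; ++p) {
        for (std::size_t i = 0; i < n; ++i) {
            ++buf[i];
            if (buf[i] == Elem{100})
                buf[i] = Elem{0};
        }
    }
    return static_cast<long>(std::clock() - start);
}

int main() {
    const std::size_t n = std::size_t{1} << 24;  // 16M elements: 16 MiB as bytes, 64 MiB as int32
    const int passes = 10;
    std::printf("uint8_t : %ld clocks\n", time_walk<std::uint8_t>(n, passes));
    std::printf("int32_t : %ld clocks\n", time_walk<std::int32_t>(n, passes));
}
```

As the edit above says, run it several times and treat differences of a few percent as noise.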

ForceBru
Martin Beckett
  • Could it be that for the indexer unsigned int is faster than unsigned short, because on x86 the array [] operator expects an unsigned int? – NFRCR Apr 04 '12 at 16:57
  • 1
    The OP is noticing that the larger datatypes are faster. Wouldn't that work against your argument that the difference is caused by cache? – Mysticial Apr 04 '12 at 17:02
  • 1
    @Mysticial "The fastest types for the buffer seem to be unsigned char". The OP is also talking about 7% differences; unless you are really careful, this is difficult to measure on a general-purpose OS+PC – Martin Beckett Apr 04 '12 at 17:06
  • 1
    @NFRCR - any decent compiler will handle indexing arrays very well, anything that matters is done at compile time. – Martin Beckett Apr 04 '12 at 17:07
  • Oh right, I overlooked the first part of the question. I retract my statement. :) – Mysticial Apr 04 '12 at 17:10
  • If I read the question correctly, we have the following situation: `unsigned char` < `int` < `signed char` (where "<" means "faster"). This can't be explained by caching. Your point about measurement is valid though. – Branko Dimitrijevic Apr 04 '12 at 17:34
  • Once I measured a 4x speedup by replacing char with int in a CRC-8 implementation (32-bit system). I think the native type (with size == size of pointer) is fastest in cases where no caching is involved – Andriy Tylychko Jul 15 '15 at 17:11
2

It depends heavily on the underlying architecture. Usually the fastest data types are those that are word-wide. In my experience with IA-32 (x86-32), data types smaller or bigger than the word size incur penalties, sometimes even more than one memory read for a single value.

Once in CPU registers, the data type's length usually doesn't matter (provided the whole value fits in one register); what matters is which operations you perform on it. Of course, floating-point operations are the most costly; the fastest are addition, subtraction (which also covers comparison), bit-wise operations (shifts and the like), and logical operations (and, or, ...).
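As a rough illustration (this snippet is mine, not part of the answer), you can check what "word-wide" means on a given machine and see that arithmetic on sub-word types is promoted to int anyway:

```cpp
#include <cstdio>

int main() {
    // What "word-wide" means on this machine.
    std::printf("sizeof(short) = %zu\n", sizeof(short));
    std::printf("sizeof(int)   = %zu\n", sizeof(int));
    std::printf("sizeof(long)  = %zu\n", sizeof(long));
    std::printf("sizeof(void*) = %zu\n", sizeof(void*));  // usually the register width

    // Integer promotion: arithmetic on sub-word types is done at int
    // width anyway, so the narrow type mostly affects storage, not the
    // operation itself.
    unsigned char a = 200, b = 100;
    int sum = a + b;  // a and b are promoted to int before the addition
    std::printf("a + b = %d\n", sum);
}
```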

m0skit0
  • Generally use `int` for local temporary variables (unless you need them to be wider in 64bit, then use `size_t` or whatever). Loading/storing to char or short is near-free. (`movzx` / `movsx` loads don't take an ALU uop at all, and are handled in the load port.) So arrays should use narrow types to minimize cache consumption. – Peter Cordes Mar 01 '16 at 04:35
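A small sketch of the pattern described in the comment above (the names are mine): keep the stored data narrow to save cache, but use word-sized locals for the index and the accumulator.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Narrow storage (uint8_t) to minimize cache consumption; word-sized
// index and accumulator so the loop itself needs no extra conversions.
long sum_bytes(const std::vector<std::uint8_t>& data) {
    long total = 0;                      // word-sized accumulator
    for (std::size_t i = 0; i < data.size(); ++i) {
        total += data[i];                // zero-extending load, then word-sized add
    }
    return total;
}
```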
1

There are no promises about which type is faster or slower. int is supposed to represent the natural word length of the machine, whatever that might mean, so it might go faster. Or slower, depending upon other factors.

Robᵩ
  • Can you please comment some of those factors that might make word-wide types slower? – m0skit0 Apr 04 '12 at 18:10
  • @m0skit0: Bus speed, caching, or if the CPU can't natively handle that type; probably other factors. – Mooing Duck Apr 04 '12 at 18:13
  • Bus speed and cache should be optimized for word-size types in a well designed architecture. The CPU must excel at that data type precisely because it is the word-size type. – m0skit0 Apr 05 '12 at 10:52
  • @m0skit0: Bus speed and cache are often optimized for multiples of the word-size types rather than actually the word-size type, and even if it were 1:1, using data types that are half the word size causes them to load twice as fast through the cache/bus. On my pc, I can load ~8 `chars` in about the same speed as a single `int`. – Mooing Duck Sep 21 '17 at 22:29
0

The `<cstdint>` header provides typedefs of fundamental integral types or extended integral types. Have a look at the "fast" variants (`int_fastN_t`/`uint_fastN_t`); there are fast equivalents for the other widths (including the char-sized 8-bit one) as well.

My suggestion: `uint_fast8_t`.

Reference: http://www.cplusplus.com/reference/cstdint/

You may also need to know the architecture of the machine you are using.
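A minimal sketch of that suggestion (the names and sizes are placeholders): declare the buffer elements and the indexer with the "fast" typedefs and let the implementation choose the underlying widths.

```cpp
#include <cstdint>

// uint_fast8_t : the fastest unsigned type holding at least 8 bits.
// int_fast32_t : the fastest signed type holding at least 32 bits.
// The implementation picks the actual widths for the target machine.
static std::uint_fast8_t buffer[1024] = {};

int main() {
    for (std::int_fast32_t idx = 0; idx < 1024; ++idx) {
        ++buffer[idx];
    }
    return static_cast<int>(buffer[0]);  // keep the loop observable
}
```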

rulf
-1

As was said, int in most cases represents the machine word. So int has the same length as a processor register, and no additional work is needed to move an int into a register and then back to RAM.

If you use char, on the other hand, it is 4 times smaller (on x86 systems) than int and also 4 times smaller than a processor register. So before it is written back to RAM it has to be truncated, and as a result more time is used.

Furthermore, a processor with 32-bit registers can't perform operations on 8-bit numbers directly. If a char is added to a char, both are put into registers, so each register holds 8 bits of char value and 24 bits of garbage. Two 32-bit values are added and the result is then truncated back to 8 bits. The reason char and short take the same time is that the same number of additional operations is needed, while for int no additional operations are done.

I would like to add that, to the processor, int and unsigned int are completely the same, as it treats them in the same way. For some compilers int and long int may also be the same.

So the fastest integer type is the one whose length matches the machine word. If you use types smaller than the machine word, the program will run slower.

Seagull
  • 6
    None of this is actually true on most modern machines. Machine words are 64 bits, but `int` is 32. Intel architecture has instructions to load and store bytes, and to operate on bytes, so truncation has 0 cost. Intel architecture also supports direct operations on bytes. And the "fastest integer type" will depend on what you're doing; on a modern machine, memory accesses and locality generally play a predominant role. – James Kanze Apr 04 '12 at 18:06
  • `movzx` / `movsx` zero or sign-extending loads to integer registers are nearly free. On Intel CPUs, they don't even take an ALU uop; it's all done in the load unit. And even if you do have high garbage in a register, [that doesn't stop you from doing an add/sub or whatever and then storing the low 8 of the result](http://stackoverflow.com/questions/34377711/which-2s-complement-integer-operations-can-be-used-without-zeroing-high-bits-in), which won't be affected by high garbage. – Peter Cordes Mar 01 '16 at 04:30