38

Often I find myself having to represent a structure that consists of very small values. For example, Foo has 4 values, a, b, c, d, that range from 0 to 3. Usually I don't care, but sometimes, those structures are

  1. used in a tight loop;

  2. their values are read a billion times/s, and that is the bottleneck of the program;

  3. the whole program consists of a big array of billions of Foos;

In that case, I find myself having trouble deciding how to represent Foo efficiently. I have basically 4 options:

struct Foo {
    int a;
    int b;
    int c;
    int d;
};

struct Foo {
    char a;
    char b;
    char c;
    char d;
};

struct Foo {
    char abcd;
};

struct FourFoos {
    int abcd_abcd_abcd_abcd;
};

They use 128, 32, 8, and 8 bits per Foo, respectively, ranging from sparse to densely packed. The first option is probably the most natural one, but using it would essentially increase memory usage 16-fold, which doesn't sound quite right. Moreover, most of that memory would be filled with zeroes and not used at all, which makes me wonder if that isn't a waste. On the other hand, packing them densely brings additional overhead when reading them.
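
For instance, reading a field back out of the packed char costs a shift and a mask. A sketch, assuming a and b sit in bits 0-1 and 2-3:

int a = foo.abcd & 3;         /* a: mask only */
int b = (foo.abcd >> 2) & 3;  /* b: shift, then mask */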

What is the computationally 'fastest' method for representing small values in a struct?

MaiaVictor
  • 51,090
  • 44
  • 144
  • 286
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/84791/discussion-on-question-by-viclib-what-is-the-most-efficient-way-to-represent-sma). – George Stocker Jul 31 '15 at 11:24
  • 2
    As a few answers have pointed out; you haven't given us enough information. You seem to contradict yourself "Time efficiency is the goal" "but reading is important too." Pick what you mean by 'most efficient'. Once you do that; you've rightly found that you should benchmark, and that should be your next step. Asking our opinion without giving us the facts we need to form that opinion makes this an off-topic question. Give us the facts we need if you'd like for us to be able to answer your question. – George Stocker Jul 31 '15 at 11:31
  • 4
    @GeorgeStocker I don't agree with this being put on hold. Although some of the answers have been opinion based, this is not an intrinsic quality of the question. dbush suggested a technique that the OP didn't think of, I posted benchmarks and the idea of writing the code generically. Various others have posted useful information about CPUs. I think there are a few good answers that are not primarily opinion-based, and this alone shows that the question is not primarily opinion-based. – Nir Friedman Jul 31 '15 at 14:11
  • @NirFriedman Your best bet to get community consensus to re-open is either [meta](http://meta.stackoverflow.com) or the [C++ Chat room](https://chat.stackoverflow.com/rooms/10/loungec). – George Stocker Jul 31 '15 at 14:16
  • Hey Vic, I see the C++ tag is gone. Did you do that? I was about to follow George's advice, and then saw there was no C++ tag anymore. I would think if you are using a C++ compiler and can therefore avail yourself of C++ features, C++ would be the more appropriate tag. – Nir Friedman Aug 02 '15 at 04:48
  • I had the C++ tag because I was interested in C++ answers too, but seems like someone removed it... I'm not qualified to dispute. – MaiaVictor Aug 02 '15 at 07:33
  • 1
    Do you usually want all of the values in the struct at the same time? Or do you make a pass where you only care about `a`, or a pass where you only care about `d` in half of the structs? – rob mayoff Dec 09 '15 at 20:36
  • None of the answers seem to mention the most obvious solution: pack the four values into an `unsigned char` – M.M Dec 16 '15 at 19:58

15 Answers

34

For dense packing that doesn't incur a large overhead of reading, I'd recommend a struct with bitfields. In your example where you have four values ranging from 0 to 3, you'd define the struct as follows:

struct Foo {
    unsigned char a:2;
    unsigned char b:2;
    unsigned char c:2;
    unsigned char d:2;
};

This has a size of 1 byte, and the fields can be accessed simply, i.e. foo.a, foo.b, etc.

By making your struct more densely packed, that should help with cache efficiency.
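
Usage is then straightforward (a quick sketch):

struct Foo f = {0};
f.a = 3;                   /* the compiler emits the shift/mask for you */
unsigned sum = f.a + f.b;  /* reads look like ordinary member access */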

Edit:

To summarize the comments:

There's still bit fiddling happening with a bitfield, however it's done by the compiler and will most likely be more efficient than what you would write by hand (not to mention it makes your source code more concise and less prone to introducing bugs). And given the large number of structs you'll be dealing with, the reduction of cache misses gained by using a packed struct such as this will likely make up for the overhead of bit manipulation the struct imposes.

dbush
  • 205,898
  • 23
  • 218
  • 273
  • 2
    Depending on compiler, bitfields often do introduce the "overhead of reading" - the only difference is that the compiler generates the bit fiddling code rather than the programmer doing it by hand, and the compiler is able to be tuned better for the particular target machine. – Peter Jul 30 '15 at 19:11
  • 1
    I like this method the best but I'd definitely do some profiling and testing to make sure there's not too much of a speed hit. – rost0031 Jul 30 '15 at 19:15
  • 1
    Uhh bitfields require [quite a few considerations overhead-wise](http://blogs.msdn.com/b/oldnewthing/archive/2008/11/26/9143050.aspx)... – Qix - MONICA WAS MISTREATED Jul 30 '15 at 19:29
  • Unless the platform's processor has registers that are 2-bits wide, this will require extra effort to extract the bits from the structure, whether the struct is in an 8-bit register or in memory. – Thomas Matthews Jul 30 '15 at 19:31
  • 9
    with "billions of Foos" it might make up performance wise in cache efficiency, a cache miss is much more expensive than the few instuctions to align the bits – Glenn Teitelbaum Jul 30 '15 at 20:01
  • 4
    To build on that point: a single cache miss might typically cost on the order of 200 cycles. Even an L1/L2 miss to L3 cache can cost I think around 20 cycles. If you are cache missing 1% of the time, then most of your time is spent cache missing. On the flip side, the extra cycles to do the bit twiddling may have lower-than-apparent cost, if extracting multiple independent variables the processor may be able to retire multiple instructions per cycle. – Nir Friedman Jul 30 '15 at 21:43
  • There are processors where reading a bit is one instruction. Though they're typically microcontrollers and nothing in the x86 family does it. – slebetman Jul 31 '15 at 02:32
  • @slebetman there is the [bit test](https://en.wikipedia.org/wiki/Bit_Test) instruction family in x86 with `BT`, `BTS`, `BTR`, `BTC`. But in the OP's case the values are 2 bits, not 1, so those instructions cannot be used – phuclv Jul 31 '15 at 06:24
  • @LưuVĩnhPhúc: Ah, I didn't know that. They can still be used, it's just that they'll be 2 executed/3 written instructions instead of 1. Still quite fast. – slebetman Jul 31 '15 at 06:53
  • **The best reason to do it like this:** If you want to check performance against a regular `char a;char b;char c;` struct, you only have to remove the :2 and recompile - no rewriting necessary, so you can always benchmark both versions and even switch when other factors make one of the two faster.... – Falco Jul 31 '15 at 11:06
  • Although that uses only eight bits of storage for its members, it is by no means assured that the struct representation will be only one byte. Implementations may choose a storage unit larger than one byte to hold the bits, and they may, independently, choose to include padding in the struct representation. FWIW, however, GCC does neither with this particular structure. – John Bollinger Dec 09 '15 at 20:36
20

Pack them only if space is a consideration - for example, an array of 1,000,000 structs. Otherwise, the code needed to do shifting and masking is greater than the savings in space for the data. Hence you are more likely to have a cache miss on the I-cache than the D-cache.

stark
  • 12,615
  • 3
  • 33
  • 50
  • 1
    Simple and concise answer. This is usually the golden catch-all rule of bitfields. – Qix - MONICA WAS MISTREATED Jul 30 '15 at 19:30
  • 5
    This is common knowledge that I have discovered is often incorrect. http://stackoverflow.com/questions/16738579/is-using-a-vector-of-boolean-values-slower-than-a-dynamic-bitset. If you look at the benchmarks with the sieve of Eratosthenes, you will see that `vector<bool>` beats `vector<char>` handily at any size that does not entirely fit in cache. To quote Walter Bright: measuring gives you a leg up on experts who are too good to measure. – Nir Friedman Jul 30 '15 at 21:11
  • 1
    As I said, pack them if data size is a concern. The problem I have is people who pack a single struct - where the code to do read-modify-write ends up being much larger than the space that is saved. The test you linked to is on 100M elements. – stark Jul 30 '15 at 21:27
  • What you wrote sounds to me entirely like you are saying: pack it only if you want to save space, as opposed to: pack it to save time, if it is really big. The user did say he has billions of structs, but I'm guessing he added that after you answered. – Nir Friedman Jul 30 '15 at 21:40
  • 3
    Most processors have a "load byte" instruction that executes in the same time as "load int", so the second form (`char`s, no bit-packing) should always be an improvement over the first one (`int`s, no bit-packing). – user253751 Jul 30 '15 at 23:09
  • @NirFriedman great resource, thank you. The information was actually there since the beginning, stark likely missed it, but it is OK. – MaiaVictor Jul 30 '15 at 23:12
  • @immibis That confirms my intuition, I would have guessed that there was no way the int version would be better than the char version. – Nir Friedman Jul 31 '15 at 00:14
  • Pack for storage, use `int` for temporaries. On x86, loading with `movzx` or `movsx` (sign/zero extend) is just as fast as plain `mov` of a byte or 32bit int. Going to have to downvote this answer, because loading/storing between 8bit struct members and 32bit temporaries is NOT slower. (at least on x86.) – Peter Cordes Jul 31 '15 at 01:57
  • @PeterCordes: You misunderstand. He's saying the code to read `a` from `char a` in a struct is faster than the code to read `a` from `char abcd` or `int abcd_abcd_abcd_abcd` in a struct because you don't have to do `<<` and `>>` to get `a`. But apparently Nir Friedman gave us a link that shows that at least sometimes `char abcd` can be as fast or faster than `char a`. – slebetman Jul 31 '15 at 02:29
  • @slebetman: Hmm, hopefully that's what he means. I took the blanket recommendation against packing as suggesting that storing each field as a full `int` would be good. I posted an answer that expands some on how `[u]int8_t` in memory but `int` for temporaries is optimal. And summarized some of what others have posted about how bitfields are great if you mostly just need to test / set bits, rather than unpack a 2-bit field to an int and multiply it or pass it to a function call. They can still be good since CPU instructions are cheaper than cache misses. Still, this isn't the best answer. – Peter Cordes Jul 31 '15 at 02:47
  • @immibis if you use types smaller than the native register size then the compiler would need to sign/zero extend and mask out the high bytes when needed, which is slower than using int. Only when the bigger size introduces more cache misses will using a smaller type be faster – phuclv Jul 31 '15 at 04:09
  • If you really need efficiency don't forget there are intrinsics like `_bittest` for dealing with single bits. I'm not 100% sure if they're faster than manually writing the shifts, but they probably are, and they can result in fewer/shorter instructions. – user541686 Jul 31 '15 at 07:59
  • OP stated "billions of Foos" so space (D-cache) will be a more critical resource than code size (I-cache) – Glenn Teitelbaum Jul 31 '15 at 15:13
  • "billions of foos" was a later edit. The original question just said "a big array". – stark Dec 14 '15 at 22:53
11

There is no definitive answer, and you haven't given enough information to allow a "right" choice to be made. There are trade-offs.

Your statement that your "primary goal is time efficiency" is insufficient, since you haven't specified whether I/O time (e.g. to read data from file) is more of a concern than computational efficiency (e.g. how long some set of computations take after a user hits a "Go" button).

So it might be appropriate to write the data as a single char (to reduce time to read or write) but unpack it into an array of four int (so subsequent calculations go faster).

Also, there is no guarantee that an int is 32 bits (which you have assumed in your statement that the first packing uses 128 bits). An int can be 16 bits.

Peter
  • 35,646
  • 4
  • 32
  • 74
  • (Answering your question: by time efficiency I mean "how long the computation takes after a user hits a Go button". The program is a black box; you click play and it keeps shifting bits until it finds an answer.) – MaiaVictor Jul 30 '15 at 19:35
  • It's unlikely that unpacking to an array of `int` would be useful. Loading single fields to local temp `int` variables before use is a good idea, though, unless you need 8bit integer overflow. – Peter Cordes Jul 31 '15 at 03:10
9

Foo has 4 values, a, b, c, d that, range from 0 to 3. Usually I don't care, but sometimes, those structures are ...

There is another option: since the values 0 ... 3 likely indicate some sort of state, you could consider using "flags"

enum{
  A_1 = 1<<0,
  A_2 = 1<<1,
  A_3 = A_1|A_2,
  B_1 = 1<<2,
  B_2 = 1<<3,
  B_3 = B_1|B_2, 
  C_1 = 1<<4,
  C_2 = 1<<5,
  C_3 = C_1|C_2,
  D_1 = 1<<6,
  D_2 = 1<<7,
  D_3 = D_1|D_2,
  //you could continue to  ... D7_3 for 32/64 bits if it makes sense
};

This isn't much different than using bitfields for most situations, but can drastically reduce your conditional logic.

if ( a < 2 && b < 2 && c < 2 && d < 2) // .... (4 comparisons)
//vs.
if ( (abcd & (A_2|B_2|C_2|D_2)) == 0 ) //(bitop with constant and a 0-compare)

Depending on what kinds of operations you will be doing on the data, it may make sense to use either 4 or 8 sets of abcd and pad out the end with 0s as needed. That could allow up to 32 comparisons to be replaced with a bitop and a 0-compare. For instance, if you wanted to set the "1 bit" on all 8 sets of 4 in a 64-bit variable, you could do uint64_t abcd8 = 0x5555555555555555ULL;, and to then set all the 2 bits you could do abcd8 |= 0xAAAAAAAAAAAAAAAAULL;, making every value 3.
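
Putting that together (a sketch, using the constants just described):

uint64_t abcd8 = 0x5555555555555555ULL; /* every 2-bit field set to 1 */
abcd8 |= 0xAAAAAAAAAAAAAAAAULL;         /* every field now 3 */
/* one bitop + 0-compare: is any of the 32 fields >= 2? */
int any_ge2 = (abcd8 & 0xAAAAAAAAAAAAAAAAULL) != 0;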


Addendum: On further consideration, you could use a union as your type and either do a union with char and @dbush's bitfields (these flag operations would still work on the unsigned char) or use char types for each a,b,c,d and union them with unsigned int. This would allow both a compact representation and efficient operations depending on what union member you use.

union Foo {
  char abcd; //Note: you can use flags and bitops on this too
  struct {
    unsigned char a:2;
    unsigned char b:2;
    unsigned char c:2;
    unsigned char d:2;
  };
};

Or even extended further

union Foo {
  uint64_t abcd8;  //Note: you can use flags and bitops on these too
  uint32_t abcd4[2];
  uint16_t abcd2[4];
  uint8_t  abcd[8];
  struct {
    unsigned char a:2;
    unsigned char b:2;
    unsigned char c:2;
    unsigned char d:2;
  } _[8];
};
union Foo myfoo = {0xFFFFFFFFFFFFFFFFULL};
//assert(myfoo._[0].a == 3 && myfoo.abcd[0] == 0xFF);

This method does introduce some endianness differences, which would also be a problem if you use a union to cover any other combination of your other methods.

union Foo {
  uint32_t abcd;
  uint32_t dcba; //only here for endian purposes
  struct { //anonymous struct
    char a;
    char b;
    char c;
    char d;
  };
};

You could experiment and measure with different union types and algorithms to see which parts of the unions are worth keeping, then discard the ones that are not useful. You may find that operating on several char/short/int types simultaneously gets automatically optimized to some combination of AVX/simd instructions whereas using bitfields does not unless you manually unroll them... there is no way to know until you test and measure them.

technosaurus
  • 7,676
  • 1
  • 30
  • 52
9

Fitting your data set in cache is critical. Smaller is always better, because hyperthreading competitively shares the per-core caches between the hardware threads (on Intel CPUs). Comments on this answer include some numbers for costs of cache misses.

On x86, loading 8bit values with sign or zero-extension into 32 or 64bit registers (movzx or movsx) is literally just as fast as plain mov of a byte or 32bit dword. Storing the low byte of a 32bit register also has no overhead. (See Agner Fog's instruction tables and C / asm optimization guides here).

Still x86-specific: [u]int8_t temporaries are ok, too, but avoid [u]int16_t temporaries. (load/store from/to [u]int16_t in memory is fine, but working with 16bit values in registers has big penalties from the operand-size prefix decoding slowly on Intel CPUs.) 32bit temporaries will be faster if you want to use them as an array index. (Using 8bit registers doesn't zero the high 24/56 bits, so it takes an extra instruction to zero- or sign-extend before using an 8bit register as an array index, or in an expression with a wider type, such as adding it to an int.)

I'm unsure what ARM or other architectures can do as far as efficient zero/sign extension from single-byte loads, or for single-byte stores.

Given this, my recommendation is pack for storage, use int for temporaries. (Or long, but that will increase code size slightly on x86-64, because a REX prefix is needed to specify a 64bit operand size.) e.g.

int a_i = foo[i].a;
int b_i = foo[i].b;
...;
foo[i].a = a_i + b_i;

bitfields

Packing into bitfields will have more overhead, but can still be worth it. Testing a compile-time-constant-bit-position (or multiple bits) in a byte or 32/64bit chunk of memory is fast. If you actually need to unpack some bitfields into ints and pass them to a non-inline function call or something, that will take a couple extra instructions to shift and mask. If this gives even a small reduction in cache misses, this can be worth it.

Testing, setting (to 1) or clearing (to 0) a bit or group of bits can be done efficiently with OR or AND, but assigning an unknown boolean value to a bitfield takes more instructions to merge the new bits with the bits for other fields. This can significantly bloat code if you assign a variable to a bitfield very often. So using int foo:6 and things like that in your structs, because you know foo doesn't need the top two bits, is not likely to be helpful. If you're not saving many bits compared to putting each thing in its own byte/short/int, then the reduction in cache misses won't outweigh the extra instructions (which can add up into I-cache / uop-cache misses, as well as the direct extra latency and work of the instructions.)
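
For illustration, that read-modify-write merge written out by hand looks roughly like this (a sketch with hypothetical names, assuming the field occupies bits 0-1):

unsigned char byte = p->packed;                        /* load the containing byte */
byte = (unsigned char)((byte & ~0x03u) | (a & 0x03u)); /* clear the field, merge new a */
p->packed = byte;                                      /* store it back */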

The x86 BMI1 / BMI2 (Bit-Manipulation) instruction-set extensions will make copying data from a register into some destination bits (without clobbering the surrounding bits) more efficient. BMI1: Haswell, Piledriver. BMI2: Haswell, Excavator (unreleased). Note that like SSE/AVX, this will mean you'd need BMI versions of your functions, and fallback non-BMI versions for CPUs that don't support those instructions. AFAIK, compilers don't have options to see patterns for these instructions and use them automatically. They're only usable via intrinsics (or asm).

As in dbush's answer, packing into bitfields is probably a good choice, depending on how you use your fields. Your fourth option (of packing four separate abcd values into one struct) is probably a mistake, unless you can do something useful with four sequential abcd values (vector-style).

code generically, try both ways

For a data structure your code uses extensively, it makes sense to set things up so you can flip from one implementation to another, and benchmark. Nir Friedman's answer, with getters/setters is a good way to go. However, just using int temporaries and working with the fields as separate members of the struct should work fine. It's up to the compiler to generate code to test the right bits of a byte, for packed bitfields.

prepare for SIMD, if warranted

If you have any code that checks just one or a couple fields of each struct, esp. looping over sequential struct values, then the struct-of-arrays answer given by cmaster will be useful. x86 vector instructions have a single byte as the smallest granularity, so a struct-of-arrays with each value in a separate byte would let you quickly scan for the first element where a == something, using PCMPEQB / PTEST.
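
As a concrete sketch of that scan with SSE2 intrinsics (the function and names here are mine, not from the answer; `__builtin_ctz` is GCC/Clang-specific):

#include <emmintrin.h>   /* SSE2 */
#include <stddef.h>

/* return the index of the first byte equal to needle, or n if none */
static size_t find_first(const unsigned char *a, size_t n, unsigned char needle)
{
    const __m128i v = _mm_set1_epi8((char)needle);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(a + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, v)); /* PCMPEQB + PMOVMSKB */
        if (mask)
            return i + (size_t)__builtin_ctz((unsigned)mask);
    }
    for (; i < n; i++)
        if (a[i] == needle)
            return i;
    return n;
}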

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • I'm still not sure I follow the separate abcd reasoning. Does that apply even if `abcd` values are very likely to be accessed together? That is, if I accessed `b` I'm probably reading `c` and `d` too shortly after. Great answer by the way. – MaiaVictor Jul 31 '15 at 02:46
  • If you usually will need `c` and `d` after accessing `b`, then storing them together is the way to go. The only exception would be if you're accessing them sequentially, and can do something clever with vectors (like if you can do a packed-compare to find `Foo`s where `c` > `d`). If you're accessing your array of `Foo`s sequentially, it doesn't matter whether it's one stream of `abcd`s, or 4 separate streams. The HW prefetchers in recent Intel CPUs can keep track of something like 10 separate memory streams. – Peter Cordes Jul 31 '15 at 02:51
  • You linked to my answer but referred to it as "dstark's answer". Which one did you mean? – dbush Jul 31 '15 at 02:55
  • REX prefix will increase instruction size one byte, but then you'll be able to work with twice the number of values at a time. However in case the OP can do multiple similar things with the elements at once like that, SIMD is the better way to go. AVX2 or AVX512 will greatly increase the number of things one can do in an instruction – phuclv Jul 31 '15 at 06:37
  • @LưuVĩnhPhúc: I was talking about unpacking a single field to a 32bit (or 64bit) temporary. I agree that SSE/AVX will be a better choice than SIMD-within-a-(gp)-register. – Peter Cordes Jul 31 '15 at 06:41
7

First, precisely define what you mean by "most efficient". Best memory utilization? Best performance?

Then implement your algorithm both ways and actually profile it on the actual hardware you intend to run it on under the actual conditions you intend to run it under once it's delivered.

Pick the one that better meets your original definition of "most efficient".

Anything else is just a guess. Whatever you choose will probably work fine, but without actually measuring the difference under the exact conditions you'd use the software, you'll never know which implementation would be "more efficient".

Andrew Henle
  • 32,625
  • 3
  • 24
  • 56
  • I think it's possible to make fairly accurate predictions about some things, e.g. that storing each field in a separate `int32_t` will be slower than storing each field in its own `int8_t`. There's zero overhead for this on x86, and other answers say that working with 8bit values is efficient on ARM. I think this doesn't really answer the question. Obviously you can try everything and then benchmark, but sometimes you can save development time if you can confidently rule out options you have good reason to believe will be slower. – Peter Cordes Jul 31 '15 at 03:08
  • @PeterCordes the compiler may still need to do sign/zero extension on `int8_t` sometimes, even on x86, although it will happen more on ARM or other RISC architectures – phuclv Jul 31 '15 at 04:15
  • @LưuVĩnhPhúc: Zero/sign extension has no overhead at all when done as part of a load on x86. (`movzx/movsx` cost the same as a `mov` load). You'll only get overhead if you use an `int8_t` local temporary as an array index, or use it in an expression with a wider type. See my answer for more details. – Peter Cordes Jul 31 '15 at 04:25
  • @PeterCordes no, on x86 you must extend the ax/al to eax if you want to do arithmetic on eax. On ARM the load will automatically extend it. But for example with some code like `char c = ...; int x = somevalue; y = c + x;` then c would need to be sign extended on both platforms, if c wasn't loaded from memory before but is the result of some intermediate expression – phuclv Jul 31 '15 at 04:30
  • @LưuVĩnhPhúc: Right. Don't use 8bit local temporaries (unless you want them for 8bit int overflow). Only use 8bit values in RAM. Your example is a case of using a `char` in an expression with a wider type, which is one of the cases where I said x86 *would* have overhead. You wouldn't have this overhead if `c` was an int that you loaded from a `char` field (because the compiler would use `movsx` instead of `mov`, at no extra cost.) I explained this in more detail in my answer. (unless I made it unclear. Let me know if my answer could be improved.) – Peter Cordes Jul 31 '15 at 04:58
5

I think the only real answer can be to write your code generically, and then profile the full program with all of them. I don't think this will take that much time, though it may look a little more awkward. Basically, I'd do something like this:

template <bool is_packed> class Foo;
using interface_int = char;

// unpacked: one byte per field
template <>
class Foo<false> {
    char m_a, m_b, m_c, m_d;
 public:
    void setA(interface_int a) { m_a = a; }
    interface_int getA() { return m_a; }
    ...
};

// packed: all four fields share one byte
template <>
class Foo<true> {
    char m_data;
 public:
    void setA(interface_int a) { /* bit magic changes m_data */ }
    interface_int getA() { /* bit magic gets a from m_data */ }
    ...
};

If you just write your code like this instead of exposing the raw data, it will be easy to switch implementations and profile. The function calls will get inlined and will not impact performance. Note that I wrote setA and getA instead of a function that returns a reference; that would be more complicated to implement.
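
To make the elided "bit magic" concrete, the packed accessors could look like this, written here as plain C-style functions (my sketch, assuming a occupies bits 0-1 of m_data):

unsigned char get_a(unsigned char m_data) {
    return m_data & 0x03u;                                   /* extract bits 0-1 */
}
unsigned char set_a(unsigned char m_data, unsigned char a) {
    return (unsigned char)((m_data & ~0x03u) | (a & 0x03u)); /* merge new a */
}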

Nir Friedman
  • 17,108
  • 2
  • 44
  • 72
4

Code it with ints

Treat the fields as ints.

blah.x in all your code, except for the declaration, will be all you will be writing. Integral promotion will take care of most cases.

When you are all done, have 3 equivalent include files: one using ints, one using chars, and one using bitfields.

And then profile. Don't worry about it at this stage, because it's premature optimization, and nothing but your chosen include file will change.
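
A minimal sketch of how those interchangeable include files might be selected at build time (the macro names here are hypothetical):

/* foo_repr.h - pick one representation per build, profile each */
#if defined(FOO_BITFIELDS)
struct Foo { unsigned char a:2, b:2, c:2, d:2; };
#elif defined(FOO_CHARS)
struct Foo { unsigned char a, b, c, d; };
#else
struct Foo { int a, b, c, d; };
#endif

Compiling with, say, -DFOO_BITFIELDS and rerunning the benchmark is then the only change between runs.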

Glenn Teitelbaum
  • 10,108
  • 3
  • 36
  • 80
4

Massive Arrays and Out of Memory Errors

  1. the whole program consists of a big array of billions of Foos;

First things first: for this point, you might find yourself or your users (if others run the software) often unable to allocate this array successfully if it spans gigabytes. A common mistake here is to think that out-of-memory errors mean "no more memory available", when they instead often mean that the OS could not find a contiguous set of unused pages matching the requested memory size. It's for this reason that people often get confused when they request to allocate a one-gigabyte block only to have it fail even though they have 30 gigabytes of physical memory free. Once you start allocating memory in sizes that span more than, say, 1% of the typical amount of memory available, it's often time to consider avoiding one giant array to represent the whole thing.

So perhaps the first thing you need to do is rethink the data structure. Instead of allocating a single array of billions of elements, often you'll significantly reduce the odds of running into problems by allocating in smaller chunks (smaller arrays aggregated together). For example, if your access pattern is solely sequential in nature, you can use an unrolled list (arrays linked together). If random access is needed, you might use something like an array of pointers to arrays which each span 4 kilobytes. This requires a bit more work to index an element, but with this kind of scale of billions of elements, it's often a necessity.
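
As a sketch of the pointers-to-chunks idea (sizes and names here are illustrative, assuming one byte per element):

#include <stdlib.h>

#define CHUNK_SIZE 4096  /* elements per chunk: ~4 KB at one byte each */

struct FooChunks {
    size_t num_chunks;
    unsigned char **chunks;   /* num_chunks separately allocated blocks */
};

/* locate element i across the chunked storage */
static unsigned char *foo_at(struct FooChunks *fc, size_t i) {
    return &fc->chunks[i / CHUNK_SIZE][i % CHUNK_SIZE];
}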

Access Patterns

One of the things unspecified in the question are the memory access patterns. This part is critical for guiding your decisions.

For example, is the data structure solely traversed sequentially, or is random access needed? Are all of these fields: a, b, c, d, needed together all the time, or can they be accessed one or two or three at a time?

Let's try to cover all the possibilities. At the scale we're talking about, this:

struct Foo {
    int a1;
    int b1;
    int c1;
    int d1;
};

... is unlikely to be helpful. At this kind of input scale, and accessed in tight loops, your times are generally going to be dominated by the upper levels of the memory hierarchy (paging and CPU cache). It no longer becomes quite as critical to focus on the lowest level of the hierarchy (registers and associated instructions). To put it another way, at billions of elements to process, the last thing you should be worrying about is the cost of moving this memory from L1 cache lines to registers and the cost of bitwise instructions (not that it's no concern at all, just that it's a much lower priority).

At a small enough scale, where the entirety of the hot data fits into the CPU cache and random access is needed, this kind of straightforward representation can show a performance improvement due to improvements at the lowest level of the hierarchy (registers and instructions), yet it would require a drastically smaller-scale input than what we're talking about.

So even this is likely to be a considerable improvement:

struct Foo {
    char a1;
    char b1;
    char c1;
    char d1;
};

... and this even more:

// Each field packs 4 values with 2-bits each.
struct Foo {
    char a4; 
    char b4;
    char c4;
    char d4;
};

* Note that you could use bitfields for the above, but bitfields tend to have caveats associated with them depending on the compiler being used. I've often been careful to avoid them due to the portability issues commonly described, though this may be unnecessary in your case. However, as we adventure into SoA and hot/cold field-splitting territories below, we'll reach a point where bitfields can't be used anyway.

This code also places a focus on horizontal logic which can start to make it easier to explore some further optimization paths (ex: transforming the code to use SIMD), as it's already in a miniature SoA form.

Data "Consumption"

Especially at this kind of scale, and even more so when your memory access is sequential in nature, it helps to think in terms of data "consumption" (how quickly the machine can load data, do the necessary arithmetic, and store the results). A simple mental image I find useful is to imagine the computer as having a "big mouth". It goes faster if we feed it large enough spoonfuls of data at once, not little teeny teaspoons, and with more relevant data packed tightly into a contiguous spoonful.

[image: "Hungry Computer"]

Hot/Cold Field Splitting

The above code so far is making the assumption that all of these fields are equally hot (accessed frequently), and accessed together. You may have some cold fields or fields that are only accessed in critical code paths in pairs. Let's say that you rarely access c and d, or that your code has one critical loop that accesses a and b, and another that accesses c and d. In that case, it can be helpful to split it off into two structures:

struct Foo1 {
    char a4; 
    char b4;
};
struct Foo2 {
    char c4;
    char d4;
};

Again if we're "feeding" the computer data, and our code is only interested in a and b fields at the moment, we can pack more into spoonfuls of a and b fields if we have contiguous blocks that only contain a and b fields, and not c and d fields. In such a case, c and d fields would be data the computer can't digest at the moment, yet it would be mixed into the memory regions in between a and b fields. If we want the computer to consume data as quickly as possible, we should only be feeding it the relevant data of interest at the moment, so it's worth splitting the structures in these scenarios.

SIMD SoA for Sequential Access

Moving towards vectorization, and assuming sequential access, the fastest rate at which the computer can consume data will often be in parallel using SIMD. In such a case, we might end up with a representation like this:

struct Foo1 {
    char* a4n;
    char* b4n;
};

... with careful attention to alignment and padding (the size/alignment should be a multiple of 16 or 32 bytes for AVX or even 64 for futuristic AVX-512) necessary to use faster aligned moves into XMM/YMM registers (and possibly with AVX instructions in the future).
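
For instance, C11's aligned_alloc can provide that alignment (a sketch; aligned_alloc requires the size to be a multiple of the alignment):

#include <stdlib.h>

/* allocate one 32-byte-aligned stream of packed bytes for AVX */
char *alloc_stream(size_t count) {
    return aligned_alloc(32, (count + 31) / 32 * 32); /* round size up */
}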

AoSoA for Random/Multi-Field Access

Unfortunately the above representation can start to lose a lot of the potential benefits if a and b are accessed frequently together, especially with a random access pattern. In such a case, a more optimal representation can start looking like this:

struct Foo1 {
    char a4x32[32];
    char b4x32[32];
};

... where we're now aggregating this structure. This makes it so the a and b fields are no longer so spread apart, allowing the 32 bytes of packed a fields and 32 bytes of packed b fields to fit into a single 64-byte cache line and be accessed together quickly. We can also now fit 64 or 128 of these packed a or b values into an XMM or YMM register, respectively.

Profiling

Normally I try to avoid general wisdom advice in performance questions, but I noticed this one seems to avoid the details that someone with a profiler in hand would typically mention. So I apologize if this comes off a bit as patronizing or if a profiler is already being actively used, but I think the question warrants this section.

As an anecdote, I've often done a better job (I shouldn't!) at optimizing production code written by people who have far superior knowledge than me about computer architecture (I worked with a lot of people who came from the punch card era and can understand assembly code at a glance), and would often get called in to optimize their code (which felt really odd). It's for one simple reason: I "cheated" and used a profiler (VTune). My peers often didn't (they had an allergy to it and thought they understood hotspots just as well as a profiler and saw profiling as a waste of time).

Of course the ideal is to find someone who has both the computer architecture expertise and a profiler in hand, but lacking one or the other, the profiler can give the bigger edge. Optimization still rewards a productivity mindset which hinges on the most effective prioritization, and the most effective prioritization is to optimize the parts that truly matter the most. The profiler gives us detailed breakdowns of exactly how much time is spent and where, along with useful metrics like cache misses and branch mispredictions which even the most advanced humans typically can't predict anywhere close to as accurately as a profiler can reveal. Furthermore, profiling is often the key to discovering how the computer architecture works at a more rapid pace, by chasing down hotspots and researching why they exist. For me, profiling was the ultimate entry point into better understanding how the computer architecture actually works, rather than how I imagined it to work. It was only then that the writings of someone as experienced in this regard as Mysticial started to make more and more sense.

Interface Design

One of the things that might start to become apparent here is that there are many optimization possibilities. The answers to this kind of question are going to be about strategies rather than absolute approaches. A lot still has to be discovered in hindsight after you try something, as you keep iterating towards more and more optimal solutions as you need them.

One of the difficulties here in a complex codebase is leaving enough breathing room in the interfaces to experiment and try different optimization techniques, to iterate and iterate towards faster solutions. If the interface leaves room to seek these kinds of optimizations, then we can optimize all day long and often get some marvelous results if we're measuring things properly even with a trial and error mindset.

Leaving enough breathing room in an implementation to experiment and explore faster techniques often requires interface designs that accept data in bulk. This is especially true if the interfaces involve indirect function calls (ex: through a dylib or a function pointer) where inlining is no longer an effective possibility. In such scenarios, leaving room to optimize without cascading interface breakages often means designing away from the mindset of receiving simple scalar parameters in favor of passing pointers to whole chunks of data (possibly with a stride if there are various interleaving possibilities). So while this strays into pretty broad territory, a lot of the top priorities in optimizing here boil down to leaving enough breathing room to optimize implementations without cascading changes throughout your codebase, and having a profiler in hand to guide you the right way.

TL;DR

Anyway, some of these strategies should help guide you the right way. There are no absolutes here, only guides and things to try out, and always best done with a profiler in hand. Yet when processing data of this enormous scale, it's always worth remembering the image of the hungry monster, and how to most effectively feed it these appropriately-sized and packed spoonfuls of relevant data.

3

Let's say you have a memory bus that's a little bit older and can deliver 10 GB/s. Now take a CPU at 2.5 GHz, and you see that you would need to handle at least four bytes per cycle to saturate the memory bus. As such, when you use the definition of

struct Foo {
    char a;
    char b;
    char c;
    char d;
};

and use all four variables in each pass through the data, your code will be CPU bound. You can't gain any speed by a denser packing.

Now, this is different when each pass only performs a trivial operation on one of the four values. In that case, you are better off with a struct of arrays:

struct Foo {
    size_t count;
    char* a;    //a[count]
    char* b;    //b[count]
    char* c;    //c[count]
    char* d;    //d[count]
};
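
With that layout, a pass touching only one field streams through a dense array, which compilers can often auto-vectorize (a quick sketch; touch_a_only is a hypothetical example):

void touch_a_only(struct Foo *foos) {
    for (size_t i = 0; i < foos->count; i++)
        foos->a[i] &= 3;   /* only the a stream is loaded; b, c, d stay cold */
}
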
cmaster - reinstate monica
  • 38,891
  • 9
  • 62
  • 106
  • I'm not sure I follow the struct of arrays example. Do you mean storing the data for every Foo in a single struct? Wouldn't that lead to more cache misses since now "close" data (i.e., the `a` and `b` from the same Foo) is in distant places? – MaiaVictor Jul 30 '15 at 23:02
  • 1
    @Viclib: Yes, you've understood the layout correctly. If you access sequential `Foo`s, but only need the `a` and `b`, not the other fields, then it's a win to use this struct-of-arrays method. This allows things like `c[i] = a[i] + b[i]` to be implemented with vector instructions. e.g. x86 can do two 128b loads and a `PADDB`. – Peter Cordes Jul 31 '15 at 02:03
  • @cmaster: "You can't gain any speed by a denser packing." This is only true for sequential access, and with the other cores in your system not using any memory bandwidth. For random access (where you might touch an element again and find it still cached), fitting more elements into the same cache size is important. – Peter Cordes Jul 31 '15 at 06:33
  • there is a technique called [field splitting](http://www.cis.upenn.edu/~cis570/slides/lecture16.pdf) or [structure splitting](http://www.capsl.udel.edu/conferences/open64/2008/Papers/111.pdf), where the more frequently accessed fields of structures/objects are split into one object and the cold fields into another, in order to achieve higher cache utilization. The Intel Profiler is able to do this but I can't find the article right now – phuclv Jul 31 '15 at 06:48
3

You've stated the common and ambiguous C/C++ tag.

Assuming C++, make the data private and add getters/ setters. No, that will not cause a performance hit - providing the optimizer is turned on.

You can then change the implementation to use the alternatives without any change to your calling code - and therefore more easily finesse the implementation based on the results of the bench tests.

For the record, I'd expect the struct with bit fields as per @dbush to be most likely the fastest given your description.

Note all this is around keeping the data in cache - you may also want to see if the design of the calling algorithm can help with that.

Keith
  • 6,756
  • 19
  • 23
3

Getting back to the question asked:

used in a tight loop;

their values are read a billion times/s, and that is the bottleneck of the program;

the whole program consists of a big array of billions of Foos;

This is a classic example of when you should write platform specific high performance code that takes time to design for each implementation platform, but the benefits outweigh that cost.

As it's the bottleneck of the entire program you don't look for a general solution, but recognize that this needs to have multiple approaches tested and timed against real data, as the best solution will be platform specific.

It is also possible, as it is a large array of billions of Foos, that the OP should consider using OpenCL or OpenMP as potential solutions so as to maximize the exploitation of available resources on the runtime hardware. This is a little dependent on what you need from the data, but it's probably the most important aspect of this type of problem - how to exploit available parallelism.

But there is no single right answer to this question, IMO.

2

The most efficient representation, in terms of performance and execution, is to use the processor's word size. Don't make the processor perform the extra work of packing or unpacking.

Some processors have more than one efficient size. Many ARM processors can operate in 8/32 bit mode. This means that the processor is optimized for handling 8 bit quantities or 32-bit quantities. For a processor like this, I recommend using 8-bit data types.

Your algorithm has a lot to do with the efficiency. If you are moving or copying data, you may want to consider moving it 32 bits at a time (four 8-bit quantities). The idea here is to reduce the number of fetches by the processor.

For performance, write your code to make use of registers, such as using more local variables. Fetching from memory into registers is more costly than using registers directly.

Best of all, check your compiler's optimization settings. Set your compiler to the highest performance (speed) settings. Next, generate assembly language listings of your functions. Review the listings to see how the compiler generated code. Adjust your code to improve the compiler's optimization capabilities.
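
With GCC, for example, `gcc -O3 -S -fverbose-asm foo.c` emits such a listing to `foo.s`; the exact flags vary by compiler.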

Thomas Matthews
  • 56,849
  • 17
  • 98
  • 154
  • 5
    Can't make blanket statements like this; sometimes the extra work of packing/unpacking pays back by reducing cache misses. – Nir Friedman Jul 30 '15 at 21:25
  • 1
    On x86, zero or sign extending 8 bit loads to 32bit temporaries is free. So is storing the low 8 bits of a 32bit register. Use small types in storage arrays, but int for temporaries. – Peter Cordes Jul 31 '15 at 02:06
  • 1
    Good suggestion to use locals, to make sure the compiler doesn't keep reloading the same value from memory because it doesn't know which pointers might alias each other. Copying whole structs should generate efficient code, but you should check. I think I've seen gcc emit code that copies fields one at a time. – Peter Cordes Jul 31 '15 at 03:02
2

If what you're after is efficiency of space, then you should consider avoiding structs altogether. The compiler will insert padding into your struct representation as necessary to make its size a multiple of its alignment requirement, which might be as much as 16 bytes (but is more likely to be 4 or 8 bytes, and could after all be as little as 1 byte).

If you use a struct anyway, then which to use depends on your implementation. If @dbush's bitfield approach yields one-byte structures then it's hard to beat that. If your implementation is going to pad the representation to at least four bytes no matter what, however, then this is probably the one to use:

struct Foo {
    char a;
    char b;
    char c;
    char d;
};

Or I guess I would probably use this variant:

struct Foo {
    uint8_t a;
    uint8_t b;
    uint8_t c;
    uint8_t d;
};

Since we're supposing that your struct is taking up a minimum of four bytes, there is no point in packing the data into smaller space. That would be counter-productive, in fact, because it would also make the processor do the extra work of packing and unpacking the values within.

For handling large amounts of data, making efficient use of the CPU cache provides a far greater win than avoiding a few integer operations. If your data usage pattern is at least somewhat systematic (e.g. if after accessing one element of your erstwhile struct array, you are likely to access a nearby one next) then you are likely to get a boost in both space efficiency and speed by packing the data as tightly as you can. Depending on your C implementation (or if you want to avoid implementation dependency), you might need to achieve that differently -- for instance, via an array of integers. For your particular example of four fields, each requiring two bits, I would consider representing each "struct" as a uint8_t instead, for a total of 1 byte each.

Maybe something like this:

#include <stdint.h>

#define NUMBER_OF_FOOS 1000000000
#define A 0
#define B 2
#define C 4
#define D 6

#define SET_FOO_FIELD(foos, index, field, value) \
    ((foos)[index] = (((foos)[index] & ~(3 << (field))) | (((value) & 3) << (field))))
#define GET_FOO_FIELD(foos, index, field) (((foos)[index] >> (field)) & 3)

typedef uint8_t foo;

foo all_the_foos[NUMBER_OF_FOOS];

The field name macros and access macros provide a more legible -- and adjustable -- way to access the individual fields than would direct manipulation of the array (but be aware that these particular macros evaluate some of their arguments more than once). Every bit is used, giving you about as good cache usage as it is possible to achieve through choice of data structure alone.
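
Usage then looks like this (continuing the example above):

SET_FOO_FIELD(all_the_foos, 123456789, C, 2);
int c = GET_FOO_FIELD(all_the_foos, 123456789, C);  /* c == 2 */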

John Bollinger
  • 160,171
  • 8
  • 81
  • 157
2

I did video decompression for a while. The fastest thing to do is something like this:

short ABCD; //use a 16 bit data type for your example

and set up some macros. Maybe:

#define GETA ((ABCD >> 12) & 0x000F)
#define GETB ((ABCD >> 8) & 0x000F)
#define GETC ((ABCD >> 4) & 0x000F)
#define GETD (ABCD  & 0x000F)  // no need to shift D

In practice you should try to move 32-bit longs or 64-bit long longs, because that's the native MOVE size on most modern processors.

Using a struct will always create the overhead in your compiled code of extra instructions to get from the base address of your struct to the field. So get away from that if you really want to tighten your loop.

Edit: The above example gives you 4-bit values. If you really just need values 0..3, then you can do the same thing to pull out your 2-bit numbers, so GETA might look like this:

#define GETA ((ABCD >> 14) & 0x0003)

And if you are really moving billions of things, and I don't doubt it, just fill up a 32-bit variable and shift and mask your way through it.
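
In that spirit, a sketch of walking sixteen 2-bit values through one 32-bit word (consume_word is a hypothetical consumer):

#include <stdint.h>

void consume_word(uint32_t packed) {
    for (int i = 0; i < 16; i++) {
        unsigned v = (packed >> (2 * i)) & 3u;
        /* ... use v ... */
    }
}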

Hope this helps.