44

Does alignment really matter for performance in C++11?

There is an advice in Stroustrup's book to order the members in a struct beginning from the biggest to the smallest. But I wonder if someone has made measurements to actually see if this makes any difference, and if it is worth it to think about when writing code.

Ben Voigt
  • 277,958
  • 43
  • 419
  • 720
user3111311
  • 7,583
  • 7
  • 33
  • 48
  • 12
    "But I wonder if someone has made measurements to actually see if this makes any difference," - you could do that and report back.... – Mitch Wheat Dec 28 '13 at 00:40
  • 5
    @MitchWheat If I knew how I would not have asked. – user3111311 Dec 28 '13 at 00:44
  • 3
    you write some code with a struct aligned one way and then the other and you access it in a loop for say a million iterations and you time it. Pretty simple program. – Mitch Wheat Dec 28 '13 at 00:45
  • 5
    It really depends on architecture. Some processors are simply unable to handle unaligned data, and as a result, require arithmetic at the software level to split an integer over an alignment boundary, which is obviously going to waste cycles. – Mark H Dec 28 '13 at 00:46
  • Note that the order is not always preserved by the compiler: http://stackoverflow.com/questions/281045/do-class-struct-members-always-get-created-in-memory-in-the-order-they-were-decl – Alexei Averchenko Dec 28 '13 at 00:47
  • 4
    If you cannot measure a performance difference, why would it matter to you? – Potatoswatter Dec 28 '13 at 01:13
  • 3
    A real-life data point: In a Java JVM we changed the way Java object data fields are allocated, sorting by size (with some limitations). The result was, IIRC, about a 10% performance improvement on a server benchmark, due to the storage savings alone. (Ultimately, by removing "excess" space in Strings and other standard objects we achieved a 30% performance improvement overall.) – Hot Licks Jan 02 '14 at 17:39

2 Answers2

97

Alignment matters not only for performance, but also for correctness. Some architectures will fail with an processor trap if the data is not aligned correctly, or access the wrong memory location. On others, access to unaligned variables is broken into multiple accesses and bitshifts (often inside the hardware, sometimes by OS trap handler), losing atomicity.

The advice to sort members in descending order of size is for optimal packing / minimum space wasted by padding, not for alignment or speed. Members will be correctly aligned no matter what order you list them in, unless you request non-conformant layout using specialized pragmas (i.e. the non-portable #pragma pack) or keywords. Although total structure size is affected by padding and also affects speed, often there is another ordering that is optimal.

For best performance, you should try to get members which are used together into the same cache line, and members that are accessed by different threads into different cache lines. Sometimes that means a lot of padding to get a cross-thread shared variable alone in its own cache line. But that's better than taking a performance hit from false sharing.

Ben Voigt
  • 277,958
  • 43
  • 419
  • 720
  • 2
    This is a top-notch answer – Lightness Races in Orbit Dec 28 '13 at 00:51
  • 1
    "but also for correctness" It's not possible to order data members in a class such that the members are illegally aligned; such illegally aligned accesses are a separate issue generally arising from aliasing violations rather than member ordering. – bames53 Dec 28 '13 at 00:52
  • 1
    @bames53: I was just editing to clarify that. It is possible, but only with non-standard compiler-specific packing pragmas. – Ben Voigt Dec 28 '13 at 00:53
  • 3
    Perhaps it would be beneficial to explain why sorting helps: the compiler will respect alignment rules, but also has to respect member order. If you sort by size, the compiler needs to insert less padding. This is because sorting by size in practice amounts to sorting by alignment restrictions (which is what you actually should do - a `char[53]` should go last) – MSalters Dec 28 '13 at 01:45
  • @MSalters: Depends whether you consider a `char[53]` to be one object of size 53, or 53 objects each of size 1. But yes, padding really is minimized by grouping objects of like alignment together. And that still isn't the best for performance. – Ben Voigt Dec 30 '13 at 19:28
  • -1, A assumes one usage pattern, for some use cases it might be faster to have smaller footprint(more objects fit in L* cache), and also false sharing might not be a problem if you use data in a certain way. Also even for small amount of false sharing false sharing is not a perf problem... – NoSenseEtAl Dec 31 '13 at 12:33
  • @NoSenseEtAl: Certainly some data structures are never accessed from multiple threads. I don't see how that makes my answer wrong. I said "members accessed from multiple threads should go into separate cache lines", if there are no such members, that requirement is trivially met. Similarly, if *all* members are commonly used together, then the requirement of putting members used together into as few cache lines as possible is trivially met by packing as tightly as possible. I'm not assuming a particular usage pattern, I'm considering the simpler cases to be special cases. – Ben Voigt Dec 31 '13 at 17:33
  • @NoSenseEtAl: Also, I did point out that smaller footprint does affect cache performance, and that sometimes small footprint is optimal for performance. But often it is not, and it is important for readers to understand why not and what other considerations are in play. – Ben Voigt Dec 31 '13 at 17:35
  • " Although total structure size is affected by padding and also affects speed, often there is another ordering that is optimal." == me removing -1 :) – NoSenseEtAl Jan 02 '14 at 07:57
  • I'm doing this alignment thing to only beautify my code not knowing it has performance gain. – mr5 Jan 02 '14 at 13:44
  • @mr5 What "alignment thing"? Do you mean you really think that going through all class declarations sorting members from largest to smallest "beautifies code"? On the contrary, for non-trivial classes, it usually makes declarations look horrible & illogical. It means you must constantly check `sizeof` every member type you use & update based on it. More importantly, in some cases, _it simply isn't possible_ since members can depend upon other members being initialised first, regardless of what their `sizeof`s happen to be. **None** of these things lead to "beautiful" code or coding experience. – underscore_d Aug 13 '16 at 17:53
  • According to https://www.youtube.com/watch?v=BP6NxVxDQIs (starting at minute 33) pragma pack does not affect speed on newer processors anymore. – user1911091 Jul 29 '19 at 07:02
  • 1
    @user1911091: I didn't watch to see what the claim was, but pragma pack still affects correctness. You can't use unaligned variables with most SSE instructions. You can't use unaligned addresses with interlocked instructions (x86 LOCK prefix). Speed considerations are secondary to having the program work. – Ben Voigt Jul 29 '19 at 13:12
10

Just to add to Ben's great answer:

Defining struct members in the same order they are later accessed in your application will reduce cache misses and possibly increase performance. This will work provided the entire structure does not fit into L1 cache.

On the other hand, ordering the members from biggest to smallest may reduce overall memory usage, which may be important when storing an array of small structures.

Let's assume that for an architecture (I don't know them that well, I think that would be the case for default settings 32bit gcc, someone will correct me in comments) this structure:

struct MemoryUnused {
  uint8_t val0;
  uint16_t val1;
  uint8_t val2;
  uint16_t val3;
  uint8_t val4;
  uint32_t val5;
  uint8_t val6;
}

takes 20 bytes in memory, while this:

struct MemoryNotLost {
  uint32_t val5;
  uint16_t val1;
  uint16_t val3;
  uint8_t val0;
  uint8_t val2;
  uint8_t val4;
  uint8_t val6;
}

Will take 12. That's 8 bytes lost due to padding, and it's a 67% increase in size of the smallers struct. With a large array of such structs, the gain would be significant and, simply because of the amount of used memory, will decrease the amount of cache misses.

Dariusz
  • 21,561
  • 9
  • 74
  • 114
  • 1
    While you have a theoretical point, your argument seems pretty mute in all circumstances I know about: First, only very few `struct`s are always accessed in the same order. Second, a `struct` that does not fit into level 1 cache is a veritable monster and should never be produced. Third, we frequently have many small objects in large arrays, and here only one thing matters for performance: the overall size of the objects. And fourth, we should not have unrelated parts within a single structure anyway, that is against the spirit of object orientation. – cmaster - reinstate monica Jan 02 '14 at 12:18
  • 2
    @cmaster First, the question is theoretical. Second, we use multiple char[] in a struct to store user input data in frontend - we may not be proud of it, but that's the case. Pointers to multiple long strings would be as bad. Third - that's the point, overall size may be reduced with ordering. Fourth - I do not understand what you're addressing here. Certainly not my post. – Dariusz Jan 02 '14 at 12:22
  • 1
    @cmaster: Quite the contrary... whenever you compose objects to make a larger object, you have operations (the subobject behaviors) that access only a subset of the complete object. That's not "against the spirit of OO", that's normal. – Ben Voigt Jan 02 '14 at 17:02
  • And your conclusion is wrong. If there is a large array of such structs and there exists some loop that uses the `int32_t` member (whose name strangely changed from `val5` to `val0`) and the `int8_t val6` member, the latter memory layout will cause twice as many cache misses as the former (padded) one. There probably exists a packed layout that doesn't suffer this problem, but simple size sorting isn't it. – Ben Voigt Jan 02 '14 at 17:06
  • 1
    @BenVoigt I changed the names. I didn't think they were meaningful, but now I realize they can be confusing. Thanks for the note. Does your last note still apply? If so, please elaborate, since I am not certain I understand it. – Dariusz Jan 02 '14 at 17:24
  • @cmaster - The term is "moot". – Hot Licks Jan 02 '14 at 17:31
  • @Dariusz: By sorting by size, you've moved `val5` and `val6` apart in memory, so a single cache line load won't contain both (well for a large enough cache line, it might. Pretend the structure is larger so it doesn't). If there was a loop which accessed only those two members, the packed layout would actually be slower since there would be twice as many RAM accesses required. – Ben Voigt Jan 02 '14 at 17:57
  • @BenVoigt You are of course right - but these are separate points. You might consider the first optimizing for speed and the other for minimizing memory usage. – Dariusz Jan 02 '14 at 18:12
  • @BenVoigt - But that assumes they actually would be accessed in that order. There are all sorts of reasons why fields might be in a particular order, and the odds that they're listed in order of reference is fairly small. – Hot Licks Jan 02 '14 at 20:43