13

I have a structure called `log` that holds 13 chars of data. After doing a `sizeof(log)` I see that the size is not 13 but 16. I can use `__attribute__((packed))` to get it down to the actual 13 bytes, but I wonder whether this will affect the performance of the program. It is a structure that is used quite frequently.

I would like to be able to read the size of the structure (13, not 16). I could use a macro, but if this structure is ever changed, i.e. fields added or removed, I would like the new size to be picked up without editing the macro, because I think that is error prone. Any suggestions?
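For illustration, a layout like this reproduces the effect under GCC (the field names here are made up; the point is 13 bytes of data containing a 4-byte member):

```c
#include <stdio.h>

struct log {
    unsigned int timestamp;   /* 4 bytes (hypothetical field) */
    char msg[9];              /* 9 bytes */
};                            /* 13 bytes of data, padded to 16 */

struct log_packed {
    unsigned int timestamp;
    char msg[9];
} __attribute__((packed));    /* exactly 13 bytes */

int main(void) {
    printf("%zu %zu\n", sizeof(struct log), sizeof(struct log_packed));
    /* prints: 16 13 (with GCC on a typical 32/64-bit target) */
    return 0;
}
```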

yan bellavance

5 Answers

16

Yes, it will affect the performance of the program. The padding ensures each field is aligned, so the compiler can use single integer load instructions to read fields from memory. Without the padding, the compiler may have to load the bytes separately and combine them with shifts to assemble the whole value. (Even on x86, where the hardware handles unaligned loads, the work still has to be done somewhere.)

Consider this: why would compilers insert random, unused space if it were not for performance reasons?
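As a rough sketch of what "load things separately and do bit shifting" means, here is the byte-by-byte assembly a compiler may effectively have to perform (written in C for clarity; little-endian order is assumed here):

```c
#include <stdint.h>

/* Reading a 4-byte value from an address that may not be 4-byte
   aligned: four byte loads plus shifts, instead of one 32-bit load. */
static uint32_t load_unaligned_u32(const unsigned char *p) {
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}
```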

Billy ONeal
  • Most hardware handles most unaligned loads without a penalty. The exception to the rule is when the access straddles some kind of boundary: cache line, page, etc. Mentioning instructions is misleading. In particular, if the working set does not fit into the cache (not an unusual situation), the benefit of fewer DRAM transactions for a "compressed" array will probably outweigh the extra cache accesses. Doubly so for structures written to disk. – Potatoswatter Aug 11 '10 at 01:58
  • @Potatoswatter: "most"? Maybe if "most machines are x86" your statement has some chance of being true, but last I checked most machines are embedded systems, cell phones, etc. On most hardware, unaligned access means the compiler must generate code that performs the loads/stores byte-by-byte, possibly with bit shifting and bitwise OR to assemble values, to work with larger types. This is a huge penalty. – R.. GitHub STOP HELPING ICE Aug 11 '10 at 05:54
  • @R: Pre-ARMv6 ARM doesn't support misalignment, according to Wikipedia. Aside from that, SPARC, and DSPs, most architectures do support it. Anyway, even tedious byte flipping done at CPU speed might not be slower than extra disk/flash/DRAM transfer time. – Potatoswatter Aug 11 '10 at 06:14
  • @Potatoswatter: There are more DSPs and CPU cores with sizable penalties for unaligned access out there than you want to believe. Note that SSE2 on x86 requires alignment as well. This is really one of those areas where it is *much* better to leave the default behavior alone, unless you have a very good reason. Even then, test and benchmark to be sure. – RBerteig Aug 11 '10 at 07:47
  • @Potatoswatter: Why would the compiler insert alignment padding if it was **not** for performance reasons? – Billy ONeal Aug 11 '10 at 11:45
  • @Potatoswatter: Err... your answer revolves around performance reasons. (It's a good answer, I +1'd it) but I fail to see how it answers my question. – Billy ONeal Aug 11 '10 at 20:13
  • @Billy: Then I don't understand your question. There is a performance tradeoff and I never suggested otherwise. – Potatoswatter Aug 11 '10 at 23:33
  • @Potatoswatter: Ah -- I thought you were responding to my comment asking why a compiler would do that if it was **not** for performance reasons. – Billy ONeal Aug 12 '10 at 02:47
  • @Potatoswatter: Actually, I was misinformed about the above. Misalignment carries a penalty even on x86. – Billy ONeal Jan 16 '13 at 02:40
  • @BillyONeal, would you be able to say if the mentioned performance degradation is ONLY for instructions involving the "packed" structs. I asked a related question in https://stackoverflow.com/questions/66453335/performance-impact-of-attribute-packed-is-performance-impact-present-onl – aKumara Mar 05 '21 at 13:57
6

Don't use `__attribute__((packed))`. If your data structure is in-memory, let it occupy its natural size as determined by the compiler. If it's for reading/writing to/from disk, write serialization and deserialization functions; do not simply store CPU-native binary structures on disk. "Packed" structures really have no legitimate uses (or very few; see the comments on this answer for possible disagreeing viewpoints).
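A minimal sketch of such serialization functions, assuming the hypothetical 13-byte record from the question and an arbitrarily chosen little-endian wire format:

```c
#include <stdint.h>
#include <string.h>

/* hypothetical record; in memory it keeps its natural padding */
struct log {
    uint32_t timestamp;
    char msg[9];
};

#define LOG_WIRE_SIZE 13   /* 4 + 9 bytes on disk, no padding */

void log_serialize(const struct log *in, unsigned char out[LOG_WIRE_SIZE]) {
    /* write the integer in a fixed byte order, independent of the CPU */
    out[0] = in->timestamp & 0xff;
    out[1] = (in->timestamp >> 8) & 0xff;
    out[2] = (in->timestamp >> 16) & 0xff;
    out[3] = (in->timestamp >> 24) & 0xff;
    memcpy(out + 4, in->msg, 9);
}

void log_deserialize(const unsigned char in[LOG_WIRE_SIZE], struct log *out) {
    out->timestamp = (uint32_t)in[0]
                   | (uint32_t)in[1] << 8
                   | (uint32_t)in[2] << 16
                   | (uint32_t)in[3] << 24;
    memcpy(out->msg, in + 4, 9);
}
```

Files written this way stay readable even if the compiler, platform, or struct padding rules change.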

R.. GitHub STOP HELPING ICE
  • There are other situations where you have to deal with bit-by-bit arranged data structures. For example, most SPI or I2C devices take bytes of data with a very specific structure. Given the choice between 20 or so bit shifting and masking operations, or a documented, packed data structure and well defined type punning, I'd take the latter. – detly Aug 11 '10 at 06:07
  • I would suggest that mapping structures onto hardware registers is a legitimate use on embedded systems, for example. – jcoder Aug 11 '10 at 06:09
  • I would group these usages with writes to disk, as "serialization". It's questionable whether the compiler with `__attribute__((packed))` would generate better code than you could do by hand with macros, and the latter would be portable (to other C implementations on the same hardware), but I'll grant that this is one place it might make sense to use such a compiler extension. – R.. GitHub STOP HELPING ICE Aug 11 '10 at 06:34
  • @R - agreed that it sacrifices portability, but depending on the number of bytes and structure, I find it more readable. – detly Aug 11 '10 at 08:18
  • It is a common practice in filesystem-related parsing (as an example, UBI header structures, which are endianness-independent since the data is guaranteed to be in Big Endian). I can assume that this practice could be used in network-related code as well. – Rerito Apr 15 '13 at 11:33
  • @Rerito: It's common practice, but that doesn't mean it's good. What's really needed is a type that encapsulates a fixed number of spaces of some integer type, and has members which are declared as occupying particular ranges of bits or bytes within that blob. Such a structure would allow 100% portable I/O code to be written on all machines, and would allow compilers to take advantage of things like bit-field stuff/extract features. – supercat Nov 21 '13 at 17:20
  • @supercat: But taking the address of such members would still be problematic. I think what's needed is just for people to stop making such stupid, backwards interfaces... – R.. GitHub STOP HELPING ICE Nov 21 '13 at 22:10
  • @R..: The members would have the same sorts of restrictions as bitfields; unlike bitfields, however, they'd be portable. I'm not clear what you mean, though, by "stupid, backwards interfaces" – supercat Nov 21 '13 at 22:16
  • I mean packing data in layouts that don't fit with natural alignment for the type in use. – R.. GitHub STOP HELPING ICE Nov 21 '13 at 22:21
  • Not all machines use the same natural alignment, and one really doesn't want to add gratuitous padding when sending e.g. radio packets which are hardware constrained to a 32-byte payload. – supercat Nov 21 '13 at 22:47
  • The natural alignment is simply the size of the type; that works for all systems because the required alignment must divide the size of the type. Padding is easily avoided by ordering the members correctly. – R.. GitHub STOP HELPING ICE Nov 21 '13 at 22:48
5

Yes, it can affect the performance. In this case, if you allocate an array of such structures with the ((packed)) attribute, most of the elements will end up unaligned (whereas with the default layout they can all be aligned on 16-byte boundaries, since the padded size is 16). Copying such structures around can be faster if they are aligned.
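To see why, consider a sketch with a hypothetical 13-byte packed record: with a 13-byte stride, only the first array element starts at a 4-byte-aligned address.

```c
#include <stdio.h>
#include <stdint.h>

struct rec {
    uint32_t id;        /* hypothetical 4-byte field */
    char payload[9];
} __attribute__((packed));   /* sizeof(struct rec) == 13 */

int main(void) {
    struct rec arr[4];
    for (int i = 0; i < 4; i++)
        /* element offsets are 0, 13, 26, 39: the uint32_t member is
           4-byte aligned only in the first element */
        printf("arr[%d] starts at offset %ld\n", i,
               (long)((char *)&arr[i] - (char *)arr));
    return 0;
}
```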

caf
5

Yes, it can affect performance. How much depends on what the structure is and how you use it.

An unaligned variable can straddle two cache lines. For example, with 64-byte cache lines, if you read a 4-byte variable from an array of 13-byte structures, there is a 3-in-64 (about 4.7%) chance that it will be spread across two lines. The penalty of the extra cache access is pretty small. If everything your program did was pound on that one variable, 4.7% would be the upper bound of the performance hit. If logging represents 20% of the program's workload, and reading/writing that structure is 50% of logging, then you're already down to a small fraction of a percent.

On the other hand, presuming that the log needs to be saved, shrinking each record from 16 to 13 bytes saves you 19%, which translates into a lot of memory or disk space. Main memory and especially disk are slow, so you will probably be better off packing the log to reduce its size.


As for reading the size of the structure without worrying about it changing, use sizeof. However you like to define numeric constants, be it const int, enum, or #define, just wrap sizeof around the structure type.
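For instance (the struct fields are the hypothetical ones from the question):

```c
#include <stdio.h>

struct log {
    unsigned int timestamp;   /* hypothetical fields */
    char msg[9];
} __attribute__((packed));

/* sizeof is evaluated by the compiler, so this constant tracks the
   struct automatically when fields are added or removed */
enum { LOG_SIZE = sizeof(struct log) };

int main(void) {
    printf("%d\n", LOG_SIZE);   /* prints 13 */
    return 0;
}
```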

Potatoswatter
  • Creating saves by fwrite-ing structures is probably not a good idea in the first place -- Moving to a different compiler or platform would make previous saves worthless. – Billy ONeal Aug 11 '10 at 20:12
  • @Billy: The argument also applies to slow DRAM ("main memory") not written to disk. Anyway, proper serialization simply requires converting to a standard endianness. – Potatoswatter Aug 11 '10 at 23:44
  • And ensuring the sizes of the types in your struct cannot change. – Billy ONeal Aug 12 '10 at 02:46
1

As with all other performance optimizations, you'll need to profile your code to find the right answer. The right answer will vary by architecture, and by how you use your structure.

If you're creating gigantic arrays, the space savings from packing might mean the difference between fitting and not fitting in cache. Or your data might already fit into the cache, in which case packing will make no difference. If you're allocating large numbers of these structures in an STL associative container that allocates the storage for your struct with operator new, it might not matter at all: operator new might round your allocation up to an aligned size anyway.

If most of your structures live on the stack the extra storage might already be optimized away anyway.

For a change this simple to test, I suggest building a timing rig and then trying things both ways. For further optimizations I suggest using a profiler to identify your bottlenecks and go from there.
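A minimal timing rig might look like this (a sketch, not a rigorous benchmark; the struct layout is the hypothetical one from the question). Compile once plain and once with -DPACKED, then compare:

```c
#include <stdio.h>
#include <time.h>

#ifdef PACKED
#define ATTR __attribute__((packed))
#else
#define ATTR
#endif

struct log {
    unsigned int timestamp;
    char msg[9];
} ATTR;

#define N 1000000

static struct log records[N];

int main(void) {
    clock_t start = clock();
    unsigned long sum = 0;
    for (int pass = 0; pass < 100; pass++)
        for (int i = 0; i < N; i++)
            sum += records[i].timestamp;  /* the member most affected by misalignment */
    clock_t end = clock();
    /* printing sum keeps the loop from being optimized away */
    printf("sum=%lu time=%.3fs\n", sum,
           (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}
```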

razeh