structure padding - what is the purpose of natural alignment?

Question

I was learning about structure padding and data alignment. I came about this point that all the elements of the structure in the memory should be in natural alignment. so for example if I have following structure declared:

   struct align{
    char        c;
    double      d;
    int         s;
   };

If I take a 32 bit architecture, then it fetches 4 bytes at a time.So keeping this point in mind,if I start padding I will get(my assumption):
1byte(char) + 3bytes(padding) + 8bytes(double) + 4bytes(int) ---------> 1

all these shall be fetched with minimum machine cycles.

But originally the following is happening:
1byte(char) + 7bytes(padding) + 8bytes(double) + 4bytes(int) ----------> 2

why is it that we need this natural alignment for double when we could save 4bits while going with method 1 (while fetching each element with same no. of machine cycles in both cases) ?

The main purpose of natural alignment is to avoid misaligned access to data members in the structure — which can slow things down (sometimes radically — the DEC Alpha was particularly bad). 'Natural alignment' for an N-byte quantity (N = 1, 2, 4, 8, sometimes 16) is usually a multiple of its size, so the correct alignment for an 8-byte `double` is on a multiple of 8 bytes from the start of the structure (and there is tail padding after the `int` to take the size of the structure to 24 bytes total). Some other machines use 4-byte alignment for `double`; then the size will be 16 bytes. — Jonathan Leffler, Sep 04 '14 at 05:16
And compilers rightfully usually regard speed as more important than space. — Jonathan Leffler, Sep 04 '14 at 05:17

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

Natural alignment refers to the size of the variable, not the size of the processor register and/or data path. A floating point double is 8 bytes, and so its natural alignment is 8 bytes. To be more precise, the natural alignment is the smallest power of 2 that is large enough to hold the variable, that definition covers the case of "long double" or x86 extended precision which is a 10-byte variable and whose natural alignment is a multiple of 16 bytes. For x86 processors see the optimization manual and search for alignment, you will find this is a subject rich in detail and specifics vary by micro-architecture, even within the same processor family. In particular, section 3.6.4 Alignment says

For best performance, align data as follows:

Align 8-bit data at any address.

Align 16-bit data to be contained within an aligned 4-byte word.

Align 32-bit data so that its base address is a multiple of four.

Align 64-bit data so that its base address is a multiple of eight.

Align 80-bit data so that its base address is a multiple of sixteen.

Align 128-bit data so that its base address is a multiple of sixteen.

The Pentium 4 is a 32-bit processor, part of the IA-32 family, yet it has a 64-bit data path (Front Side Bus). There are 32-bit processors that have only 16-bit buses, see 32-bit computing historical perspective. Accessing a variable at an alignment other than its natural alignment may result in a performance penalty, or an alignment fault, depending on the processor, in some cases the setting of a control bit, the type of variable, the instruction used, etc.

The actual alignment is up to the compiler and the calling conventions. For structures the requirement is that the first member variable must be at offset 0 (zero) and variables must be allocated in the order they are declared, padding may be inserted between variables for alignment and after the last variable to pad the size of the structure. In 32-bit Windows the stack is only required to be 4-byte aligned, so the compiler would have to generate extra code to ensure 8-byte alignment of a double allocated on the stack.

In Agner Fog's Calling Conventions document you will find details on the alignment used in different operating systems and by different compilers. The stack has a 4-byte alignment in 32-bit Windows, which explains why you may have observed a floating point double aligned at a 4-byte but not 8-byte boundary when allocated on the stack - the compiler doesn't have a clue when a function gets called whether the stack will be 8-byte aligned or not. In table-2 of that document it shows the alignment of various data types allocated in static storage as implemented by various compilers, you will notice that in 32-bit Windows the only compiler that allows 4-byte alignment for double is the Borland compiler.

enter image description here

When allocating in a structure according to that document the Borland compiler allows double to be at any byte offset (which I find surprising).

enter image description here

Here's the text description in the document, copied here for reference

Table 3 shows the alignment in bytes of data members of structures and classes. The compiler will insert unused bytes, as required, between members to obtain this alignment. The compiler will also insert unused bytes at the end of the structure so that the total size of the structure is a multiple of the alignment of the element that requires the highest alignment. Many compilers have options to change the default alignments. Differences in structure member alignment will cause incompatibility between different programs or modules accessing the same data and when data are stored in binary files. The programmer can avoid such compatibility problems by ordering the structure members so that no unused bytes need to be inserted. Likewise, the padding at the end of the structure may be specified explicitly by inserting dummy members of the required size. The size of the virtual table pointer, if any, must be taken into account (see chapter 11).

5 Stack alignment

The stack pointer must be aligned by the stack word size at all times. Some systems require a higher alignment. The Gnu compiler version 3.x and later for 32-bit Linux and Mac OS X makes the stack pointer aligned by 16 at every function call instruction. Consequently it can rely on ESP = 12 modulo 16 at every function entry. This alignment is not consistently implemented. It is specified in the Mac OS ABI, but nowhere else. The stack is not aligned when compiling with option -Os or -mpreferred-stack-boundary=2, but apparently the Gnu compiler erroneously relies on the stack being aligned by 16 despite these options. The Intel compiler (v. 9.1.038) for 32 bit Linux does not have the same alignment. (I have submitted bug reports to Gnu and Intel about this in 2006. In 2009 Intel added a -falign-stack= assume-16-byte option to ICC version 11.0 to fix the problem). The stack is aligned by 4 in 32-bit Windows. The 64 bit systems keep the stack aligned by 16. The stack word size is 8 bytes, but the stack must be aligned by 16 before any call instruction. Consequently, the value of the stack 10 pointer is always 8 modulo 16 at the entry of a procedure. A procedure must subtract an odd multiple of 8 from the stack pointer before any call instruction. A procedure can rely on these rules when storing XMM data that require 16-byte alignment. This applies to all 64 bit systems (Windows, Linux, BSD). Where at least one function parameter of type __m256 is transferred on the stack, Unix systems (32 and 64 bit) align the parameter by 32 and the called function can rely on the stack being aligned by 32 before the call (i.e. the stack pointer is 32 minus the word size modulo 32 at the function entry). This does not apply if the parameter is transferred in a register. Various methods for aligning the stack are described in Intel's application note AP 589 "Software Conventions for Streaming SIMD Extensions", "Data Alignment and Programming Issues for the Streaming SIMD Extensions with the Intel® C/C++ Compiler", and "IA-32 Intel ® Architecture Optimization Reference Manual".

And yet, I've seen double allocated on 4-bytes boundaries, not 8 (Specifically, on Visual C 32-bits, and checking address allocated on Stack, not inside a structure) — Cyan, Sep 04 '14 at 08:35
Thank you sir. Sorry for the late reply.Sir actually my basic doubt is why we need 8 byte alignment when we are accessing the data in chunks of 4 byte. Isn't then 4 byte alignment enough for 8 byte double; by doing so we could save memory and also fetch the double type data in exactly 2 cycles of clock which is same as if the double is 8 byte aligned, and we fetch it in two clock cycles. Waiting for your response sir. — Akhil, Sep 12 '14 at 14:55
@Cyan - probably there are different alignment rules based on different OS/architecture. Some 32bit architecture make everything 4 byte aligned and some keep it to 8byte align.Both way will work fine. But 4byte aligned code is only compatible with 32 bit architecture but 8byte aligned code is compatible with both 32 bit and 64 bit architecture. Thanks for all your responses. — Akhil, Sep 12 '14 at 17:34
This is a much more complete and documented answer than when I provided my first comment. It deserves +1. — Cyan, Sep 13 '14 at 09:43
@Akhil - You can have a 32-bit CPU with 32-bit integer registers and a 32-bit data path to memory and yet have a data cache and access the data cache with a 64-bit data path, so while access to DRAM might be in 4-byte chunks, access to the data cache is in 8-byte chunks. — amdn, Sep 13 '14 at 16:28
@Cyan actually I checked in my pc. It has 8byte aligned cache.I am using 32bit OS(ubuntu 14.04). I wrote a code containing the above structure mentioned in my question and found that it is 4byte aligned after compilation. Now my question is how could my c executable containing 4byte aligned structure get into 8byt aligned cache.Thank you in advance. — Akhil, Sep 14 '14 at 18:44
@Akhil : it's unexpected, because cache line size of PC (intel/AMD) are supposed to be 64 bytes wide (note : it is 64 bytes, not 64 bits). Furthermore, note that alignment is not really related to cache line. Alignment is really for the CPU to access data with limited effort. On modern PC CPU, it does not really make a difference, but on other systems, it can result in a crash. — Cyan, Sep 14 '14 at 23:05
@amdn actually I checked in my pc. It has 8byte aligned cache.I am using 32bit OS(ubuntu 14.04). I wrote a code containing the above structure mentioned in my question and found that it is 4byte aligned after compilation. Now my question is how could my c executable containing 4byte aligned structure get into 64byte aligned cache.Thank you in advance — Akhil, Sep 15 '14 at 02:15
@Cyan oh thank you,its 64byte aligned.I misunderstood it as 64bits.So, as per you,alignment is not for cache.So, is it elimination of possibility that my code would enter the cache? Also, why should even old PC crash when you have 32bit data access, and data is 4byte aligned.Will not it work fine in old PC too ? Thank you in advance. — Akhil, Sep 15 '14 at 02:25
@Akhil : PC have typically no problem with alignment. Alignment is an issue for older ARM CPU, older MIPS, some SUN CPU, some Sony/Toshiba/IBM CPU, etc. Well, basically, not your typical PC. — Cyan, Sep 15 '14 at 14:06
@Cyan so shall the 4byte aligned structure in my executable create trouble in cache since the double type variable is not naturally aligned. — Akhil, Sep 15 '14 at 19:27

score 1 · Answer 2 · answered Sep 04 '14 at 08:59

Your comment is valid, and you'll probably get the result you are looking for if, instead of using a struct, you simply lay down the variables as part of the local stack inside a function. Something along these lines :

void alignTest()
{
  char        c;
  double      d;
  int         s;
  printf("%x %x %x", (int)&c, (int)&d, (int)&s);
}

In this example, the compiler is free to make its optimal choices performance and memory wise. Heck, it can even re-order variables if it wishes. On this setup, I've already witnessed double on 4-bytes boundaries (not 8) using 32-bits compilers.

On the other hand, using a struct, you need to keep in mind that it is part of an interface contract. It's not just a matter of the compiler selecting whatever choice it feels better : if part of an API, this struct will be used by other programs, potentially using another compiler, or another version of the same compiler. It happens all the time : think DLL, wrapper from other languages (calling a C function from a Delphi or Python program) etc.

You can't have an interface element in a "random state", with different choices depending on compiler. In this case, the allocation rules regarding variables inside a struct are set in stone by the specification.

In this specification, variable order is always respected, and double are aligned on 8 bytes.

structure padding - what is the purpose of natural alignment?

2 Answers2

5 Stack alignment