Why "any primitive object of K bytes must have an address that is a multiple of K"?

Question

Computer Systems: a Programmer's Perspective says

The x86-64 hardware will work correctly regardless of the alignment of data. However, Intel recommends that data be aligned to improve memory system performance. Their alignment rule is based on the principle that any primitive object of K bytes must have an address that is a multiple of K. We can see that this rule leads to the following alignments:
K Types
1 char
2 short
4 int, float
8 long, double, char *

Why is it that "any primitive object of K bytes must have an address that is a multiple of K"?

How is "aligned" defined or what does it mean?

On an x86-64 machine,

If an object has K bytes (such as K=2 (e.g. short) or K=4 (e.g. int or float)), "any primitive object of K bytes must have an address that is a multiple of K" means that such an object must start at an address that is a multiple of K. But isn't the object aligned as long as its storage falls completely between two consecutive multiples of 8, which is a less strict requirement than the address being a multiple of K?
If the K of an object is smaller than 8 but not equal to 1, 2 or 4, does "any primitive object of K bytes must have an address that is a multiple of K" still apply? For example if K=3,5,6, or 7?

On an x86 machine, which has 32-bit addresses,

what is the alignment rule, and does "any primitive object of K bytes must have an address that is a multiple of K" still apply?

Thanks.

You think an `int` is aligned if it is stored as (for simplicity) address 0x3, because it would encompass 0x3 through 0x6 inclusive, which doesn't cross 0x0 or 0x8? How is that aligned? What do you think would happen in such a scenario if you had an array of said `int`s. Now the second one is unaligned (by your definition) because it straddles 0x7 and 0x8. A definition of alignment that allows half of an array's indices to be aligned, and the other half unaligned, with no weird hijinks involved in producing it, is a strange definition. — ShadowRanger, Oct 27 '18 at 02:23
@ShadowRanger Thanks. (1) What if the K of an object is smaller than 8 but not equal to 2 or 4? For example if K=6, 5? (2) On a X86 machine, what is the alignment rule, and Does "any primitive object of K bytes must have an address that is a multiple of K" still apply? — , Oct 27 '18 at 02:26
@Ben: To #2, x86 is perfectly happy to access unaligned memory, it's just going to be slower. To my knowledge, the only time alignment is *required* on x86 is for hardware atomics support and the like, otherwise it's just a really good idea if you don't like needlessly slow code. But yes, aside from unaligned access being merely undesirable, not fatal, x86 follows the same rules on what constitutes "aligned" data. — ShadowRanger, Oct 27 '18 at 02:48
@ShadowRanger On x86, is a double (K=8 bytes) object aligned if and only if its address is multiple of 4 bytes (size of an address), or 8 bytes (according to the same rule "any primitive object of K bytes must have an address that is a multiple of K" for x86-64)? — , Oct 27 '18 at 03:12
@Ben: Why would a `double`'s alignment be determined by size of pointers? You seem to keep looking for exceptions to the rule you're quoting. There are no exceptions here; if it doesn't adhere to that rule, it's unaligned. — ShadowRanger, Oct 27 '18 at 03:19
@ShadowRanger Some SIMD instructions require aligned data, too. — Raymond Chen, Oct 27 '18 at 03:44
@RaymondChen: I can't hand wave that away with my "and the like" throwaway? :-) I was skipping SIMD stuff because standard C doesn't usually expose them. I'm assuming (might be wrong here) to use them, the compiler either has to have unsafe optimizations enabled, has to compile two code paths with a check for alignment to use the SIMD code path, or has to be able to have the allocation sufficiently "close" to point of use to guarantee alignment. Only the first of those options could actually cause misbehavior, and turning on unsafe optimizations is dangerous in a sort of self-explanatory way. — ShadowRanger, Oct 27 '18 at 03:57
If a two- or four-byte object begins at an address that is one modulo eight, then a processor with an eight-byte wide bus can load or store all of its data in one bus transfer—but that does not mean the processor contains the wires and switches needed to shift that data by one byte, rather than zero or two, while transferring it from the bus to a register. Presumably Intel perceives little value in adding those wires to the design, along with their space and energy requirements. So the alignment requirement is for the supported multiples, not for any address between eight-byte boundaries. — Eric Postpischil, Oct 27 '18 at 08:59
The above is complicated by the fact that unaligned access may be supported by the hardware. But this is at some cost, in consumption of time or resources (more parts of the processor may be used to effect the access). So the necessary wires and switches are present. But causing a transfer to use them is undesired due to resource consumption, so the recommended alignment is preferred. — Eric Postpischil, Oct 27 '18 at 09:03

Antti Haapala -- Слава Україні · Answer 1 · 2018-10-30T07:18:06.627

Since this was tagged C as well, do note that not only does the architecture make these decisions, but so do compilers. The C compiler often has its own alignment rules that mostly follow either the required or the preferred alignment of the architecture, especially when optimizing for speed. And the compiler's requirements are what you need to worry about most of the time, not the architecture's.

Even if the processor supports unaligned accesses, it might have a preferred alignment for multibyte objects that the C compiler can exploit. For example, a compiler is allowed to assume that any int will reside at, and therefore any int * pointer will always point to, an address divisible by 4.
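What "an address divisible by 4" means for a pointer can be sketched roughly as follows. The helper name is_aligned is hypothetical, not a standard function, and the sketch assumes that converting a pointer to uintptr_t yields the numeric address, which is the case on mainstream x86 compilers but is implementation-defined in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if p is a multiple of `alignment`.
 * `alignment` must be a power of two, as all C alignments are.
 * Assumption: (uintptr_t)p is the numeric address of p. */
static bool is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

For any properly declared int x, is_aligned(&x, _Alignof(int)) holds; a pointer computed by adding 3 to an 8-byte-aligned byte buffer would fail the check for an alignment of 4.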

Now there are people who say that since x86-64 supports unaligned access, they can make an int * pointer that points to an address not divisible by 4 and things will work fine.

They're wrong.

There are some instructions in the x86-64 instruction set that require alignment. That is, "will work correctly regardless of alignment" means that these instructions, too, work "correctly, according to the specification, when given an unaligned access": they raise an exception that would kill your process. The reason for having these is that they can be much faster and require less silicon to implement than the versions that can deal with unaligned data.

And the compiler knows exactly when it is allowed to use these instructions! Whenever it sees an int * being dereferenced it knows that it can use an instruction that requires the operand be aligned at 4 bytes, should it be more effective.
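If you really do need to read an int-sized value from an arbitrary byte address, one commonly used way to stay within the compiler's rules is to copy the bytes with memcpy instead of dereferencing a misaligned int *. The following is only a sketch; the helper name load_u32 is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from any byte address.
 * memcpy places no alignment requirement on src, so the compiler
 * cannot assume the source is 4-byte aligned. The value is read in
 * the machine's native byte order. */
static uint32_t load_u32(const void *src)
{
    uint32_t value;
    assert(src != NULL);
    memcpy(&value, src, sizeof value);
    return value;
}
```

Compilers targeting x86-64 typically turn such a fixed-size memcpy into a single ordinary load, so this usually costs nothing compared with the cast.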

See this question for a case where the OP ran into a problem with C code that "should have been fine on x86-64 anyway": C undefined behavior. Strict aliasing rule, or incorrect alignment?

As for x86-32, the alignment requirement for doubles is generally 4 in C compilers, because doubles need to be passed on the stack and the stack grows in 4-byte, not 8-byte, increments.

And finally:

If the K of an object is smaller than 8 but not equal to 1, 2 or 4, does "any primitive object of K bytes must have an address that is a multiple of K" still apply? For example if K=3,5,6, or 7?

There are no primitive objects with K ∈ {3, 5, 6, 7} in x86.

The C standard's stance is that an alignment can only be a power of 2, and there are no gaps in arrays. Therefore an object with such a size would need to be padded upwards to its alignment requirement, or its alignment requirement must be 1.
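A small sketch of how this plays out for objects whose members add up to an odd size. The two struct types are invented for illustration. On typical x86-64 compilers struct rgb has size 3 with alignment 1 and struct tagged is padded from 5 to 8 bytes, but only the relations checked in the code are guaranteed, not those exact numbers.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative types, not taken from any real API. */
struct rgb    { unsigned char r, g, b; }; /* members total 3 bytes */
struct tagged { int value; char tag; };   /* members total sizeof(int) + 1 */

/* Arrays have no gaps, so every element of an array must be aligned;
 * that forces sizeof(T) to be a multiple of the alignment of T. */
static_assert(sizeof(struct rgb) % _Alignof(struct rgb) == 0,
              "size must be a multiple of alignment");
static_assert(sizeof(struct tagged) % _Alignof(struct tagged) == 0,
              "size must be a multiple of alignment");

/* Hypothetical helper: alignments are always powers of two. */
static int is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}
```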

score 0 · Answer 2 · answered Oct 27 '18 at 16:30

The rules are different on each processor model. I will discuss one hypothetical example. We may have a processor with an eight-byte interface to the bus. Given some address X, the processor can load eight bytes from that address by requesting the memory to deliver eight bytes from its unit of storage numbered X/8. That is, the memory does not have any way to address individual bytes. The processor can only request data at a certain address that is a multiple of eight, and the memory will send the entire eight bytes at that address. (Keep in mind this is a hypothetical example to illustrate basic principles. Also, I am ignoring cache. Cache helps mask some of the effects of alignment issues, because the misalignments can be largely managed in level-one cache inside the processor. But handling this still requires extra hardware, as discussed below.)

Suppose we want the four-byte object that is in bytes 7, 8, 9, and 10. To get this, the processor has to request unit 0 from memory, which supplies bytes 0 through 7, and it has to request unit 1, which supplies bytes 8 through 15. So, already, there is a performance problem: We had to use two bus transfers to get this word that is only half the size of one transfer. That is inefficient, and the bus can only do half as many of these double transfers as it can if we loaded only aligned data requiring single transfers.
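The transfer count for this hypothetical eight-byte bus can be written down directly. The function below is a sketch of the arithmetic only; the names UNIT_BYTES and transfers_needed are made up for this example and do not describe any real processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNIT_BYTES 8u /* width of the hypothetical bus, in bytes */

/* Number of eight-byte memory units that an access of `size` bytes
 * starting at byte address `addr` touches, i.e. how many bus
 * transfers the hypothetical processor needs for it. */
static unsigned transfers_needed(uint64_t addr, size_t size)
{
    assert(size > 0);
    uint64_t first_unit = addr / UNIT_BYTES;
    uint64_t last_unit  = (addr + size - 1) / UNIT_BYTES;
    return (unsigned)(last_unit - first_unit + 1);
}
```

For the object in bytes 7 through 10, transfers_needed(7, 4) is 2 (units 0 and 1), while the aligned case transfers_needed(4, 4) is 1.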

Continuing, the processor has all the bytes it needs, 0 through 15, so it extracts bytes 7 through 10, which make up the object we want. To do this, though, it has to shift the bytes to put them into a register. Ideally, if nobody did any “unaligned” loads, four-byte objects would come in from the bus only at offsets 0 and 4 in the eight-byte transfers, and the processor only needs to have wires going from those offsets to the register destinations.

However, our processor supports unaligned loads, so it has additional switches and wires so the data can be shunted down a different path, where it will be shifted by three bytes. Keep in mind, the data from both transfers has to be shifted by three bytes and then spliced together. So a lot of extra wires and switches are needed. Two eight-byte transfers is 128 bits, so there are hundreds of extra connections involved in this.

Well, fine, the processor has these wires and switches, why not use them? To make this processor fast, it supports multiple loads and stores in progress simultaneously. As soon as the bus transfers one piece of data, we want to be getting another from the bus, while the data from the first is still on its way to a register. So there are actually multiple parts of the processor moving data around for several loads. Since we expect unaligned loads to be rare, maybe only one of the parts for handling loads has the extra components to handle unaligned loads. The others all handle aligned loads. So, if you have just one unaligned load occasionally, the processor sends it to that part, and the performance effect is unnoticeable. However, if you do many unaligned loads in a row, they all have to go through the one part, so they end up waiting in a queue instead of running in parallel, and performance decreases.

That is just for loads. When you store that four-byte object, there is no way to write just bytes 7 through 10. Since the bus and the memory only work in eight-byte units, we need to write units 0 and 1, which also contain bytes 0 through 6 and bytes 11 through 15. To implement the store, the processor must:

Load memory unit 0, providing bytes 0 through 7.
Load memory unit 1, providing bytes 8 through 15.
Move the first byte of the four-byte object into byte 7.
Move the last three bytes of the object into bytes 8 through 10.
Store the changed memory unit 0.
Store the changed memory unit 1.

Again, that is twice as much work as it would be with an aligned object (load one memory unit, move the bytes in, store the unit). And, besides the time of the operations, you are occupying more resources inside the processor—it has to use two internal registers to hold the data from memory temporarily while it is merging the changes.

Actually, it is more than twice the work and resources, because it also requires extra wires and switches to shift the bytes by non-standard amounts.
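The six store steps listed above can be simulated in a few lines of C. This is a toy model of the hypothetical eight-byte-unit memory only, not a description of how any real x86 processor works; every name in it (memory, load_unit, store_unit, store4, unit_transfers) is invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UNIT_BYTES 8
#define NUM_UNITS  4

/* Toy memory that can only be read or written one whole unit at a time. */
static unsigned char memory[NUM_UNITS][UNIT_BYTES];
static unsigned unit_transfers; /* counts simulated bus transfers */

static void load_unit(size_t unit, unsigned char out[UNIT_BYTES])
{
    memcpy(out, memory[unit], UNIT_BYTES);
    unit_transfers++;
}

static void store_unit(size_t unit, const unsigned char in[UNIT_BYTES])
{
    memcpy(memory[unit], in, UNIT_BYTES);
    unit_transfers++;
}

/* Store a four-byte object at byte address addr by read-modify-write
 * of every unit it touches (one unit if aligned, two if it straddles). */
static void store4(size_t addr, const unsigned char obj[4])
{
    assert(addr + 4 <= (size_t)NUM_UNITS * UNIT_BYTES);
    size_t first = addr / UNIT_BYTES;
    size_t last  = (addr + 3) / UNIT_BYTES;
    unsigned char buf[2][UNIT_BYTES]; /* the temporary internal registers */

    for (size_t u = first; u <= last; u++)   /* load the touched units */
        load_unit(u, buf[u - first]);
    for (size_t i = 0; i < 4; i++) {         /* merge in the new bytes */
        size_t a = addr + i;
        buf[a / UNIT_BYTES - first][a % UNIT_BYTES] = obj[i];
    }
    for (size_t u = first; u <= last; u++)   /* write the units back */
        store_unit(u, buf[u - first]);
}
```

Calling store4(7, obj) performs four unit transfers (two loads and two stores), while store4(4, obj) performs two, which matches the "twice as much work" count. The model does not capture the extra shifting hardware.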

score 0 · Answer 3 · answered Oct 30 '18 at 05:53

The processor bus, which is the medium used to access memory, is normally as wide as the processor's word size in bits. This means a 32-bit processor normally accesses memory in 32-bit chunks, so only one memory read access is necessary to read a word of data from memory.

Addresses, by contrast, are byte-oriented, so a double (8 bytes) normally occupies eight contiguous byte addresses. To access a single eight-byte datum with only one bus request, the datum must begin at the start of an eight-byte word and end before the next one begins. On old processors this was mandatory: a memory access that was not aligned raised an exception. Current processors don't have this restriction, but beware that if you have, for example, a double at an address that is not a multiple of eight, the processor will need to make two bus accesses (with the overhead that this implies) to get the data from memory.

For this reason the processor vendor warns you about data alignment: if all the data is unaligned, a piece of code can take twice as long to execute, or even longer, than it would with properly aligned data.

Modern processors have several levels of cache, which are filled from main memory in chunks of one cache line (64 bytes or even more), so this is largely not an issue. It is still a good idea to keep data aligned, in case you need to run your code on a less advanced processor.
