The rules are different on each processor model. I will discuss one hypothetical example. We may have a processor with an eight-byte interface to the bus. Given some address X, the processor can load eight bytes from that address by requesting the memory to deliver eight bytes from its unit of storage numbered X/8. That is, the memory does not have any way to address individual bytes. The processor can only request data at a certain address that is a multiple of eight, and the memory will send the entire eight bytes at that address. (Keep in mind this is a hypothetical example to illustrate basic principles. Also, I am ignoring cache. Cache helps mask some of the effects of alignment issues, because the misalignments can be largely managed in level-one cache inside the processor. But handling this still requires extra hardware, as discussed below.)
Suppose we want the four-byte object that is in bytes 7, 8, 9, and 10. To get this, the processor has to request unit 0 from memory, which supplies bytes 0 through 7, and it has to request unit 1, which supplies bytes 8 through 15. So, already, there is a performance problem: We had to use two bus transfers to get this word that is only half the size of one transfer. That is inefficient, and the bus can only do half as many of these double transfers as it can if we loaded only aligned data requiring single transfers.
Continuing, the processor has all the bytes it needs, 0 through 15, so it extracts bytes 7 through 10, which make up the object we want. To do this, though, it has to shift the bytes to put them into a register. Ideally, if nobody did any “unaligned” loads, four-byte objects would come in from the bus only at offsets 0 and 4 in the eight-byte transfers, and the processor only needs to have wires gong from those offsets to the register destinations.
However, our processor supports unaligned loads, so it has additional switches and wires so the data can be shunted down a different path, where it will be shifted by three bytes. Keep in mind, the data from both transfers has to be shifted by three bytes and then spliced together. So a lot of extra wires and switches are needed. Two eight-byte transfers is 128 bits, so there are hundreds of extra connections involved in this.
Well, fine, the processor has these wires and switches, why not use them? To make this processor fast, it supports multiple loads and stores in progress simultaneously. As soon as the bus transfers one piece of data, we want to be getting another from the bus, while the data from the first is still on its way to a register. So there are actually multiple parts of the processor moving data around for several loads. Since we expect unaligned loads to be rare, maybe only one of the parts for handling loads has the extra components to handle unaligned loads. The others all handle aligned loads. So, if you have just one unaligned load occasionally, the processor sends it to that part, and the performance effect is unnoticeable. However, if you do many unaligned loads in a row, they all have to go through the one part, so they end up waiting in a queue instead of running in parallel, and performance decreases.
That is just for loads. When you store that four-byte object, there is no way to write just bytes 7 through 10. Since the bus and the memory only work in eight-byte units, we need to write units 0 and 1, which also contains bytes 0 through 6 and bytes 11 to 15. To implement the store, the processor must:
- Load memory unit 0, providing bytes 0 through 7.
- Load memory unit 1, providing bytes 8 through 15.
- Move the first byte of the four-byte object into byte 7.
- Move the last three bytes of the object into bytes 8 through 10.
- Store the changed memory unit 0.
- Store the changed memory unit 1.
Again, that is twice as much work as it would be with an aligned object (load one memory unit, move the bytes in, store the unit). And, besides the time of the operations, you are occupying more resources inside the processor—it has to use two internal registers to hold the data from memory temporarily while it is merging the changes.
Actually, it is more than twice the work and resources, because it also requires extra wires and switches to shift the bytes by non-standard amounts.