
In my quest to understand how structure padding works in C on Linux x86, I read that aligned access is faster than misaligned access. I think I understand the reasons given for that, but they all seem to rest on the presupposition that the CPU can't directly access addresses that are not a multiple of the bus width. For instance, if a CPU with a 32-bit bus were instructed to read 4 bytes starting at address 2, it would first read 4 bytes starting at address 0 and mask off the first two bytes, then read another 4 bytes starting at address 4 and mask off the last two bytes, and finally combine the two results. If the 4 bytes were 4-byte aligned, it could read them in a single access.

So, my question is this: why is that presupposition true? Why can't the CPU directly access addresses that are not a multiple of the bus width?

Mehdi Charife
  • 1
    On modern x86, it's only slow if the access is split across two separate cache lines. Alignment makes that impossible. Inside a packed 64-byte struct, though, with the whole struct `alignas(64)` but some members misaligned, you'd still get the benefit of the byte-shifting hardware in x86 load/store units. Related: [Are there any modern CPUs where a cached byte store is actually slower than a word store?](https://stackoverflow.com/q/54217528) - the same situation applies to misaligned wider stores on non-x86 CPUs that support them at all. – Peter Cordes Nov 28 '22 at 23:47
  • 1
    Some ISAs choose not to allow misaligned loads/stores at all, so the hardware can be simpler: it never needs to stall the pipeline while it grabs both parts of a misaligned word, and it needs no hardware to do that combining. For example, Intel CPUs have a performance event `ld_blocks.no_sr` - *[The number of times that split load operations are temporarily blocked because all resources for handling the split accesses are in use]*. There are a limited number of split registers in the load ports for handling loads where a later cycle is needed to grab bytes from another cache line. – Peter Cordes Nov 28 '22 at 23:49
  • 2
    *Why can't the CPU read from addresses that are not a multiple of the bus width?* Well, the simple answer is, "Because it's easier that way!" Engineering is all about tradeoffs: cost, convenience, performance. It's easier to pull data from memory one word at a time, and easier still if it's always aligned. Mandating aligned access, and pushing the work of maintaining that alignment off onto the programmer and/or the compiler, makes the hardware smaller, simpler, and cheaper, which might — or might not! — be a tradeoff you're willing to make. – Steve Summit Nov 29 '22 at 00:08
  • 1
    There are actual wires going from the CPU to memory, possibly 32 wires for 32-bit systems (plus more for signaling/control). When you read 32 bits from memory, the memory hardware puts them on those 32 wires, and they get delivered to a register in the CPU on 32 wires. As long as you are reading from four-byte-aligned addresses (32-bit aligned), everything comes in on those wires straight to the registers. If you want to read from any other address, the memory still puts its data on those 32 wires, and the CPU has to shift the bytes around to put them in different places in the register. – Eric Postpischil Nov 29 '22 at 00:21
  • 1
    @EricPostpischil: Just to clarify, that's true for a typical simplistic design without cache. Useful to illustrate a point, or historically like 386DX (32-bit bus vs. 386SX 16-bit bus). But bus width doesn't have to equal register width, e.g. P5 Pentium is widely agreed to be a 32-bit CPU, but has a 64-bit bus. (And guarantees atomicity of 64-bit loads/stores, e.g. done with x87 `fild`/`fistp`). Loading from an address will either hit in cache or trigger a burst transfer over the external bus. There might be 32 "wires" between integer load/store execution units and L1d cache, though. – Peter Cordes Nov 29 '22 at 00:32
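
The cache-line point made in the comments above can be illustrated with a small check. This is only a sketch: the helper name `crosses_cache_line` is hypothetical, and the 64-byte line size is an assumption (typical on current x86, but not guaranteed).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes; typical on current x86, not guaranteed. */
#define CACHE_LINE 64u

/* Returns true if an access of `size` bytes starting at `addr` touches
 * two different cache lines, i.e. its first and last byte fall in
 * different 64-byte blocks. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0);
    return (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
}
```

A 4-byte access at offset 60 stays inside one line, while the same access at offset 62 straddles two. A naturally aligned access whose size is a power of 2 no larger than the line can never straddle.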

1 Answer


Technically nothing prevents you from making a machine that can address any address on the memory bus. It's just that everything is much simpler if you make the address a multiple of the bus size.

This means that, for example, to make a 32-bit memory, you can take four 8-bit memory chips and connect each one to a quarter of the data bus, all sharing the same address bus. When you send a single address, all 4 chips read or write their corresponding 8 bits to form a 32-bit word. Notice that to do that, you ignore the lowest 2 bits of the byte address, since each access transfers 32 bits, which essentially forces the address to be 32-bit aligned.

 Addr  | bus addr | CHIP0   CHIP1   CHIP2   CHIP3 | Value read
@0x00 => 0b000000    0x59    0x32    0xaa    0xba   0x5932aaba
@0x04 => 0b000001    0x84    0xff    0x51    0x32   0x84ff5132
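
The table can be modelled in a few lines of C. This is a minimal sketch, not a description of real hardware: the names `chips` and `read_word` are made up for illustration, and the byte order (CHIP0 supplies the most significant byte) simply follows the "Value read" column above.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CHIPS  4
#define CHIP_DEPTH 2   /* only the two rows shown in the table */

/* chips[c][row] is the byte that chip c stores at bus address `row`. */
static const uint8_t chips[NUM_CHIPS][CHIP_DEPTH] = {
    {0x59, 0x84},   /* CHIP0 */
    {0x32, 0xff},   /* CHIP1 */
    {0xaa, 0x51},   /* CHIP2 */
    {0xba, 0x32},   /* CHIP3 */
};

/* One bus access: the low 2 bits of the byte address are dropped, and
 * every chip answers with the byte it holds at the same bus address. */
static uint32_t read_word(uint32_t byte_addr)
{
    uint32_t bus_addr = byte_addr >> 2;
    assert(bus_addr < CHIP_DEPTH);

    uint32_t word = 0;
    for (int c = 0; c < NUM_CHIPS; c++)
        word = (word << 8) | chips[c][bus_addr];
    return word;
}
```

Because the low address bits never reach the chips, `read_word(0x02)` returns the same word as `read_word(0x00)`: the hardware has no way to express "start two bytes in".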

To access non-aligned words in such a configuration, you would need to send two different bus addresses to the 4 chips, since two chips might hold part of your value at bus address 0 and the other two at bus address 1. That means you either need several address buses or have to make multiple accesses anyway.
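
The "multiple accesses" option can be sketched in C as follows. The names `words`, `read_aligned`, and `read_unaligned` are hypothetical, the memory is just the two words from the table, and the most-significant-byte-first order follows the table rather than x86's little-endian layout.

```c
#include <assert.h>
#include <stdint.h>

/* The two aligned 32-bit words from the table; byte address 0 holds
 * the most significant byte of words[0]. */
static const uint32_t words[2] = { 0x5932aabaU, 0x84ff5132U };

/* A single aligned bus access: the low 2 address bits are ignored. */
static uint32_t read_aligned(uint32_t byte_addr)
{
    uint32_t bus_addr = byte_addr >> 2;
    assert(bus_addr < 2);
    return words[bus_addr];
}

/* A 4-byte read at any byte address, built from aligned accesses. */
static uint32_t read_unaligned(uint32_t byte_addr)
{
    uint32_t offset = byte_addr & 3u;          /* bytes past the boundary */
    uint32_t first  = read_aligned(byte_addr);
    if (offset == 0)
        return first;                          /* aligned: one access */

    uint32_t second = read_aligned(byte_addr + 4);
    unsigned shift  = 8 * offset;
    /* Keep the tail of the first word and the head of the second. */
    return (first << shift) | (second >> (32 - shift));
}
```

Reading at address 2 takes two accesses and yields bytes 2 to 5, i.e. `0xaaba84ff`, whereas reading at address 0 or 4 needs only one access.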

Note that modern DRAM is more complex: accesses are made in whole cache lines, which are much bigger than the bus width.

Overall, in most memory use cases, rounding accesses to an aligned size makes things simpler.

ElderBug
  • The same principle applies to modern DRAM, though: the access units are a naturally-aligned power-of-2, e.g. a 64-byte burst transfer. (https://en.wikipedia.org/wiki/DDR4_SDRAM). So as you say, addresses can still be broken up by ignoring lower bits. Once the CPU figures out which channel of which memory controller should be doing the access, and maps the CPU-physical address to an address within that DIMM, it can send the address (in row/column ~halves) and get back 64 bytes of data for that aligned chunk of byte addresses. – Peter Cordes Nov 29 '22 at 00:39
  • @ElderBug, just to be sure, the bus here addresses each 32 bits of memory instead of each byte? – Mehdi Charife Dec 03 '22 at 20:06
  • 1
    @MehdiCharife In this configuration, yes. If all 4 chips are controlled simultaneously and each one gives 8 bits, the smallest transfer size is 32 bits. Since memory is logically addressed per byte (address 0 is the first byte, address 1 is the second byte), each accessed address must be a multiple of 4 bytes to accommodate the transfer size. – ElderBug Dec 03 '22 at 20:50
  • @ElderBug, Thanks. Could you recommend some sources where I can read about configurations that might implement things differently? – Mehdi Charife Dec 03 '22 at 21:00
  • 1
    @MehdiCharife I don't have any specific source in mind. You can find explanations about specific memory/bus architectures if you know what you are looking for. You can try learning about memory controllers. You might have more luck with older or very small systems. An 8-bit bus is more likely to not have any restriction on the address you access. Also, you have more exotic things in microcontrollers. The 8-bit PIC MCUs for example have 12-bit or 14-bit ROM, which can only be accessed by word (address 0 is first 14-bit, address 1 is second 14-bit...). – ElderBug Dec 03 '22 at 21:25