-2

I am trying to learn how memory is arranged and handled by a computer, and I don't understand the concept of alignment.

For instance, in a 32-bit architecture, why do we say that a short (2 bytes) is unaligned when it is not located at an even address, even if it fits entirely within a single 32-bit word?

Suppose the processor reads memory 32 bits at a time, and a char is at address 0x00, followed by a short (addresses 0x01 and 0x02), followed by another char (0x03). Then there is no problem: no data is split, since the processor reads all 4 bytes at once.

So the short is aligned, isn't it?

Peter Cordes
Progear
  • 2
    What happens if the short is not in bytes 1 and 2 but is in bytes 3 and 4? – Eric Postpischil Feb 09 '20 at 00:50
  • There will be a problem, because in that case the processor will have to read twice to get the information. But in the case I described there is no misalignment. – Progear Feb 09 '20 at 00:54
  • Basically a duplicate of the followup [how does the processor read memory?](//stackoverflow.com/q/60133064) – Peter Cordes Feb 09 '20 at 06:59

2 Answers

4

The question suggests a processor that has 32 wires connected to a bus, for data, with possibly other wires for control. When it wants data from memory, it puts an address on the bus, requests a read from memory, waits for the data, and reads it through those 32 wires.

In typical processor designs, those 32 wires are connected to some temporary internal register which itself has connections to other registers. It is easy to move those 32 bits around as a block, with each bit going on its own wire.

If we want to move some of the bits within the 32, we need to shift them. This might be done with various hardware, such as a shifting unit that we put bits into, request a certain amount of shift, and read a result from. Internally, that shifting unit will have a variety of connections and switches to do its job.

Typically, such a shifting unit will be able to move eight bits from any of four positions (starting at bits 0, 8, 16, or 24) to the base position (0). That way, an instruction such as “load byte” can be effected by reading 32 bits from memory (because it only comes in 32-bit chunks), then using the shifting unit to get the desired byte. That shifting unit might not have the wires and switches needed to move any arbitrary set of bits (say, starting at 7, 13, or 22) to the base position. That would take many more wires and switches.

The processor also needs to be able to effect a load-16-bits instruction. For that, the shifting unit will be able to move 16 bits from positions 0 or 16 to position 0. Certainly the engineers could design it to also move 16 bits from position 8 to position 0. But that requires more wires and switches, which cost money, silicon, and energy. In many processors, a decision was made that this expense was not worthwhile, so the capability is not implemented.

In consequence, the hardware simply cannot shift data from bytes 1 and 2 to bytes 0 and 1 in the course of the loading process. (There might be other shifters in the processor, such as in a general-purpose logic unit for implementing shift instructions, but those are generally separate and accessed through instruction dispatching and control mechanisms. They are not in the line of components used in loading from memory.)
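
As a rough illustration of the load path described above, here is a minimal sketch in C. The function names `load_byte_lane` and `load_half_lane` are made up for this example, and it assumes that byte 0 sits in bits 0 to 7 of the word. It models a shifter that is wired for bytes at bit positions 0, 8, 16, and 24, but for 16-bit values only at positions 0 and 16.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the shifter in the load path: it can move a
   byte from bit position 0, 8, 16 or 24 down to bit position 0. */
static uint32_t load_byte_lane(uint32_t word, unsigned lane)
{
    assert(lane < 4);                      /* only four byte positions */
    return (word >> (8u * lane)) & 0xFFu;
}

/* 16-bit loads: this model only has "wires" for bit positions 0 and 16
   (byte offsets 0 and 2). Returns 1 and stores the halfword on success,
   0 if the requested position is not supported by the shifter. */
static int load_half_lane(uint32_t word, unsigned byte_offset, uint32_t *out)
{
    assert(byte_offset < 4);
    if (byte_offset != 0 && byte_offset != 2)
        return 0;                          /* e.g. bytes 1 and 2: not wired */
    *out = (word >> (8u * byte_offset)) & 0xFFFFu;
    return 1;
}
```

In this model, asking for the halfword at byte offset 1 fails even though both bytes are inside the 32-bit word that was read, which is the situation from the question.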

Eric Postpischil
  • Thank you for this explanation. If I understood correctly, the processor will take these 32 bits and interpret them as a single value. But does that mean that a char in memory will take 8 bytes of space? – Progear Feb 09 '20 at 01:08
  • 1
    Many processors allow unaligned access with a performance penalty. – 0___________ Feb 09 '20 at 01:09
  • 1
    @Progear No, one byte reads or writes are by definition always aligned. – 0___________ Feb 09 '20 at 01:12
  • Why? If my processor takes 32 bits at a time, it will accidentally pick up the other chars and get a bad value, won't it? – Progear Feb 09 '20 at 01:22
  • 1
    @Progear: When the processor is reading a byte, it will read the 32-bit word the byte is in. Those 32 bits go from the bus into a shift unit that is in line with the load unit. The desired 8 bits come out of the shift unit and go into the general processor register (or equivalent) that they are being loaded into. The processor may be designed so that this has no performance penalty compared to reading a whole 32-bit word; the shifting unit might always be an available part of the load path. – Eric Postpischil Feb 09 '20 at 01:28
  • Okay, but why not do that with the short that follows the char? – Progear Feb 09 '20 at 01:50
  • @Progear: It requires more wires and switches in the shift unit, and they were not put in because they cost silicon, time, and/or energy. – Eric Postpischil Feb 09 '20 at 02:11
2

Alignment is a definition. Assume 8-bit bytes and byte-addressable memory. An 8-bit byte (unsigned char) cannot be unaligned. For a 16-bit halfword to be aligned, the least significant bit of its address must be zero. For a 32-bit word the lower two address bits must be zero, for a 64-bit doubleword the lower three bits, and so on. So if your 16-bit unsigned short is at an odd address, it is unaligned.
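
That definition can be written down directly. The following is a small sketch in C; the helper name `is_aligned` is invented for this example, and it assumes the size is a power of two, so that "low bits are zero" and "address is a multiple of the size" mean the same thing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if addr is aligned for an object of the given size.
   size must be a power of two (1, 2, 4, 8, ...), so the check is
   simply "are the low log2(size) address bits all zero?". */
static int is_aligned(uintptr_t addr, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (size - 1)) == 0;
}
```

With this check, `is_aligned(0x1001, 1)` is true (a byte is always aligned), `is_aligned(0x1001, 2)` is false, and `is_aligned(0x1002, 4)` is false because bit 1 of the address is set.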

A "32-bit system" does not mean a 32-bit bus. Bus widths do not necessarily match the size of the processor registers, the instruction size, or anything else, so there is no reason to make that assumption. That said, if you are talking about MIPS or ARM, then yes, the buses are most likely 32 or 64 bits wide for their 32-bit-register processors, and 64 or perhaps 128 (likely 64) for their 64-bit processors. But an x86 has 8-bit instructions, with 8-, 16-, 32-, and 64-bit registers, and instructions of variable length once you add up all the bytes one can take. There is no single way to classify its size: is it an 8-bit processor because of its 8-bit instructions, 32 or 64 because of its larger registers, or 128, 256, 512, etc. because of its bus widths?

You mentioned 32, so let's stick with that. I want to walk through an array of bytes, and I want to do writes. I have a 32-bit-wide data bus, one of the typical designs you see today. Let's say the other side is a cache built of 32-bit-wide SRAMs to line up with the processor-side bus; we won't worry about how the DRAM is implemented on the far side. So you will likely have a write data bus, a read data bus, and either separate write-address and read-address buses or one address bus with a way to indicate whether the transaction is a read or a write.

As far as the bus is concerned, all transactions are 32 bits. You don't necessarily expect the unused byte lanes to float (high-Z); you expect them to be driven high or low on valid clocks on that bus. (Between valid clock cycles, sure, the bus may go high-Z.)

A read transaction will typically use, and let's assume it does use, an address aligned to the bus width, so a 32-bit-aligned address (either on the bus or on the far side). There isn't usually a notion of byte-lane enables on a read; the processor internally isolates the bytes of interest and discards the others. Some buses have a length field on the address bus where it makes sense, plus cache control signals and other signals.

An aligned 32-bit read would be at, say, address 0x1000 or 0x1004, with a length of 0 (n-1). The address bus does its handshake with a unique transaction ID; later, on the read data bus, ideally a single clock cycle will carry those 32 bits of data with that ID. The processor sees that, completes the transaction (there might be more handshaking), extracts all 4 bytes, and does what the instruction said to do with them.

A 64-bit access aligned on a 32-bit boundary would have a length of one: one address bus handshake, then two clock cycles' worth of data on the read data bus. A 16-bit aligned transaction at 0x1000 or 0x1002 will, let's say, be a read of 0x1000, and the processor will discard either lanes 0 and 1 or lanes 2 and 3. Some bus designs align the bytes on the lower lanes, so you might see a bus where the two bytes always come back on lanes 0 and 1 for a 16-bit read.

An unaligned 32-bit read would take two bus cycles, with twice the overhead and twice the number of clocks. A 32-bit read at 0x1002 is one read at 0x1000, from which the processor saves two of the bytes, then a read at 0x1004, from which the processor saves two more bytes. It combines them into the 32-bit number and then does what the instruction says. So instead of 5 or 8 clocks, or whatever the minimum is for this bus, it now takes twice as many, and the two reads are likely not interleaved but back to back.
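
The split-and-combine step can be sketched in C. This is only a model under stated assumptions: memory is a plain array of 32-bit words where word `i` covers byte addresses `4*i` to `4*i+3`, byte lanes are little-endian, addresses are relative to the start of the array rather than 0x1000, and the name `read32_any` is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 32-bit little-endian value at any byte address using only
   aligned 32-bit word reads. *bus_reads reports how many aligned
   reads were needed (1 if aligned, 2 if the value straddles words). */
static uint32_t read32_any(const uint32_t *mem, uint32_t addr,
                           unsigned *bus_reads)
{
    assert(mem && bus_reads);
    uint32_t index = addr / 4;           /* which aligned word */
    unsigned shift = 8u * (addr % 4);    /* bit offset inside that word */

    uint32_t lo = mem[index];            /* first aligned read */
    *bus_reads = 1;
    if (shift == 0)
        return lo;                       /* aligned: done in one read */

    uint32_t hi = mem[index + 1];        /* second aligned read */
    *bus_reads = 2;
    /* Keep the upper bytes of the first word and the lower bytes of
       the second word, then merge them into one 32-bit value. */
    return (lo >> shift) | (uint32_t)(hi << (32u - shift));
}
```

For an offset of 2 into the array, the function keeps two bytes from each word, which mirrors the 0x1002 case above.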

An unaligned 16-bit read at address 0x1001 would hopefully be a single 32-bit read, but an unaligned 16-bit read at address 0x1003 is two transactions, one at 0x1000 and one at 0x1004, saving one byte from each, so it takes twice the clocks and twice the overhead.
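
Whether an access needs one bus read or two comes down to how many 32-bit words it touches. A minimal sketch in C, assuming a 32-bit-wide bus and accesses of 1 to 4 bytes; `words_touched` is a name invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Number of aligned 32-bit words covered by an access of `size`
   bytes (1..4) starting at byte address `addr`. */
static unsigned words_touched(uint32_t addr, uint32_t size)
{
    assert(size >= 1 && size <= 4);
    uint32_t first = addr / 4;               /* word holding the first byte */
    uint32_t last  = (addr + size - 1) / 4;  /* word holding the last byte */
    return (unsigned)(last - first + 1);
}
```

This gives 1 for a 16-bit access at 0x1001 and 2 for a 16-bit access at 0x1003 or a 32-bit access at 0x1002, matching the cases above.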

Writes are the same but with an additional penalty. An aligned 32-bit write, say at 0x1000, is one bus transaction: address, write data, done. The cache, being 32 bits wide in this example, could simply write those 32 bits to SRAM in one SRAM transaction. An unaligned 32-bit write, say at 0x1001, would be two complete bus transactions as expected, taking twice the number of bus clocks, but the SRAM will take extra clocks as well (two or more), because you need to read-modify-write the SRAM; you can't just write. In order to write the bytes at 0x1001 to 0x1003, you need to read 32 bits from SRAM, change three of those bytes without changing the lowest one, and write that back. Then, when the other transaction comes in, you write the 0x1004 byte while preserving the other three in that SRAM location.

Every byte write is a single bus transaction, but every one also incurs the read-modify-write. Note that, depending on how many clocks the bus takes and how many transactions you can have in flight at a time, the read-modify-write of the SRAM might be invisible: you might not be able to get data to the cache fast enough for a bus transaction to have to wait on the SRAM read-modify-write. But in another similar question (this has been asked many times here), there is a platform where this was demonstrated.

So you can now tell me how the 16-bit write transactions are going to go: they also incur the read-modify-write at the cache for every one of them, and if the address is, say, 0x1003, then you get two bus transactions and two read-modify-writes.
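
To make the write cases concrete, here is a toy model in C of a cache built from a 32-bit-wide SRAM. Everything in it is an assumption for illustration: the `struct sram` layout, the `cache_write` name, little-endian byte lanes, and small addresses relative to the start of the array instead of 0x1000. It counts how many SRAM reads and writes each store causes.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 32-bit-wide SRAM with access counters (hypothetical model). */
struct sram {
    uint32_t words[8];
    unsigned reads;   /* SRAM word reads performed  */
    unsigned writes;  /* SRAM word writes performed */
};

/* Store the low `size` bytes (1, 2 or 4) of `value` at byte address
   `addr`, little-endian, one 32-bit word at a time. */
static void cache_write(struct sram *s, uint32_t addr, uint32_t value,
                        unsigned size)
{
    assert(size == 1 || size == 2 || size == 4);
    assert(addr + size <= sizeof s->words);

    unsigned done = 0;                       /* bytes stored so far */
    while (done < size) {
        uint32_t index = (addr + done) / 4;  /* which SRAM word */
        unsigned lane  = (addr + done) % 4;  /* first byte lane used */
        unsigned n = 4 - lane;               /* bytes left in this word */
        if (n > size - done)
            n = size - done;
        uint32_t chunk = value >> (8u * done);

        if (n == 4) {
            /* Whole word replaced: a plain write is enough. */
            s->words[index] = chunk;
            s->writes++;
        } else {
            /* Partial word: read, merge the new bytes, write back. */
            uint32_t mask = ((1u << (8u * n)) - 1u) << (8u * lane);
            uint32_t old = s->words[index];
            s->reads++;
            s->words[index] = (old & ~mask) | ((chunk << (8u * lane)) & mask);
            s->writes++;
        }
        done += n;
    }
}
```

In this model an aligned 32-bit store costs one SRAM write and no read, while a byte store costs one read-modify-write and a 16-bit store at offset 3 costs two, since it touches two words and only part of each.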

One of the beauties of the cache, though, is that even though DRAMs come in 8-, 16-, and 32-bit parts, the transactions are done in 32- or 64-bit widths, which is kind of the point of a cache. (Count how many chips are on a DRAM stick: often 8 or 9, 4 or 5, 2 or 3, or some multiple of those. 8 is likely a 64-bit-wide bus with 8 bits per part; 16 is likely 64 bits wide, 8 bits per part, dual rank; and so on.) If we had to do a read-modify-write at the DRAM's slow speeds, that would be horrible. Instead we read-modify-write at the cache/SRAM speed, and all transactions, cache line evictions and fills, are at multiples of the DRAM bus width, so 64 or 2x64 or 4x64 etc. per cache line.

halfer
old_timer