
I did find some answers related to this on Stack Overflow, but none of them answered it clearly.

So if our memory is byte addressable and the word size is, for example, 4 bytes, then why not make the memory word addressable?

If I'm not mistaken, the CPU works with words, right? So when the CPU tries to get a word from memory, what's the difference between getting a 4-byte word from byte-addressable memory vs. getting a word from word-addressable memory?

John Pence
  • What does "memory" mean in your question? The actual RAM chips? (what kind?) DIMMs? The view of memory that a CPU exposes to code running on it? – harold Feb 16 '18 at 17:02

2 Answers


If I'm not mistaken, the CPU works with words, right?

It depends on the Instruction Set Architecture (ISA) implemented by the CPU. For example, x86 supports operands ranging in size from a single 8-bit byte to as much as 64 bytes (in the most recent CPUs), even though the word size in modern x86 CPUs is only 4 or 8 bytes. The word size is generally defined as the size of a general-purpose register. However, the granularity of accessing memory or registers is not necessarily restricted to the word size, which is very convenient both from the programmer's perspective and from the CPU implementation perspective, as I'll discuss next.
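As a rough illustration (a plain C sketch, nothing CPU-specific; the buffer contents are made up), the same memory can be loaded at 1-, 4-, or 8-byte granularity, independent of the word size:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* One 8-byte buffer: a byte-addressable ISA lets the same memory be
       read at several granularities, none of which must equal the word size. */
    uint8_t buf[8] = {0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88};

    uint8_t  b; memcpy(&b, buf, sizeof b);  /* 1-byte load (x86: mov r8)  */
    uint32_t w; memcpy(&w, buf, sizeof w);  /* 4-byte load (x86: mov r32) */
    uint64_t q; memcpy(&q, buf, sizeof q);  /* 8-byte load (x86: mov r64) */

    printf("byte 0x%02x, dword 0x%08x, qword 0x%016llx\n",
           b, w, (unsigned long long)q);
    return 0;
}
```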

So when the CPU tries to get a word from memory, what's the difference between getting a 4-byte word from byte-addressable memory vs. getting a word from word-addressable memory?

While an ISA may support byte addressability, a CPU that implements the ISA may not necessarily fetch data from memory one byte at a time. Spatial locality of reference is a memory access pattern very common in most real programs. If the CPU were to issue single-byte requests along the memory hierarchy, it would unnecessarily consume a lot of energy and significantly hurt performance to handle those requests and move one-byte data across the hierarchy. Therefore, typically, when the CPU issues a memory request for data of some size at some address, a whole block of memory (known as a cache line, usually 64 bytes in size and 64-byte aligned) is brought into the L1 cache. All requests to the same cache line can effectively be combined into a single request, so the address bus between different levels of the memory hierarchy does not have to include wires for the bits that constitute an offset within the cache line. In that case, the implementation would really be addressing memory at 64-byte granularity.
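A minimal sketch of that address split, assuming the 64-byte line size mentioned above (the constant and names are just illustrative):

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64   /* assumed cache-line size, as in the text above */

int main(void) {
    uintptr_t addr = 0x12345AB;  /* arbitrary byte address */
    uintptr_t base = addr & ~(uintptr_t)(LINE_SIZE - 1); /* which line     */
    uintptr_t off  = addr &  (uintptr_t)(LINE_SIZE - 1); /* byte within it */
    /* The bus between hierarchy levels can omit the 6 offset bits entirely. */
    printf("0x%jx -> line base 0x%jx + offset %ju\n",
           (uintmax_t)addr, (uintmax_t)base, (uintmax_t)off);
    return 0;
}
```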

It can be useful, however, to support byte addressability in the implementation. For example, if only one byte of a cache line has changed and the line has to be written back to main memory, it would take less energy, bandwidth, and time to send only the byte (or few bytes) that changed instead of all 64 bytes. Another situation where byte addressability is useful is when providing support for the critical-word-first optimization. There is much more to it, but to keep the answer simple, I'll stop here.
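Purely as a hypothetical sketch of the per-byte dirty-tracking idea described above (not how any shipping CPU is documented to work; the comment thread below discusses why real x86 designs track dirtiness per whole line):

```c
#include <stdint.h>

/* Hypothetical: one cache line with a 64-bit per-byte dirty mask, so that
   write-back could send only the bytes that actually changed. */
typedef struct {
    uint8_t  data[64];
    uint64_t dirty;                 /* bit i set => data[i] was modified */
} cache_line;

static void store_byte(cache_line *cl, unsigned off, uint8_t val) {
    cl->data[off] = val;
    cl->dirty |= 1ULL << off;       /* remember exactly which byte changed */
}

static void write_back(cache_line *cl, uint8_t *mem) {
    for (unsigned i = 0; i < 64; i++)
        if (cl->dirty & (1ULL << i))
            mem[i] = cl->data[i];   /* transfer only the dirty bytes */
    cl->dirty = 0;                  /* line is clean again */
}
```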

DDR SDRAM is a prevalent class of main memory interfaces used in most computer systems today. Its data bus is 8 bytes wide, and the protocol supports only transferring aligned 8-byte chunks, with byte-enable signals (called data masks) to select which bytes to write. Therefore, main memory is typically 8-byte addressable; it is the CPU that provides the illusion of byte addressability.
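A rough software model of such a masked write, with a `mask` bit per byte standing in for the DDR data-mask signals (the function name and the convention that a set bit means "write this byte" are illustrative, not the electrical protocol):

```c
#include <stdint.h>

/* Conceptual model: an aligned 8-byte chunk travels on the data bus, and a
   per-byte mask bit tells the DRAM which bytes to commit.  A single-byte
   CPU store becomes an 8-byte transfer with only one mask bit enabled. */
static void masked_write(uint8_t dram[8], const uint8_t chunk[8], uint8_t mask) {
    for (unsigned i = 0; i < 8; i++)
        if (mask & (1u << i))   /* data-mask / byte-enable bit for byte i */
            dram[i] = chunk[i];
}
```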

Hadi Brais
  • Fun fact: the 8088 had no cache and an 8-bit data bus (vs. 16-bit on the otherwise-identical 8086), so it really did do all its memory access one byte at a time. So that's a real-world example of a CPU with a 16-bit word size that accessed memory byte by byte. But that design choice was simply to make it cheaper to build (and to build systems around: fewer data bus pins on the CPU package / fewer traces on the motherboard). – Peter Cordes Feb 17 '18 at 18:37
  • Thanks, @PeterCordes. Although not sure if you were pointing out an error in my answer. The word size is not necessarily equal to the data bus width. The 8088 had 16-bit general-purpose registers. – Hadi Brais Feb 17 '18 at 18:43
  • Correction to your 2nd-last paragraph: dirty/clean is tracked on a per-line basis. Yes it would be cheaper to send only the changed byte to DRAM, but with a write-back cache there's no transfer to DRAM until after the store is done and the address / width forgotten. Some caches may have dirty/clean bits with smaller granularity than the whole cache line to optimize transfers between caches, but AFAIK this isn't done in any Intel or AMD designs. For write-back to DDR DRAM, a shortened burst is not much faster than a full burst of 8 chunks. – Peter Cordes Feb 17 '18 at 18:43
  • No, I wasn't pointing out an error, just adding an example of a CPU designed the way the OP was wondering about, though not because it's impossible to do otherwise. And yes, like I said, 8088 and 8086 are identical other than bus width (and apparently a 4-byte instruction prefetch buffer instead of 6 bytes), so yeah, 16-bit integer registers in first-gen x86. – Peter Cordes Feb 17 '18 at 18:45
  • Re: last paragraph. I was under the impression that DDR SDRAM could store a single byte without the memory controller having to do a read-modify-write cycle. I thought there were "enable" lines that could be left unset to effectively do a masked store, and the memory controller could use a short burst + setting the right enable lines to do a byte store. But I haven't looked at that part of [What Every Programmer Should Know About Memory?](https://stackoverflow.com/questions/8126311/47714514#47714514) or the DDR wiki article in that much detail recently. – Peter Cordes Feb 17 '18 at 18:48
  • x86 can definitely *architecturally* store a single byte without disturbing adjacent bytes, but the difference between a pure store vs. read-modify-write is only observable in MMIO memory, where it definitely can just do a single-byte write. Hmm, maybe a microbenchmark of byte-store vs. qword store to uncacheable memory. Or in regular memory with `movnti` 64-bit vs. 32-bit stores: if a 32-bit store requires a read-modify-write in the memory controller, but a 64-bit store doesn't, it would be slower. Probably have to use varying locations to rule out merging before hitting actual DRAM. – Peter Cordes Feb 17 '18 at 18:52
  • https://stackoverflow.com/questions/46721075/can-modern-x86-hardware-not-store-a-single-byte-to-memory. I should probably put more of these comments into my answer; it was just a quick answer before I knew what the OP was actually wondering. – Peter Cordes Feb 17 '18 at 18:53
  • @PeterCordes The one-bit tracking granularity is only according to conveniently oversimplified textbooks and research papers. What really happens in modern processors has not been disclosed publicly; it can be much more sophisticated than that. I was really just giving an example, not stating a rigid fact. Regarding DDR SDRAM, I'm not exactly sure. AFAIK, the enable lines are controlled per rank, not selectively per chip. Even if it were per chip, modern chips typically have 16-bit wide data buses, not 8-bit ones. – Hadi Brais Feb 17 '18 at 18:56
  • Modern Intel CPUs have a 32-byte data path between L2 and L3 cache, so there's little benefit to any granularity narrower than that. Making transfers to/from L3 variable between 1 or 2 cycles doesn't seem worth it vs. fixed 2-cycle. Skylake has a 64-byte path between L2 and L1D, i.e. a whole cache line at once. With narrower internal data paths, I could see the benefit of some extra bits for dirty/clean, though. – Peter Cordes Feb 17 '18 at 19:04
  • Or I guess if you're really optimizing for access patterns with poor spatial locality for stores, tracking clean parts of a line all the way to write-back helps. And support for that would also let you implement uncacheable stores propagating through the memory hierarchy, although you can't just put garbage in the surrounding bytes and treat it as a mostly-clean line because you can't let those garbage bytes be read by anything. So IDK. – Peter Cordes Feb 17 '18 at 19:04
  • @PeterCordes I agree. The access latency of L3 is relatively high, and saving one or two cycles may not justify the additional area overhead and increased design complexity of supporting data movement at multiple granularities. However, at the L2 level, doing that makes sense. Ultimately, one has to run accurate simulations to figure out what works best. – Hadi Brais Feb 17 '18 at 19:11
  • Intel's L2 caches aren't inclusive, so optimized write-back of only the dirty parts of a line from L1D to L2 would need to handle the case where the line isn't present in L2. This would require a tag-check in L2 before the data is sent, or a mechanism to hold onto the rest of the data so L2 could request it if it turns out the rest of the line wasn't still present. (It normally is still present, so a slow fallback for this case wouldn't be a performance killer, but it's a ton of extra complexity. You need L1D eviction to be fast to not exhaust line-fill buffers for loads). – Peter Cordes Feb 17 '18 at 19:18
  • And like I said, the most recent designs transfer all 64 *bytes* at a time between L1D and L2, so there's absolutely no benefit (for write-back to L2) to having finer-grained tracking. I just checked, and even Haswell has a 64-byte wide path from L2 to L1D, so it wasn't new in SKL. But I think SKL has higher L2 bandwidth (combined read+write)? L2 *sustained* read bandwidth is limited by latency / concurrency, so it's hard to benchmark. Related: https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/532346 – Peter Cordes Feb 17 '18 at 19:28
  • I'm pretty sure some microarchitectures do track dirty with finer granularity, but AFAIK no *x86* microarchitectures do. I haven't seen it mentioned in http://agner.org/optimize/, or Intel's optimization manual, or anywhere else. I haven't read as much about AMD, though. I'm not sure if x86's strong memory-ordering has any implications for this, but maybe so if more weakly-ordered machines can do cache-to-cache transfers that let some cores see stores in different orders than other cores. – Peter Cordes Feb 17 '18 at 19:31
  • @PeterCordes Having a data bus width of N bits does not mean that all the wires have to be active. It depends on the exact design; there are many, many possibilities. Yes, you're right about the bus width, but that says very little about how the bus works and exactly what metadata is held with the cache lines. – Hadi Brais Feb 17 '18 at 19:34
  • @PeterCordes Having a 64-byte data bus means that it CAN transfer 64 bytes at a time, but not necessarily HAS to. – Hadi Brais Feb 17 '18 at 19:38
  • I think adding logic to make some of the transfer conditional would cost more than just powering all 512 data lines (or however it's physically designed internally; presumably with a big parallel bus and not double-pumped or anything). Good point that 64B / cycle transfers doesn't mean you can't optimize with finer-grained granularity, but I still think that's not how it works in Intel's actual designs. – Peter Cordes Feb 17 '18 at 19:39
  • @PeterCordes It's hard for me to say without running accurate simulations. But at least if you think from the energy-is-number-one-priority perspective, it may make sense. But again, we have to measure it. There are a lot of factors. How long are the wires? What's the frequency? What's the technology node size? How frequently single-byte modifications occur? What's the target market for the processor? and so on. – Hadi Brais Feb 17 '18 at 19:44

Memory normally is byte-addressable, but whole-word loads are possible and get 4x as much data in the same time.

There's basically no difference if the word load is naturally aligned; the low bits of the address are zero instead of not being present.
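A minimal C sketch of that, assuming natural alignment (the function name is just for illustration):

```c
#include <stdint.h>
#include <string.h>

/* A naturally aligned 4-byte load covers byte addresses a, a+1, a+2, a+3,
   where a's two low bits are zero.  On hardware with a bus at least one
   word wide, those bytes move in parallel, not one at a time. */
uint32_t load_word(const uint8_t *mem, uintptr_t a) {
    uint32_t w;
    memcpy(&w, mem + a, sizeof w);  /* one 4-byte access; a % 4 == 0 assumed */
    return w;
}
```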

Peter Cordes
  • But how does fetching a 4-byte word work in byte-addressable memory? Do I need to access the memory 4 times, once for each byte, considering that the memory is byte addressable and every address contains 1 byte? For example, if I want the first word of memory at address 0, I should fetch addresses 0, 1, 2, 3, right? Because the memory is byte addressable, each word must span 4 addresses! – John Pence Feb 16 '18 at 16:05
  • @JohnPence: Thanks for clarifying what you were getting hung up on. Yes, wide loads get data from multiple bytes, each of which could be addressed separately with a narrow load. You could think of it as multiple byte-transfers, but they happen *in parallel* on CPUs with a data bus that's at least 1 word wide. (For example, the data path between L1D cache and the load/store execution units in Haswell is 256 bits wide: 32 *bytes*, so even AVX SIMD vector loads are a single operation). [How can cache be that fast?](https://electronics.stackexchange.com/questions/329789/329955#329955) – Peter Cordes Feb 17 '18 at 00:41
  • @John: Also note that data is transferred between cache and main memory in bursts of whole cache lines (typically 64 bytes). **Think of the address as the *starting point* for a multi-byte transfer.** The low bits of the address are an offset into a cache line, while the upper bits select a cache line. [this answer](https://stackoverflow.com/questions/46721075/can-modern-x86-hardware-not-store-a-single-byte-to-memory/46733018#46733018) about how CPUs could actually store a single byte to DRAM (instead of into cache like normal) is interesting. You might also want to read my answer there... – Peter Cordes Feb 17 '18 at 00:47
  • @John: For more details on how address lines select memory locations in *actual* CPUs with actual data buses, see [What Every Programmer Should Know About Memory](https://www.akkadia.org/drepper/cpumemory.pdf) (and my 2017 commentary on what's changed since it was published: https://stackoverflow.com/questions/8126311/what-every-programmer-should-know-about-memory/47714514#47714514). For aligned word loads, requiring low address bits to be zero instead of not having them exist at all is not a big difference. It just allows the extra functionality of doing a narrow load. – Peter Cordes Feb 17 '18 at 00:51
  • And BTW, the existence of byte-loads vs. word loads is what makes Endianness a thing. https://en.wikipedia.org/wiki/Endianness – Peter Cordes Feb 17 '18 at 00:51
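As a small illustrative demo of that last point (plain C, added here for illustration, not part of the original thread): a byte load from a word's address reveals the machine's byte order.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint32_t word = 0x11223344;
    uint8_t first;
    memcpy(&first, &word, 1);   /* byte load from the word's own address */
    /* Little-endian machines (like x86) print 0x44; big-endian, 0x11. */
    printf("first byte: 0x%02x\n", first);
    return 0;
}
```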