How can some architectures guarantee that aligned memory operations are atomic?

Question

As explained in the post "Why is integer assignment on a naturally aligned variable atomic on x86?":

A memory load or store of a byte, or of any correctly aligned value up to 64 bits, is guaranteed to be atomic on x86.

But what if:

1- The data crosses a cache-line boundary. Assume I have short a = 1234; and the address of a is halfword aligned, but for some reason the 2-byte value is split between two cache lines, so the CPU needs to do extra work to fetch and concatenate the halves. How can this remain atomic?

2- The value is paged out. Assume a value the CPU is trying to fetch is properly aligned but is not in the cache or even in memory, so it has to be fetched all the way from disk. How is this still atomic?

I'd like to ask a third, related question while we are at it:

3- Why does the data need to be aligned to its data type at all? Why isn't it enough for it to be within a single cache line, given that every memory load/store is done in cache-line blocks rather than in specific data sizes?

It's mathematically impossible for a naturally-aligned object to be split across any larger alignment boundary (including cache lines or pages), so there is no issue unless your line size is 1 byte. Being aligned means having the low n bits of the address = 0, so the start of a 64-byte cache line is automatically aligned by 4, etc. — Peter Cordes, Dec 16 '21 at 04:29
Could you elaborate? Is this true for all other data types as well? — Dan, Dec 16 '21 at 04:30
(3) - it *is* enough on Intel since P6. The linked Q&A explains this (and links [Atomicity on x86](https://stackoverflow.com/q/38447226) for more details). It also mentions some of the tearing mechanisms that can exist, e.g. in getting the data between cores by transferring in smaller chunks on AMD CPUs. (See also the edit to my earlier comment) — Peter Cordes, Dec 16 '21 at 04:34
Thank you. How about point number 2? The data address could still be aligned, but the CPU needs to fetch it all the way from disk -> to memory -> to cache. Will other threads also have to wait while this happens? And how does the mathematical argument you mentioned apply to pages as well? — Dan, Dec 16 '21 at 04:41
(2) The store can't actually happen until after it's paged in from disk and mapped into memory again. That doesn't introduce a way for another thread / core to see half of the write and half of the old value. If it tries to execute while the page isn't mapped, the CPU just takes a #PF exception with the load or store instruction as the resume point, to be re-run after the kernel takes care of paging. — Peter Cordes, Dec 16 '21 at 04:44
Makes sense. So does having an aligned address also affect paging, or does it only affect the transfer between memory and cache? — Dan, Dec 16 '21 at 04:45
Having a naturally-aligned address means it's in exactly one page, not crossing a page boundary, although even for unaligned, if either page was not mapped it would just fault. Paging is totally irrelevant for correctness of atomics, only cache-line boundaries matter. — Peter Cordes, Dec 16 '21 at 04:48
To elaborate on the math, say you have a naturally aligned dword. Then its first byte's address is a multiple of 4, and its remaining bytes are not. Now a cache line is 64 bytes, naturally aligned, and so its first byte's address is a multiple of 64. If that first byte doesn't fall within our dword, we are good. If it does fall within our object, then since every multiple of 64 is also a multiple of 4, the first byte of the cache line must be the first byte of our dword. In that case, our dword is entirely within the cache line. In either case it doesn't cross it. — Nate Eldredge, Dec 16 '21 at 05:45
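The dword argument can also be written as a short derivation. Here a is the dword's address and b is any cache-line boundary; the symbols a, b, i, j are our own labels, not from the comment.

```latex
\begin{aligned}
&a = 4i, \qquad b = 64j = 4\,(16j) \qquad (i, j \in \mathbb{Z}) \\
&\text{bytes } a, \dots, a+3 \text{ straddle } b
  \iff a < b < a + 4
  \iff 4i < 4\,(16j) < 4\,(i+1)
  \iff i < 16j < i + 1
\end{aligned}
```

No integer lies strictly between i and i + 1, so no cache-line boundary can fall inside the dword.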
The same logic works replacing cache lines with pages (4096 is also a multiple of 4), etc. More generally, if an object's size is a power of 2 and naturally aligned, then it cannot cross any boundary of a larger power of 2. — Nate Eldredge, Dec 16 '21 at 05:46

