
I was reading a book about assembly (intermediate level) and it mentioned that some instructions, like `xchg`, automatically assert the processor's LOCK# signal. Searching online revealed only that this gives the processor exclusive rights over any shared memory, with no specific details, which made me wonder how this right actually works.

  1. Does this mean that any other device, like a GPU, can't access memory during the lock? And can other devices talk directly to RAM at all without going through the CPU first?
  2. How does the processor know that it's in this locked state? Is it saved in a control register or in RFLAGS, for example? I can't see how this works on a multicore CPU.
  3. The websites I visited said it locks any shared memory. Does this mean that during the lock the whole RAM is locked, or just the memory page (or some part of memory, not all of it) that the instruction operates on?
KMG
  • That's mostly archaic; aligned `lock` and `xchg` instructions don't need to disturb any other cores because CPUs have MESI caches. Modern CPUs don't even have a single shared bus that could be #LOCKed. See [Can num++ be atomic for 'int num'?](https://stackoverflow.com/q/39393850). Cache-line-split atomic RMWs are extremely expensive, and do need to do something like blocking all other memory operations by all cores and DMA, though. – Peter Cordes Jan 12 '21 at 06:10
  • Given your x86-16 tag (which was maybe originally 8086 before SO replaced it), was this book about ancient single-core 8086 CPUs? It actually did have a bus that worked this way, where `LOCK#` was an actual external pin on the chip. Probably at least 386 worked like that. – Peter Cordes Jan 12 '21 at 06:12
  • @PeterCordes The book was actually for an Intel Pentium 4 CPU. – KMG Jan 12 '21 at 06:16
  • Early Pentium 4 processors have a LOCK# pin, but later Pentium 4 and all subsequent Intel processors use the serial DMI bus and don't have this pin. The DMI bus is undocumented though, so it may have a way of signalling a locked bus state despite not having a dedicated LOCK# pin on the CPU. – Ross Ridge Jan 12 '21 at 08:00

1 Answer


The basic problem is that some instructions read memory, modify the value read, then write a new value; and if the contents of memory change between the read and the write, (some) parallel code can end up in an inconsistent state.

A nice example is one CPU doing `inc dword [foo]` while another CPU does `dec dword [foo]`. After both instructions (on both CPUs) are executed the value should be the same as it originally was; but both CPUs could read the old value, then both CPUs could modify it, then both CPUs could write their new value; resulting in the value being 1 higher or 1 lower than you'd expect.
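
To make that lost-update scenario concrete, here is a minimal C sketch (mine, not from the answer): two threads hammer the same variable with plain, unlocked increments and decrements, and the final value is usually not 0. Replacing the plain `counter++`/`counter--` with `lock`-prefixed RMWs (e.g. via `__atomic_fetch_add`) would make it come out to 0 every time.

```c
/* Minimal sketch (not from the answer): two threads doing plain, unlocked
 * read-modify-writes on the same variable, showing the lost-update problem
 * described above.  Compile with:  gcc -O2 -pthread race.c
 * "counter++" / "counter--" become separate load/modify/store steps (or a
 * memory-destination inc/dec without a lock prefix), so the two threads can
 * interleave exactly as the answer describes and updates get lost. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 1000000

volatile int counter = 0;   /* volatile keeps the loops from being optimized away,
                               but does NOT make the RMW atomic */

static void *incrementer(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        counter++;          /* read, add 1, write back - not atomic */
    return NULL;
}

static void *decrementer(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        counter--;          /* read, subtract 1, write back - not atomic */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, incrementer, NULL);
    pthread_create(&b, NULL, decrementer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %d (expected 0)\n", counter);  /* usually not 0 */
    return 0;
}
```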

The solution was to use a #lock signal to prevent anything else from accessing the same piece of memory at the same time. E.g. the first CPU would assert #lock, then do its read/modify/write, then de-assert #lock; and anything else would see that #lock is asserted and have to wait until #lock is de-asserted before it can do any memory access. In other words, it's a simple form of mutual exclusion (like a spinlock, but in hardware).
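
To make the "like a spinlock, but in hardware" analogy concrete, here is a minimal software spinlock built on `xchg` (my sketch, not the answer's). Because `xchg` with a memory operand is implicitly locked, swapping 1 into the lock word and reading the old value out is one indivisible RMW, so only one CPU can see the old value 0 and win the lock. GCC/Clang inline-asm syntax, x86 only; the type and function names are made up for illustration.

```c
/* Minimal sketch (mine, not the answer's): a spinlock built on xchg.
 * xchg with a memory operand is implicitly locked, so no lock prefix is
 * needed for the swap to be atomic. */
typedef struct { volatile unsigned int locked; } spinlock_t;  /* 0 = free, 1 = held */

static inline void spin_lock(spinlock_t *lk) {
    unsigned int old;
    do {
        old = 1;
        /* swap register with memory - implicitly LOCKed */
        __asm__ volatile("xchgl %0, %1"
                         : "+r"(old), "+m"(lk->locked)
                         :
                         : "memory");
    } while (old != 0);   /* old == 1 means someone else already held it */
}

static inline void spin_unlock(spinlock_t *lk) {
    /* a plain aligned store is enough to release the lock on x86; the
       "memory" clobber stops the compiler reordering earlier accesses past it */
    __asm__ volatile("movl $0, %0" : "=m"(lk->locked) : : "memory");
}
```

(In real code you'd normally reach for the compiler's atomic builtins or C11 `<stdatomic.h>` instead of hand-written asm, and a production spinlock would also use `pause` in the retry loop; the point here is just that correctness hangs entirely on `xchg`'s implicit lock.)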

Of course "everything else has to wait" has a performance cost; so it's mostly only done when explicitly requested by software (e.g. `lock inc dword [foo]` and not `inc dword [foo]`), but there are a few cases where it's done implicitly - the `xchg` instruction when an operand uses memory, and updates to dirty/accessed/busy flags in some of the tables the CPU uses (for paging, and GDT/LDT/IDT entries). Also, later (Pentium Pro, I think?), the behavior was optimized to work with the cache coherency protocol so that #lock isn't asserted if the cache line can be put in the exclusive state instead.
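
Viewed from the software side (a sketch of mine, not part of the answer), the explicit/implicit split shows up in what compilers emit for plain vs. atomic operations. The instruction comments below describe typical x86-64 codegen and are not guaranteed output:

```c
/* Sketch (not from the answer): the same update written three ways, with
 * the x86-64 instructions a typical compiler emits noted in comments.
 * Only the atomic versions get the locked-RMW behaviour discussed above;
 * exact codegen depends on the compiler and options. */
#include <stdatomic.h>

int        plain_counter;
atomic_int shared_counter;

void plain_inc(void)
{
    plain_counter++;
    /* typically: incl plain_counter(%rip) - no lock prefix, not atomic across cores */
}

void atomic_inc(void)
{
    atomic_fetch_add_explicit(&shared_counter, 1, memory_order_relaxed);
    /* typically: lock addl $1, shared_counter(%rip) - explicit lock prefix */
}

int atomic_swap(int v)
{
    return atomic_exchange_explicit(&shared_counter, v, memory_order_seq_cst);
    /* typically: an xchg with the memory operand - xchg is locked implicitly,
       so no lock prefix is needed */
}
```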

Note: In the past there have been 2 CPU bugs (Intel Pentium "0xF00F" bug and Cyrix "Coma" bug) where a CPU can be tricked into asserting the #lock signal and never de-asserting it; causing the entire system to lock up because nothing can access any memory.

  1. Does this mean that any other device, like a GPU, can't access memory during the lock? And can other devices talk directly to RAM at all without going through the CPU first?

Yes. While #lock is asserted (which doesn't include the cases where newer CPUs can put the cache line into the exclusive state instead), anything else that accesses memory has to wait for #lock to be de-asserted.

Note: Most modern devices can and do access memory directly (i.e. they transfer data to/from RAM without the CPU doing the transfer for them).

  2. How does the processor know that it's in this locked state? Is it saved in a control register or in RFLAGS, for example? I can't see how this works on a multicore CPU.

It's not saved in the contents of any register. It's literally an electronic signal on a bus or link. For an extremely over-simplified example, assume that the bus has 32 "address" wires, 32 "data" wires, plus a #lock wire; where "assert the #lock" means that the voltage on that #lock wire goes from 0 volts up to 3.3 volts. When anything wants to read or write memory (before attempting to change the voltages on the "address" wires or "data" wires) it checks that the voltage on the #lock wire is 0 volts.

Note: A real bus is much more complicated and needs a few other signals (e.g. for direction of transfer, for collision avoidance, for "I/O port or physical memory", etc); and modern buses use serial lanes and not parallel wires; and modern systems use "point to point links" and not "common bus shared by all the things".
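
To make this over-simplified picture a bit more tangible, here is a purely illustrative toy model in C (my own invention; real buses and real CPUs look nothing like this): a `lock_wire` field stands in for the #lock wire, plain accesses wait for it to be de-asserted, and a locked read/modify/write asserts it around the whole operation.

```c
/* Purely illustrative toy model (mine, not the answer's) of the
 * over-simplified bus described above. */
#include <stdint.h>
#include <stdio.h>

struct bus {
    int      lock_wire;     /* 0 = de-asserted, 1 = asserted (#lock) */
    uint32_t ram[1024];     /* the memory behind the bus */
};

static uint32_t bus_read(struct bus *b, unsigned addr) {
    while (b->lock_wire) { /* any plain access waits while #lock is asserted */ }
    return b->ram[addr];
}

static void bus_write(struct bus *b, unsigned addr, uint32_t val) {
    while (b->lock_wire) { /* wait */ }
    b->ram[addr] = val;
}

static void locked_inc(struct bus *b, unsigned addr) {
    b->lock_wire = 1;              /* assert #lock: everything else now waits  */
    uint32_t v = b->ram[addr];     /* read                                     */
    b->ram[addr] = v + 1;          /* modify + write                           */
    b->lock_wire = 0;              /* de-assert #lock: everything else resumes */
}

int main(void) {
    struct bus b = {0};
    bus_write(&b, 42, 7);
    locked_inc(&b, 42);
    printf("ram[42] = %u\n", bus_read(&b, 42));  /* prints 8 */
    return 0;
}
```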

  3. The websites I visited said it locks any shared memory. Does this mean that during the lock the whole RAM is locked, or just the memory page (or some part of memory, not all of it) that the instruction operates on?

It's better to say that the bus is locked; where everything has to use the bus to access memory (and nothing else can use the bus when the bus is locked, even when something else is trying to use the bus for something that has nothing to do with memory - e.g. to send an IRQ to a CPU).

Of course (due to aggressive performance optimizations - primarily the "if the cache line can be put in the exclusive state instead" optimization) it's even better to say that the hardware can do anything it feels like as long as the result behaves as if there's a shared bus that was locked (even if there isn't a shared bus and nothing was actually locked).

Note: 80x86 supports misaligned accesses (e.g. you can `lock inc dword [address]` where the access straddles a boundary), where if a memory access does straddle a boundary the CPU needs to combine 2 or more pieces (e.g. a few bytes from the end of one cache line and a few bytes from the start of the next cache line). Modern virtual memory means that if the virtual address straddles a page boundary the CPU needs to access 2 different virtual pages which may have "extremely unrelated" physical addresses.

If a theoretical CPU tried to implement independent locks (a different lock for each memory area) then it would also need to support asserting multiple lock signals. This can cause deadlocks - e.g. one CPU locks "memory page 1" then tries to lock "memory page 2" (and can't because it's locked); while another CPU locks "memory page 2" then tries to lock "memory page 1" (and can't because it's locked). To fix that, the theoretical CPU would have to use "global lock ordering" - always assert locks in a specific order. The end result would be a significant amount of complexity (where it's likely that the added complexity costs more performance than it saves).
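
To see what the "straddle a boundary" case looks like from software, here is a hedged C sketch (my example, not the answer's). It relies on GCC/Clang `__atomic` builtins, assumes 64-byte cache lines, and deliberately commits the C-level sin of using a misaligned pointer (undefined behaviour) purely to show what a split-locked access is at the instruction level.

```c
/* Sketch (my example): a locked RMW that straddles a 64-byte cache-line
 * boundary - the "split lock" case described above.  The misaligned
 * pointer is undefined behaviour in C and is used only for illustration.
 * Recent CPUs/kernels with split-lock detection may trap or throttle this
 * instead of just making it slow. */
#include <stdint.h>
#include <stdio.h>

static _Alignas(64) unsigned char buf[128];

int main(void) {
    uint32_t *aligned  = (uint32_t *)&buf[0];    /* fits inside one cache line */
    uint32_t *straddle = (uint32_t *)&buf[62];   /* bytes 62..65 span two lines */

    /* Fast path: the CPU can "cache lock" the single line it owns. */
    __atomic_fetch_add(aligned, 1, __ATOMIC_RELAXED);

    /* Split lock: the locked RMW covers bytes in two different cache lines,
     * so the CPU falls back to the bus-lock-like behaviour described above
     * (Intel counts these with the sq_misc.split_lock perf event). */
    __atomic_fetch_add(straddle, 1, __ATOMIC_RELAXED);

    printf("%u %u\n", *aligned, *straddle);
    return 0;
}
```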

Brendan
  • re: your last paragraph about split locks: having to go off-core at all is vastly more expensive than just taking a "cache lock" (playing tricks with MESI when you have exclusive ownership of a line), even if you do manage to avoid stalling other cores. Since misaligned atomic RMWs are very rarely if ever needed (and are often a mistake), the real-world solution is to make it easy for developers to detect accidental usage of split-locks so they can remove them. Then the hardware can stay "slow and simple" for that very case. `sq_misc.split_lock` perf event / https://lwn.net/Articles/790464 – Peter Cordes Jan 12 '21 at 09:47
  • Specifically, as https://patchwork.kernel.org/project/kvm/cover/1555536851-17462-1-git-send-email-fenghua.yu@intel.com/ points out, Intel Tremont and other future CPUs will have a feature that allows triggering an #AC (alignment check) exception on split `lock`ed accesses. So it's like setting the AC flag but only for `lock`ed operations, not breaking normal code. (And working in kernel mode, where AC was overloaded to mean something different.) – Peter Cordes Jan 12 '21 at 09:50
  • @PeterCordes: Interesting (for future software); but the real-world situation is that Intel has to support all architecturally guaranteed behavior (even if they add new optional extensions that bypass it) because if they break backward compatibility they lose market share (e.g. people start saying "Welp, if the older software won't work I guess I have no reason not to switch to ARM or something else"). In other words, adding an optional "trap on split lock" extension does not prevent them from needing to deal with split lock. – Brendan Jan 12 '21 at 16:02
  • Intel / AMD have to make it work *correctly* to implement the x86 ISA, but my point was there's little need to make it work *fast*. Instead of solving that thorny problem (e.g. your proposal of adding per-page or per-line locks to allow most memory access by other cores / IO devices to continue while a split-lock operation is in progress), they can keep a LOCK#-equivalent behaviour for that case and just tell software to stop doing that. (And provide HW support to make it easy to verify that even, now even in virtualized environments without perf counter support.) – Peter Cordes Jan 12 '21 at 16:08
  • @PeterCordes: Ah. The "per-page lock" wasn't my proposal (it's part of the original asker's third question). I was mostly trying to explain that (while it's technically possible in theory) it's not very viable in practice. – Brendan Jan 12 '21 at 16:13
  • Ah, I see. OP wondered if it did work that way, you pointed out some of the engineering challenges that would exist if one were to try to make that actually possible, and how over-complicated a solution to those problems might be. Yup, that's a better answer than just "no, doesn't work that way". If fast / scalable split-lock support was actually needed in some alternate reality of computer architecture, it's a problem that would get solved somehow. Like maybe by using the low few address bits so only addresses that alias would be locked out. But yeah, fortunately not necessary. – Peter Cordes Jan 12 '21 at 16:21