A typical PC runs a x86 CPU, Intel's CPUs can perform the locking entirely on the caches:
if the area of memory being locked during a LOCK operation is
cached in the processor that is performing the LOCK operation as write-back memory and is completely contained
in a cache line, the processor may not assert the LOCK# signal on the bus.
Instead, it will modify the memory location internally and allow it’s cache coherency mechanism to ensure that the operation is carried out atomically.
This
operation is called “cache locking.”
The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.
From Intel Software Developer Manual 3, Section 8.1.4
The cache coherence mechanism is a variation of the MESI protocol.
In such protocol before a CPU can write to a cached location, it must have the corresponding line in the Exclusive (E) state.
This means that only one CPU at a time has a given memory location in a dirty state.
When other CPUs want to read the same location, the owner CPU will delay such reads until the atomic operation is finished.
It then follows the coherence protocol to either forward, invalidate or write-back the line.
In the above scenario a lock can be performed faster than an uncached load.
Those times however are a bit off and surely outdated.
They are intended to give an order, along with an order of magnitude, among the typical operations.
The timing for an L1 hit is a bit odd, it isn't faster than the typical instruction execution (which by itself cannot be described with a single number).
The Intel optimization manual reports, for an old CPU like Sandy Bridge, an L1 access time of 4 cycles while there are a lot of instructions with a latency of 4 cycles of less.
I would take those numbers with a grain of salt, avoiding reasoning too much on them.
The lesson Norvig tried to teach us is: hardware is layered, the closer (from a topological point of view1) to the CPU, the faster.
So when parsing a file, a programmer should avoid moving data back and forth to a file, instead it should minimize the IO pressure.
The some applies when processing an array, locality will improve performance.
Note however that these are technically, micro-optimisations and the topic is not as simple as it appears.
1 In general divide the hardware in what is: inside the core (registers), inside the CPU (caches, possibly not the LLC), inside the socket (GPU, LLC), behind dedicated bus devices (memory, other CPUs), behind one generic bus (PCIe - internal devices like network cards), behind two or more buses (USB devices, disks) and in another computer entirely (servers).