I'm writing a PIN tool where I want to see speculatively executed instructions that were eventually squashed.

That is, suppose a branch direction is predicted and some instructions execute speculatively. When the branch resolves and the prediction turns out to be wrong, those instructions are squashed and the register file is restored.

I assume that RTN_AddInstrumentFunction only adds an instrument function to instructions that were retired (i.e. non-speculative or speculative and shown to be correct). Is there a way for me to use PIN to get access to instructions that were executed speculatively but then squashed?

Peter Cordes
Farhad

2 Answers

You can't do that with binary instrumentation tools like PIN, only with hardware performance counters.

PIN can only see instructions along the correct path of execution; it works by adding / modifying instructions in memory to run extra code. But this new code is still just x86 machine code that the CPU has to execute, and the CPU preserves the illusion of running each instruction one at a time, in program order.

Mis-speculated instructions have no architectural effect so only stuff with special access to the micro-architectural state (like performance counters) can tell you anything about them.


There are perf counters for mispredicts, like `perf stat -e branch-misses` to count the number of branches that were mispredicted.

The number of bad uops issued by the front-end in the shadow of a mis-speculation, which then have to be cancelled, can be derived (on Skylake and probably other Intel CPUs) from `uops_issued.any - uops_retired.retire_slots`. Both count fused-domain uops and match each other ~exactly when there's no mis-speculation of any kind (branches, memory-order mis-speculation pipeline nukes, or whatever else).
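As a rough sketch of that subtraction, here is a small Python helper that reads `perf stat -x,` style CSV output and reports how many issued uops never retired. It assumes the usual CSV layout where the first field is the count and the third is the event name; the helper names and the sample numbers are made up for illustration, and on a real run the event names may carry a modifier suffix such as `:u`.

```python
# Sketch only: estimate issued-but-never-retired uops from two perf counters.
# Assumes `perf stat -x,` CSV rows: count, unit, event name, run time, ...
# The SAMPLE text below is invented, not a real measurement.

def parse_perf_csv(text):
    """Return {event_name: count} for rows that carry a numeric count."""
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        if len(fields) < 3 or not fields[0].strip().isdigit():
            continue  # skip blank lines and "<not counted>" rows
        counts[fields[2]] = int(fields[0])
    return counts

def wasted_uops(counts):
    """Fused-domain uops issued minus retirement slots used."""
    return counts["uops_issued.any"] - counts["uops_retired.retire_slots"]

SAMPLE = """\
1050000,,uops_issued.any,200000000,100.00,,
1000000,,uops_retired.retire_slots,200000000,100.00,,
"""

counts = parse_perf_csv(SAMPLE)
wasted = wasted_uops(counts)
fraction = wasted / counts["uops_issued.any"]
print(f"wasted uops: {wasted} ({fraction:.1%} of issued)")
```

This only gives an aggregate count of cancelled uops; it still says nothing about which instructions were on the wrong path.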

Peter Cordes
  • Would `PEBS` and/or `LBR` work for speculative execution? Is there a specific counter for speculative mispredicts [or could it be deduced from a combination of others]? – Craig Estey May 21 '20 at 04:04
  • @CraigEstey: I think LBR only tracks the architectural (non-speculative) path of execution. PEBS is just a way of collecting perf-counter results via a buffer instead of actually taking an interrupt for every counter rollover, especially useful for `perf record` style statistical sampling where you want to record where that happened. Interrupts are somewhat serializing but you set the PMU counter to only fire every 10k events or something so you aren't like single-stepping your code with an interrupt between every instruction if you don't have PEBS. – Peter Cordes May 21 '20 at 04:13
  • IDK if it's plausible to use PEBS with a counter limit of 1 or something to get a sample for every single instruction. – Peter Cordes May 21 '20 at 04:14
  • I knew about PEBS because I had helped an OP instrument his code for it a while back. But, when I just searched for it, I also found the LBR stack [which I didn't know about]. My guess is that speculative branches have to be stored because we could have (e.g.) 4 branches taken speculatively and they'd have to be recorded when executed [speculatively or not]--If we waited until they were committed, we'd have to store 4 at once [and where would they be recorded in the meantime?]. – Craig Estey May 21 '20 at 04:26
  • So, my guess is that all branches are stored. If some get tossed due to mispredict, the LBR pointer gets rolled back, so LBR would need an extra pointer for that (e.g.) as branches are committed/retired, the "rollback" pointer is advanced. If the speculative path is squashed/abandoned, the "forward" pointer is set to the rollback pointer. That's one way. But, my guess is that by the time we get to a point where the LBR stack could be examined, all the [relevant] speculative entries have been expunged. So, still, no joy ... – Craig Estey May 21 '20 at 04:36
  • @CraigEstey: I was mixing up LBR with PT (which is only correct-path taken branches), https://lwn.net/Articles/680985/ clarified although it doesn't mention speculation. Interesting idea; LBR for a perf event on an instruction along a mis-speculated path probably does show you the speculative path that got you there. The perf interrupt handler runs non-speculatively but the perf event data would pertain to a mis-speculated RIP and so on, I'd guess. – Peter Cordes May 21 '20 at 04:43
  • @CraigEstey: re: "taken speculatively": OoO CPUs consider *every* instruction to be speculative until it reaches retirement (i.e. commit to architectural state). Any load or store could fault. Branches are treated specially on Nehalem and later, though; special hardware to snapshot the RAT on branches allows fast recovery. [What exactly happens when a skylake CPU mispredicts a branch?](https://stackoverflow.com/q/50984007). That could include snapshotting the LBR state, or in fact LBR could be just adding a couple pieces of state to that branch state buffer. – Peter Cordes May 21 '20 at 04:46
  • The performance counter events compatible with PEBS and LBR (and the Load Latency Monitoring facility) all count *retired* instructions. The hardware has to start tracking selected instructions before it is known whether they are speculative, but it does not record any records until after the instruction retires. – John D McCalpin May 21 '20 at 19:48
  • @CraigEstey: my guess was wrong, PEBS and LBR only work on events for retired instructions. See Dr. Bandwidth's comment just above this. – Peter Cordes May 21 '20 at 19:59
  • Yep. I alluded to this [based on a guess] with my "rollback" notion. The LBR has to record speculative actions as they occur (to keep up), in case they _are_ on the taken path, but has to get rid of them if the speculative results have to get tossed [by _some_ sort of rollback mechanism]. From John's comment, I infer that it's a separate buffer/queue. – Craig Estey May 21 '20 at 20:53
  • _Side note:_ One of your links confirmed my long standing suspicions: that you _had_ worked for Intel on at least one arch. You had too much knowledge [and interest] otherwise. I guess your NDA has/had expired. – Craig Estey May 21 '20 at 20:56
  • @CraigEstey: Actually no, I'm just an interested amateur, never worked for Intel and never designed any CPU microarchitecture. I just try to understand them from publicly available info (especially David Kanter's writeups on https://RealWorldTech.com/, and the occasional Intel patent) and general things people have said about how CPUs are designed. Sometimes I phrase answers based on how I assume something works, especially when it seems like there's only one plausible design. Jumping to conclusions has worked well most of the time, but occasionally it turns out I'm wrong. :P – Peter Cordes May 21 '20 at 21:06
  • Oops, the link was actually: https://stackoverflow.com/a/10367322/5382650 from "Krazy Glew", but it looked like one of yours. The phrase was: _This topic is near and dear to my heart because I have proposed NOT doing this. E.g. in customer visits **while we were planning to build the P6**, I asked customers which they preferred ..._ Maybe, you _could_ work for them. Oregon is somewhat close to Canada :-) – Craig Estey May 21 '20 at 21:21
  • @CraigEstey: yeah, Andy Glew is a "famous" CPU architect who worked on P6 at Intel, and later moved to other companies. It's cool that he's answered some SO questions with some details we never would have guessed from outside, like why `adc [mem], reg` is [one more uop than you'd expect](https://stackoverflow.com/posts/comments/68191840). – Peter Cordes May 21 '20 at 21:30

You can't do that with PIN and Peter has already covered the details well.

You could, however, do it with a simulation tool such as gem5. gem5, in particular, supports both simulating x86 and reporting speculative instructions. Of course, the results you'll get are simulated, so the accuracy with respect to real hardware will only be as good as the simulation itself.

A hybrid hardware/simulation approach would be to record the actual application using Intel Processor Trace, which includes information about mispredicted branches. Then run your process again in the simulator, but refer to the metadata about mispredicted branches to hint to the simulator which branches are mispredicted.

This only works (almost) exactly for direct or conditional branches [1], which have only 1 or 2 options, so the direction a mispredict takes is evident. For indirect jumps with more than two targets, you'll have to guess what target was mispredicted.
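To make the "direction is evident" point concrete, here is a minimal Python sketch of how a simulator hint could pick the wrong-path start address for a branch that the trace flags as mispredicted. The `Branch` record and its fields are hypothetical, not part of gem5 or of any Intel PT decoder; it also ignores predictor collisions that can send even a conditional branch somewhere else entirely.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Branch:
    """Hypothetical decoded trace record for one mispredicted branch."""
    kind: str         # "conditional", "direct" or "indirect"
    target: int       # taken-target address (not meaningful for indirect)
    fallthrough: int  # address of the next sequential instruction
    taken: bool       # direction the branch actually resolved to

def wrong_path_start(b: Branch) -> Optional[int]:
    """Guess where speculation went, or None if it can't be inferred."""
    if b.kind == "conditional":
        # Only two choices: the mispredicted path is whichever one
        # the branch did not architecturally take.
        return b.fallthrough if b.taken else b.target
    # Indirect jumps may have many candidate targets; the trace alone
    # doesn't say which one was predicted, so the caller has to guess.
    return None

# A taken conditional branch that was mispredicted: speculation fell through.
print(hex(wrong_path_start(Branch("conditional", 0x401000, 0x400F10, True))))
```

A simulator would then fetch from the returned address until the branch resolves, and fall back to some heuristic when the result is `None`.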


[1] In fact, you can also get mispredictions to arbitrary addresses for direct and conditional branches when there are collisions in the predictors.

BeeOnRope