Why is this specpoline not working on Kaby lake?

Question

I'm trying to create a specpoline (cfr. Henry Wong) on my Kabe lake 7600U, I'm running CentOS 7.

The full testing repository is available on GitHub.

My version of the specpoline is as follow (cfr. spec.asm):

specpoline:
        ;Long dependancy chain
        fld1
        TIMES 4 f2xm1
        fcos
        TIMES 4 f2xm1
        fcos
        TIMES 4 f2xm1

        %ifdef ARCH_STORE
            mov DWORD [buffer], 241     ;Store in the first line
        %endif

        add rsp, 8
        ret

This version is different from the Henry Wong's one in the way the flow is diverted into architectural path. While the original version used a fixed address I pass the target into the stack.
This way, an add rsp, 8 will remove the original return address and use the artificial one.

In the first part of the function I create a long latency dependency chain using some old FPU instructions, then an independent chain that attempt to fool the CPU return stack predictor.

Description of the code

The specpoline is inserted into a profiling context using a FLUSH+RELOAD¹, the same assembly file also contains:

buffer

A continuous buffer spanning 256 distinct cache lines, each one separated by GAP-1 lines for a total of 256*64*GAP bytes.

The GAP is used to prevent hardware prefetching.

A graphical depiction follows (each index is right after the other).

timings

An array of 256 DWORDs, each entry holds the time, in core cycles, needed to access the corresponding line in the F+R buffer.

flush

A small function to touch each page (with a store, just to be sure COW is on our side) of the F+R buffer and evicts the designated lines.

'profile`

Standard profile function that uses lfence+rdtsc+lence well to profile a load from each line in the F+R buffer and store the result in the timings array.

leak

This is the function that does the real work, call the specpoline putting a store in the speculative path and the profile function in the architectural path.

;Flush the F+R lines
        call flush

        ;Unaligned stack, don't mind
        lea rax, [.profile]
        push rax
        call specpoline

        ;O.O 0
        ; o o o SPECULATIVE PATH
        ;0.0 O

        %ifdef SPEC_STORE
            mov DWORD [buffer], 241        ;Just a number
        %endif

        ud2                             ;Stop speculation

   .profile:
        ;Ll Ll
        ;  !  !  ARCHITECTURAL PATH
        ;Ll Ll
        
        ;Fill the timings array
        call profile

A small C program is used to "bootstrap" the test harness.

Running the tests

The code use pre-processors conditional to conditionally put a store in the architectural path (actually in the specpoline itself) if ARCH_STORE is defined and to conditionally put a store in the speculative path if SPEC_STORE is defined.

Both store access the first line of the F+R buffer.

Running make run_spec and make run_arch will assemble spec.asm with the corresponding symbol, compile the test and run it.

The test shows the timings for each line of the F+R buffer.

Store in the architectural path

 38    230    258    250    212    355    230    223    214    212    220    216    206    212    212    234
213    222    216    212    212    210   1279    222    226    301    258    217    208    212    208    212
208    208    208    216    210    212    214    213    211    213    254    216    210    224    211    209
258    212    214    224    220    227    222    224    208    212    212    210    210    224    213    213
207    212    254    224    209    326    225    216    216    224    214    210    208    222    213    236
234    208    210    222    228    223    208    210    220    212    258    223    210    218    210    218
210    218    212    214    208    209    209    225    206    208    206   1385    207    226    220    208
224    212    228    213    209    226    226    210    226    212    228    222    226    214    230    212
230    211    226    218    228    212    234    223    228    216    228    212    224    225    228    226
228    242    268    226    226    229    224    226    224    212    299    216    228    211    226    212
230    216    228    224    228    216    228    218    228    218    227    226    230    222    230    225
228    226    224    218    225    252    238    220    229   1298    228    216    228    208    230    225
226    224    226    210    238    209    234    224    226    255    230    226    230    206    227    209
226    224    228    226    223    246    234    226    227    228    230    216    228    211    238    216
228    222    226    227    226    240    236    225    226    212    226    226    226    223    228    224
228    224    229    214    224    226    224    218    229    238    234    226    225    240    236    210

Store in the speculative path

298    216    212    205    205   1286    206    206    208    251    204    206    206    208    208    208
206    206    230    204    206    208    208    208    210    206    202    208    206    204    256    208
206    208    203    206    206    206    206    206    208    209    209    256    202    204    206    210
252    208    216    206    204    206    252    232    218    208    210    206    206    206    212    206
206    206    206    242    207    209    246    206    206    208    210    208    204    208    206    204
204    204    206    210    206    208    208    232    230    208    204    210   1287    204    238    207
207    211    205    282    202    206    212    208    206    206    204    206    206    210    232    209
205    207    207    211    205    207    209    205    205    211    250    206    208    210    278    242
206    208    204    206    208    204    208    210    206    206    206    206    206    208    204    210
206    206    208    242    206    208    206    208    208    210    210    210    202    232    205    207
209    207    211    209    207    209    212    206    232    208    210    244    204    208    255    208
204    210    206    206    206   1383    209    209    205    209    205    246    206    210    208    208
206    206    204    204    208    246    206    206    204    234    207    244    206    206    208    206
208    206    206    206    206    212    204    208    208    202    208    208    208    208    206    208
250    208    214    206    206    206    206    208    203    279    230    206    206    210    242    209
209    205    211    213    207    207    209    207    207    211    205    203    207    209    209    207

I put a store in the architectural path to test the timings function, it seems to work.

However I'm unable to get the same result with a store in the speculative path.

Why is the CPU not speculatively executing the store?

¹ I confess I have never really spent time distinguishing all the cache profiling techniques. I hope I used the right name. By FLUSH+RELOAD I mean the process of evicting a set of lines, speculatively execute some code and then record the time to access each of the evicted line.

Your "long dep chain" is *many* uops from those microcoded x87 instructions. fcos is 53-105 uops on SKL, with 50-130 cycle throughput. So it's about 1 cycle per uop delay. If you want the most delay from uops that fit in the RS, a `sqrtpd` dependency chain is probably your best bet. — Peter Cordes, Apr 05 '19 at 21:26
@PeterCordes I initially used `sqrtpd` with similar results. However I didn't initialise the XMM register used as input (and output) thinking it didn't matter. I tested again but this time I initialised the register with two doubles of value `1e200` and what I get is an intermittent result. Sometime the line is fetched speculatively sometime it is not. It seems a `movups xmm0, [...]` followed by ten `sqrtpd xmm0, xmm0` is the best combination, though not 100% reliable. It's a bit late here, I'll update the answer tomorrow. Thank you for the input! — Margaret Bloom, Apr 05 '19 at 22:19

Peter Cordes · Accepted Answer · 2019-04-06T10:11:17.873

Your "long dep chain" is many uops from those microcoded x87 instructions. fcos is 53-105 uops on SKL, with 50-130 cycle throughput. So it's about 1 cycle per uop delay, and the scheduler / reservation-station (RS) "only" has 97 entries in SKL/KBL. Also, getting later instructions into the out-of-order back end may be a problem, because microcode takes over the front-end and needs some kind of mechanism to decide which uops to issue next, probably depending on the result of some computation. (The number of uops is known to be data-dependent.)

If you want the most delay from an RS full of un-executed uops, a sqrtpd dependency chain is probably your best bet. e.g.

    xorps  xmm0,xmm0                   ; avoid subnormals that might trigger FP assists
    times 40 sqrtsd xmm0, xmm0

    ; then make the store of the new ret addr dependent on that chain
    movd   ebx, xmm0
    ; and  ebx, 0            ; not needed, sqrt(0) = 0.0 = integer bit pattern 0
    mov [rsp+rbx], rax
    ret

Since Nehalem, Intel CPUs have fast recovery for branch misses with a Branch Order Buffer that snapshots the OoO state (including the RAT and probably RS) What exactly happens when a skylake CPU mispredicts a branch?. So they can recover exactly to the mispredict without waiting for the mispredict to be the retirement state.

mov [rsp], rax could execute as soon as it entered the RS, or at least without a dependency on the sqrt dep chain. As soon as store-forwarding could produce the value, the ret uop could execute and check the prediction, and detect the mispredict while the sqrt dep chain was still crunching. (ret is 1 micro-fused uop for load ports + port 6, where the taken-branch execution unit is.)

Coupling the sqrtsd dep chain into storing a new return address prevents the ret from executing early. Executing a ret uop in an execution port = checking the prediction and detecting the mispredict if there is one.

(Contrast with Meltdown, where the "wrong" path keeps running until the faulting load reaches retirement, and you want it to execute ASAP (just not retire). But you usually want to put a whole Meltdown attack in the shadow of something else, like TSX or a specpoline, in which case you would need something like this and have the whole meltdown in the shadow of this dep chain. Then Meltdown wouldn't need its own sqrtsd dep chain.)

(vsqrtpd ymm is still 1 uop on SKL, with worse throughput than xmm, but it has the same latency. So use sqrtsd because it's the same length and presumably more energy-efficient.)

The best-case latency is 15 cycles vs. worst-case of 16 on SKL/KBL (https://agner.org/optimize), so it hardly matters what input you start with.

I initially used sqrtpd with similar results. However I didn't initialise the XMM register used as input (and output) thinking it didn't matter. I tested again but this time I initialised the register with two doubles of value 1e200 and what I get is an intermittent result. Sometime the line is fetched speculatively sometime it is not.

If XMM0 holds a subnormal (e.g. bit pattern is a small integer), sqrtpd takes a microcode assist. (fp_assist.any perf counter). Even if the result is normal but the input is subnormal. I tested both cases on SKL, with this loop:

  pcmpeqd   xmm0,xmm0
  psrlq     xmm0, 61        ; or 31 for a subnormal input whose sqrt is normalized
  addpd     xmm0,xmm0       ; avoid domain-crossing vec-int -> vec-fp weirdness

  mov   ecx, 10000000
.loop:
    sqrtpd  xmm1, xmm0
    dec    ecx
    jnz   .loop

 mov eax,1
 int 0x80   ; sys_exit

perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,fp_assist.any shows 1 assist per iteration for denormal inputs, with 951M uops issued (and ~160 cycles per iteration). So we can conclude a microcode assist for sqrtpd takes ~95 uops in this case, and has a throughput cost ~160 cycles when it happens back-to-back.

vs. 20M uops issued total for input = NaN (all-ones), with 4.5 cycles per iteration. (The loop runs 10M sqrtpd uops, and 10M macro-fused dec/jcc uops.)

I get much more consistent results replacing the `add rsp, 8` with a store to `[rsp]` to divert the execution. Just like Henry Wong did. Maybe some imperfect scheduling of the `add` or something, IDK. Anyway your observations helped me a lot. — Margaret Bloom, Apr 06 '19 at 07:27
Actually I had to make a single dependency chain with `xorps xmm0, xmm0 / TIMES 10 sqrtpd xmm0, xmm0 / movq rbx, xmm0 / mov [rsp+rbx], rax / ret`. I guess the CPU otherwise executes the store to `[rsp]` too early and corrects the Return Stack Predictor prediction before the speculative path is executed too far. — Margaret Bloom, Apr 06 '19 at 09:52
@MargaretBloom: Ahh, right. Since Nehalem, Intel CPUs have fast recovery for branch misses with a Branch Order Buffer that snapshots the OoO state (including the RAT and probably RS) [What exactly happens when a skylake CPU mispredicts a branch?](//stackoverflow.com/q/50984007). So it can recover exactly to the mispredict without waiting for the mispredict to be the retirement state. I hadn't looked super carefully at what you were doing; with Meltdown the "wrong" path keeps running until the faulting load reaches retirement, but that's not the case here. — Peter Cordes, Apr 06 '19 at 09:56
@MargaretBloom: updated my answer with that, thanks for the details on what worked. — Peter Cordes, Apr 06 '19 at 10:12

Why is this specpoline not working on Kaby lake?

Description of the code

Running the tests

1 Answers1