
In this code, I'm just looping through the same set of instructions many times. Regardless of the iteration count (100, 1,000, 1,000,000), the RDTSC timing reports about 6 clock cycles per loop iteration. I'm on a Coffee Lake i9-9900K.

There are 13 instructions in the loop, so I would have thought the minimum RDTSC delta per iteration would be 13.

Would someone be able to educate me as to how this seems to run roughly twice as fast as I expected? I'm clearly misunderstanding something basic, or I've made a ridiculous mistake.

Thank you!

    rng.SetFloatScale(2.0f / 8.0f);
00C010AE  vmovups     ymm4,ymmword ptr [__ymm@3e0000003e0000003e0000003e0000003e0000003e0000003e0000003e000000 (0C02160h)]  

    Vec8f sum = 0;
    const size_t loopLen = 1000;

    auto start = __rdtsc();
00C010BB  rdtsc  
00C010BD  mov         esi,eax  
        sum += rng.NextScaledFloats();
00C010F0  vpslld      ymm0,ymm2,xmm5  
00C010F4  vpxor       ymm1,ymm0,ymm2  
00C010F8  vpsrld      ymm0,ymm1,xmm6  
00C010FC  vpxor       ymm1,ymm0,ymm1  
00C01100  vpslld      ymm0,ymm1,xmm7  
00C01104  vpxor       ymm2,ymm0,ymm1  
00C01108  vpand       ymm0,ymm2,ymmword ptr [__ymm@007fffff007fffff007fffff007fffff007fffff007fffff007fffff007fffff (0C02140h)]  
00C01110  vpor        ymm0,ymm0,ymmword ptr [__ymm@4000000040000000400000004000000040000000400000004000000040000000 (0C021A0h)]  
00C01118  vmovups     ymm1,ymm4  
00C0111C  vfmsub213ps ymm1,ymm0,ymmword ptr [__ymm@3e8000003e8000003e8000003e8000003e8000003e8000003e8000003e800000 (0C02180h)]  
00C01125  vaddps      ymm3,ymm1,ymm3  
    for (size_t i = 0; i < loopLen; i++)
00C01129  sub         eax,1  
00C0112C  jne         main+80h (0C010F0h)  
    auto end = __rdtsc();
00C0112E  rdtsc  
00C01130  mov         edi,eax  
00C01132  mov         ecx,edx  

    printf("\n\nAverage: %f\nAverage RDTSC: %ld\n", fsum, (end - start) / loopLen);
  • RDTSC counts [in reference cycles, not core clock cycles](https://stackoverflow.com/a/51907627/224132), but assuming your reference frequency is similar to the max-turbo that will be similar. Also assuming your loop is long enough to hide OoO exec effects (rdtsc isn't serializing at all). More importantly there's instruction-level parallelism so you actually only bottleneck on the latency of the loop-carried dep chain. Not easy to follow in this code, but presumably goes through the VFMSUB (4 cycles) and 2 other 1-cycle instructions, or there's SIMD-int / FP bypass latency? – Peter Cordes Mar 19 '21 at 17:06
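
As a sanity check on the reference-cycle point, the TSC frequency can be estimated by calibrating `__rdtsc` against a wall-clock timer. This is only a sketch, assuming a GCC/Clang toolchain (MSVC would include `<intrin.h>` instead of `<x86intrin.h>`); the 200 ms sleep is an arbitrary calibration interval.

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <x86intrin.h>   // __rdtsc (MSVC: <intrin.h>)

    int main() {
        using clk = std::chrono::steady_clock;
        auto t0 = clk::now();
        uint64_t c0 = __rdtsc();
        std::this_thread::sleep_for(std::chrono::milliseconds(200));  // arbitrary interval
        uint64_t c1 = __rdtsc();
        auto t1 = clk::now();

        double seconds = std::chrono::duration<double>(t1 - t0).count();
        // Reference (TSC) frequency; compare this with the core frequency the
        // CPU actually turbos to while the benchmark loop is running.
        printf("TSC ~ %.0f MHz\n", (c1 - c0) / seconds / 1e6);
    }

If the TSC ticks near the 3.6 GHz base clock while the core turbos toward 5 GHz, 6 reference cycles would correspond to roughly 8 core clock cycles, which is the 8-vs-6 effect mentioned further down.
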
  • If not for that bottleneck, your loop with 10 uops for the SIMD/FP execution units could probably execute in about 10/3 = 3.33 cycles per iter on your Skylake-derived microarchitecture, since it looks like a fairly good distribution of types of uops for the various execution ports. (i.e. not all fma / addps which could only run on ports 0 or 1) – Peter Cordes Mar 19 '21 at 17:07
  • Doesn't the fact that it's one long dependency chain from 00C010F0 through 00C01125 mean that the "reciprocal throughput" values can't be used to estimate the timing? Maybe asked differently: is it faster because it's a chain that takes the output of the previous instruction and uses it immediately in the next one, and that next instruction can often execute on a different port? – Veldaeven Mar 19 '21 at 17:15
  • It looks like the bottleneck is the first 6 operations (the first one depends on `ymm2`, which is calculated by the sixth operation of the previous iteration). This looks like an XOR-shift pseudo-RNG. – chtz Mar 19 '21 at 17:15
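
If it is the standard xorshift32 generator, a scalar model of one iteration might look like the sketch below. This is an assumption based on the disassembly: the real shift counts live in xmm5/6/7 and aren't visible, so the classic 13/17/5 triple is used, and `XorShift32`, `next`, and `nextFloat` are placeholder names rather than the OP's API.

    #include <cstdint>
    #include <cstring>

    struct XorShift32 {
        uint32_t state = 0x12345678u;          // any nonzero seed

        uint32_t next() {
            uint32_t x = state;
            x ^= x << 13;                      // vpslld + vpxor
            x ^= x >> 17;                      // vpsrld + vpxor
            x ^= x << 5;                       // vpslld + vpxor  <- these 6 ops form the loop-carried chain
            return state = x;
        }

        // The vpand / vpor / vfmsub213ps tail: keep 23 random mantissa bits,
        // force the exponent of 2.0f to get a value in [2.0, 4.0), then remap
        // it to [0, scale) with one FMA.  This part is not loop-carried.
        float nextFloat(float scale) {
            uint32_t bits = (next() & 0x007fffffu) | 0x40000000u;
            float f;
            std::memcpy(&f, &bits, sizeof f);  // bit-cast: f is in [2.0, 4.0)
            return f * (scale * 0.5f) - scale; // matches the 0x3e000000 / 0x3e800000 constants when scale = 0.25
        }
    };
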
  • @chtz: Oh yes, I see, a 6-cycle chain of integer SIMD, and each iteration of that does an independent AND/OR / FMSUB to create floats, feeding a 4-cycle `vaddps` dep chain to accumulate those random floats. So the critical path is the integer work. Veldaeven: note that the top of the loop only reads YMM2, not the YMM1, YMM0, or YMM3 outputs created by later parts of that dep chain. So yes there's a longer dep chain, but only the first 6 instructions of it are loop-carried. The rest "forks off" independently every iteration, feeding into a separate `vaddps` dep chain. – Peter Cordes Mar 19 '21 at 17:26
  • And BTW, `vpslld y,y,x` is not single-uop on your Coffee Lake (https://uops.info/); when I first looked, I thought it was an immediate shift count which is more efficient. Actually, @chtz, even the latency is higher: https://uops.info/html-lat/CFL/VPSLLD_YMM_YMM_XMM-Measurements.html#lat2-%3E1 says it's 3-cycle latency from operand 2->1 (input ymm -> output ymm) on Coffee Lake. (It's 4 cycles from the xmm shift count to the ymm output, but that's loop invariant so OoO exec can run that uop early). Perhaps your turbo vs. RDTSC frequencies are creating an 8 vs. 6 effect? – Peter Cordes Mar 19 '21 at 17:31
  • So you could / should fix this by making your shift counts in your PRNG `const int` or whatever so the compiler can properly use immediates instead of having to load the count into an XMM register. Or if you're using intrinsics, use `_mm256_slli_epi32`, especially if this is MSVC which is too literal-minded with intrinsics to constant-propagate through a `__m128i` arg. – Peter Cordes Mar 19 '21 at 17:33
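
A hypothetical intrinsics version with compile-time shift counts, so the compiler can emit `vpslld`/`vpsrld` with an immediate instead of loading the counts into XMM registers (the 13/17/5 counts are, again, an assumption):

    #include <immintrin.h>

    // One xorshift state update for 8 lanes of 32-bit state (requires AVX2).
    static inline __m256i xorshift32_step(__m256i s) {
        s = _mm256_xor_si256(s, _mm256_slli_epi32(s, 13));
        s = _mm256_xor_si256(s, _mm256_srli_epi32(s, 17));
        s = _mm256_xor_si256(s, _mm256_slli_epi32(s, 5));
        return s;
    }

With immediate counts, each shift is a single uop with 1-cycle latency on Skylake-derived cores, instead of the slower `vpslld ymm,ymm,xmm` form discussed above (multiple uops, 3-cycle latency on Coffee Lake).
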
  • I went off and did a lot of testing (removing stuff, adding stuff, checking the deltas)... If I remove the FMA and the succeeding vaddps, there is zero difference in the recorded RDTSC. This makes me think that (as @PeterCordes noted) there are two chains, one of them loop-carried. Once I looked at it this way, it made more sense. BTW, the values are indeed declared `const int`, yet the compiler still put them in YMM registers. No idea why. – Veldaeven Mar 19 '21 at 22:50
  • @chtz, it's a standard 32-bit xorshift that's then massaged into scaled floats at the bit level. It's functionally accurate, but I just didn't quite see the two separate chains that Peter noted. – Veldaeven Mar 19 '21 at 22:56
  • @Veldaeven: There are two loop-carried dep chains: one through the 6 SIMD-integer instructions, one through `vaddps`. The vaddps chain is shorter (4 cycles on Skylake) so it's not the critical path. (Actually, `sub eax,1` is also loop-carried, but has only 1 cycle latency, and can run on port 6 where it doesn't compete with any of the SIMD instructions.) – Peter Cordes Mar 19 '21 at 23:27
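
Putting this thread's numbers into the simple latency-vs-throughput model gives the observed figure. The values below are the ones quoted in the comments, not fresh measurements, and the 6-cycle integer chain assumes 1-cycle shifts; the variable-count `vpslld` latency discussed above would raise it.

    #include <algorithm>
    #include <cstdio>

    int main() {
        // Lower bounds on cycles per iteration; the slowest one wins.
        double int_chain  = 6.0;        // 6 dependent SIMD-integer ops (xorshift state update)
        double fp_chain   = 4.0;        // vaddps accumulator latency on Skylake-derived cores
        double ctr_chain  = 1.0;        // sub eax,1 / jne loop counter
        double throughput = 10.0 / 3.0; // ~10 SIMD/FP uops spread over 3 vector ALU ports

        double bound = std::max({int_chain, fp_chain, ctr_chain, throughput});
        printf("predicted core cycles per iteration ~ %.2f\n", bound);  // prints 6.00
    }
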
  • Re: immediate shift: IDK what you're doing wrong, because GCC and MSVC are both fine with turning a `const int` into an immediate for `_mm256_slli_epi32`. https://godbolt.org/z/WxxTGs. (Although interestingly, neither one requires the count to be a compile-time constant expression, despite the definition of the intrinsic as being for the immediate version.) Perhaps you're passing a count as a function arg instead of template param, and that's getting MSVC to make worse asm? I'm pretty sure GCC would still have no trouble doing constant propagation after inlining. – Peter Cordes Mar 19 '21 at 23:33

0 Answers