I'm attempting to write C code that masks CPU op latency by software pipelining. Here is an excerpt:

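// Four independent loads -> four separate dependency chains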
__m256  v256f_rslt_0 = _mm256_loadu_ps(&ch_results_8[pos + (0 * FLOATS_IN_M256)]);
__m256  v256f_rslt_1 = _mm256_loadu_ps(&ch_results_8[pos + (1 * FLOATS_IN_M256)]);
__m256  v256f_rslt_2 = _mm256_loadu_ps(&ch_results_8[pos + (2 * FLOATS_IN_M256)]);
__m256  v256f_rslt_3 = _mm256_loadu_ps(&ch_results_8[pos + (3 * FLOATS_IN_M256)]);

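// The corresponding per-element scale factors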
__m256  v256f_scale_0 = _mm256_loadu_ps(&cl_8[pos + (0 * FLOATS_IN_M256)]);
__m256  v256f_scale_1 = _mm256_loadu_ps(&cl_8[pos + (1 * FLOATS_IN_M256)]);
__m256  v256f_scale_2 = _mm256_loadu_ps(&cl_8[pos + (2 * FLOATS_IN_M256)]);
__m256  v256f_scale_3 = _mm256_loadu_ps(&cl_8[pos + (3 * FLOATS_IN_M256)]);

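// Math op 1: clamp negative results to zero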
v256f_rslt_0 = _mm256_max_ps(v256f_rslt_0, v256f_c_zero);
v256f_rslt_1 = _mm256_max_ps(v256f_rslt_1, v256f_c_zero);
v256f_rslt_2 = _mm256_max_ps(v256f_rslt_2, v256f_c_zero);
v256f_rslt_3 = _mm256_max_ps(v256f_rslt_3, v256f_c_zero);

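// Math op 2: apply the scale factors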
v256f_rslt_0 = _mm256_mul_ps(v256f_rslt_0, v256f_scale_0);
v256f_rslt_1 = _mm256_mul_ps(v256f_rslt_1, v256f_scale_1);
v256f_rslt_2 = _mm256_mul_ps(v256f_rslt_2, v256f_scale_2);
v256f_rslt_3 = _mm256_mul_ps(v256f_rslt_3, v256f_scale_3);

There are 5 math ops, each unrolled 4 times; only 2 of the 5 (the max and the multiply) are shown above.

However, the compiler destroys the pipelining. Here's a portion of the ASM:

vmaxps  ymm2, ymm0, ymm10
vmulps  ymm0, ymm2, YMMWORD PTR [r9+rax-96]
vminps  ymm2, ymm0, ymm7
vmovups ymm0, YMMWORD PTR [rax-64]
vmulps  ymm6, ymm3, ymm8
vsubps  ymm3, ymm7, ymm2

vmaxps  ymm2, ymm0, ymm10
vmulps  ymm0, ymm2, YMMWORD PTR [r9+rax-64]
vminps  ymm2, ymm0, ymm7
vmovups ymm0, YMMWORD PTR [rax-160]
vmulps  ymm5, ymm3, ymm8
vsubps  ymm3, ymm7, ymm2

The compiler has clearly regrouped the code into 4 blocks, each running one dependency chain's ops back to back, which means maximum latency will occur.

Compiler optimizations: /O2 /Oi /Ot /GL
Linker optimizations: /OPT:REF /OPT:ICF /LTCG:incremental

Is there a way to prevent the compiler from reordering the instructions and thus preserve the pipelined source code?

IamIC

1 Answer


Software-pipelining at that small a scale is generally not necessary on CPUs with out-of-order execution, as long as you're using multiple accumulators so there is some ILP for the CPU to find.
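For illustration, here's a minimal sketch of what "multiple accumulators" means, using a sum reduction (the function name, `a`, and `n` are hypothetical, not code from the question). With four accumulators, each loop-carried vaddps chain only has to complete once per four vectors of work, so out-of-order exec can overlap them:

#include <immintrin.h>
#include <stddef.h>

/* Hypothetical example: sum n floats (n assumed to be a multiple of 32)
   with four independent accumulators, giving the CPU four vaddps
   dependency chains to overlap instead of one serial chain. */
static __m256 sum_ps(const float *a, size_t n)
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 32) {   /* 4 vectors * 8 floats */
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(&a[i +  0]));
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(&a[i +  8]));
        acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(&a[i + 16]));
        acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(&a[i + 24]));
    }
    /* Combine the chains only once, outside the loop; the caller still
       needs a horizontal sum of the returned vector. */
    return _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                         _mm256_add_ps(acc2, acc3));
}

Your excerpt already has that shape: four independent rslt chains that the CPU can interleave on its own.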

Modern x86 CPUs are surprisingly robust with respect to small-scale instruction scheduling, now that uop caches mostly remove front-end decode / alignment issues. (But instruction position relative to 32-byte boundaries still has an effect on the uop cache, which can matter if you're front-end bottlenecked.)

Back-end bottlenecks caused by instruction scheduling are rare until you get to much longer dep chains, longer than the RS size: see "Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths" for lots of detail on how modern CPUs handle multiple long dep chains, and what the limits are on finding the ILP.

The only in-order CPU that could run this AVX code is first-gen Xeon Phi (Knights Corner), and you'd normally want to use its variant of AVX512 instead of AVX2.


Agreed that this instruction scheduling is probably worse than the order you used in the source.

At larger scales, or if you find that even at this scale manually scheduling the instructions (e.g. by editing the compiler-generated asm) helps performance, then try using a better compiler.

gcc, clang, and ICC can all compile intrinsics, so you're not stuck with MSVC.
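For example (the file name is hypothetical, and these are just typical flags; any switch that enables AVX2 code-gen works):

gcc -O3 -mavx2 -c kernel.c
clang -O3 -mavx2 -c kernel.c

The closest MSVC equivalents would be /O2 /arch:AVX2.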

Peter Cordes
  • Peter, but a library such as Yeppp! achieves its record-breaking speeds by using precisely this sort of optimization. In my experience, CPUs are not great at looking ahead and rearranging ops, despite the marketing hype. – IamIC Aug 27 '18 at 15:22
  • I think a better compiler is the answer. MSVC consistently reorders my code in such a way as to maximize latency. – IamIC Aug 27 '18 at 15:25
  • @IamIC: Did you read the linked question? Never mind marketing hype, that shows hard evidence from performance counters of out-of-order execution in action. Yes, it's somewhat better to do some software-pipelining, but it's not a big deal *on a small scale*. I wouldn't recommend MSVC, but you might find that gcc does something similar. BTW, putting uops on the critical path earlier can maybe reduce resource conflicts, so if one of these dep chains was more important than the others it would make sense. [How are x86 uops scheduled, exactly?](https://stackoverflow.com/q/40681331). – Peter Cordes Aug 27 '18 at 15:25
  • I'm reading it now. So the bottom line seems to be a) keep the code, since my non-pipelined version has no options for parallelism but this code does, and b) get a better compiler. – IamIC Aug 27 '18 at 15:32
  • Could you please clarify exactly what is considered small scale vs. large scale in this context? I have 5 ops, each of which will stall for 3 cycles. That's not efficient at all. – IamIC Aug 27 '18 at 15:34
  • @IamIC: Anything the front-end can issue in a couple cycles, like up to 8 uops, is definitely small scale. The out-of-order scheduler size is 54 in Sandybridge, 97 in Skylake. Use IACA ([What is IACA and how do I use it?](https://stackoverflow.com/q/26021337)) to do static analysis on your code for Haswell or whatever, and see that it does not in fact stall at all, if this is a loop body. Except for maybe 2 cycles on the first iteration before the first uop from the 2nd dependency chain gets issued. – Peter Cordes Aug 27 '18 at 15:47
  • @IamIC: see also [How does a single thread run on multiple cores?](https://softwareengineering.stackexchange.com/a/350024) (bogus title, but my answer there has an intro to out-of-order exec with some simple examples showing how *one* core finds the instruction-level parallelism in a single instruction stream.) – Peter Cordes Aug 27 '18 at 15:51