I'm attempting to write C code that hides CPU instruction latency by interleaving independent operations (software pipelining). Here is an excerpt:
__m256 v256f_rslt_0 = _mm256_loadu_ps(&ch_results_8[pos + (0 * FLOATS_IN_M256)]);
__m256 v256f_rslt_1 = _mm256_loadu_ps(&ch_results_8[pos + (1 * FLOATS_IN_M256)]);
__m256 v256f_rslt_2 = _mm256_loadu_ps(&ch_results_8[pos + (2 * FLOATS_IN_M256)]);
__m256 v256f_rslt_3 = _mm256_loadu_ps(&ch_results_8[pos + (3 * FLOATS_IN_M256)]);
__m256 v256f_scale_0 = _mm256_loadu_ps(&cl_8[pos + (0 * FLOATS_IN_M256)]);
__m256 v256f_scale_1 = _mm256_loadu_ps(&cl_8[pos + (1 * FLOATS_IN_M256)]);
__m256 v256f_scale_2 = _mm256_loadu_ps(&cl_8[pos + (2 * FLOATS_IN_M256)]);
__m256 v256f_scale_3 = _mm256_loadu_ps(&cl_8[pos + (3 * FLOATS_IN_M256)]);
v256f_rslt_0 = _mm256_max_ps(v256f_rslt_0, v256f_c_zero);
v256f_rslt_1 = _mm256_max_ps(v256f_rslt_1, v256f_c_zero);
v256f_rslt_2 = _mm256_max_ps(v256f_rslt_2, v256f_c_zero);
v256f_rslt_3 = _mm256_max_ps(v256f_rslt_3, v256f_c_zero);
v256f_rslt_0 = _mm256_mul_ps(v256f_rslt_0, v256f_scale_0);
v256f_rslt_1 = _mm256_mul_ps(v256f_rslt_1, v256f_scale_1);
v256f_rslt_2 = _mm256_mul_ps(v256f_rslt_2, v256f_scale_2);
v256f_rslt_3 = _mm256_mul_ps(v256f_rslt_3, v256f_scale_3);
There are 5 math operations, each unrolled 4 ways across the four vectors; only 2 of the 5 (the max and the first mul) are shown above.
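Judging by the vminps, vsubps, and second vmulps in the disassembly below, the remaining three operations presumably follow the same interleaved pattern, roughly like this sketch (v256f_c_limit and v256f_c_gain are placeholder names, not identifiers from the actual source):

/* Hypothetical continuation of the remaining three ops, inferred from the ASM;
   the constant names are made up for illustration. */
v256f_rslt_0 = _mm256_min_ps(v256f_rslt_0, v256f_c_limit);
v256f_rslt_1 = _mm256_min_ps(v256f_rslt_1, v256f_c_limit);
v256f_rslt_2 = _mm256_min_ps(v256f_rslt_2, v256f_c_limit);
v256f_rslt_3 = _mm256_min_ps(v256f_rslt_3, v256f_c_limit);
v256f_rslt_0 = _mm256_sub_ps(v256f_c_limit, v256f_rslt_0);
v256f_rslt_1 = _mm256_sub_ps(v256f_c_limit, v256f_rslt_1);
v256f_rslt_2 = _mm256_sub_ps(v256f_c_limit, v256f_rslt_2);
v256f_rslt_3 = _mm256_sub_ps(v256f_c_limit, v256f_rslt_3);
v256f_rslt_0 = _mm256_mul_ps(v256f_rslt_0, v256f_c_gain);
v256f_rslt_1 = _mm256_mul_ps(v256f_rslt_1, v256f_c_gain);
v256f_rslt_2 = _mm256_mul_ps(v256f_rslt_2, v256f_c_gain);
v256f_rslt_3 = _mm256_mul_ps(v256f_rslt_3, v256f_c_gain);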
However, the compiler destroys the pipelining. Here's a portion of the ASM:
vmaxps ymm2, ymm0, ymm10
vmulps ymm0, ymm2, YMMWORD PTR [r9+rax-96]
vminps ymm2, ymm0, ymm7
vmovups ymm0, YMMWORD PTR [rax-64]
vmulps ymm6, ymm3, ymm8
vsubps ymm3, ymm7, ymm2
vmaxps ymm2, ymm0, ymm10
vmulps ymm0, ymm2, YMMWORD PTR [r9+rax-64]
vminps ymm2, ymm0, ymm7
vmovups ymm0, YMMWORD PTR [rax-160]
vmulps ymm5, ymm3, ymm8
vsubps ymm3, ymm7, ymm2
The compiler has clearly regrouped the code into 4 sequential blocks, one per vector, so each dependency chain runs back-to-back and the instruction latencies are fully exposed.
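In other words, the emitted code executes each vector's five dependent operations mostly back-to-back, roughly as if the source had been written chain-by-chain rather than interleaved. A rough C equivalent of that ordering (again using the placeholder constant names from the sketch above):

/* Approximate C equivalent of the compiler's per-vector ordering. */
__m256 v256f_rslt_0 = _mm256_loadu_ps(&ch_results_8[pos + (0 * FLOATS_IN_M256)]);
v256f_rslt_0 = _mm256_max_ps(v256f_rslt_0, v256f_c_zero);
v256f_rslt_0 = _mm256_mul_ps(v256f_rslt_0, v256f_scale_0);
v256f_rslt_0 = _mm256_min_ps(v256f_rslt_0, v256f_c_limit);
v256f_rslt_0 = _mm256_sub_ps(v256f_c_limit, v256f_rslt_0);
v256f_rslt_0 = _mm256_mul_ps(v256f_rslt_0, v256f_c_gain);
/* ...then the same five-op chain for v256f_rslt_1, _2, and _3... */

Because each instruction in that chain consumes the result of the previous one, the multi-cycle latency of every op is serialized instead of being overlapped across the four independent vectors.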
Compiler optimizations: /O2 /Oi /Ot /GL
Linker optimizations: /OPT:REF /OPT:ICF /LTCG:incremental
Is there a way to prevent the compiler from reordering the instructions and thus preserve the pipelining expressed in the source code?