Why are serializing instructions inherently pipeline-unfriendly?
In this other answer [Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs], the following was stated:
Time each iteration independently, with something even heavier than RDTSC. e.g. CPUID / RDTSC or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly.
I think it should be the opposite: serialized instructions are very good for the pipeline. For example, this code:
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
Assembly generated by g++ main.cpp -S:
addl %edx, %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %edx
movl %edx, %eax
sall $2, %eax
addl %edx, %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %edx
movl %edx, %eax
sall $2, %eax
addl %edx, %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %edx
movl %edx, %eax
sall $2, %eax
addl %edx, %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %edx
movl %edx, %eax
sall $2, %eax
addl %edx, %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %edx
movl %edx, %eax
sall $2, %eax
addl %edx, %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %edx
movl %edx, %eax
sall $2, %eax
addl %edx, %eax
is much better for the pipeline than this:
for( int i = 0; i < 7; i++ )
{
    sum = 5 * sum;
}
sum = sum + 5;
Assembly generated by g++ main.cpp -S:
movl $0, -4(%rbp)
movl $0, -8(%rbp)
.L3:
cmpl $6, -8(%rbp)
jg .L2
movl -4(%rbp), %edx
movl %edx, %eax
sall $2, %eax
addl %edx, %eax
movl %eax, -4(%rbp)
addl $1, -8(%rbp)
jmp .L3
.L2:
addl $5, -4(%rbp)
movl $0, %eax
addq $48, %rsp
popq %rbp
Because each time the loop iterates:
- It needs to perform the check if( i < 7 ).
- With branch prediction, for the loop above we could assume that the first prediction will fail.
- The instruction sum = sum + 5 will be discarded.
- The next time, the pipeline will do sum = 5 * sum,
- until the condition if( i < 7 ) fails.
- Then sum = 5 * sum will be discarded.
- And sum = sum + 5 will finally be processed.
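For anyone who wants to reproduce the two listings, here is a minimal sketch that puts both variants in one file. The function names unrolled and looped are made up for this example, and taking sum as a parameter is an assumption (the loop listing above initializes it to 0 inside main).

```cpp
// Hypothetical names; each function mirrors one snippet from the question.

// Straight-line variant: seven multiplications, each depending on the
// result of the previous one, with no branches.
int unrolled(int sum) {
    sum = 5 * sum;
    sum = 5 * sum;
    sum = 5 * sum;
    sum = 5 * sum;
    sum = 5 * sum;
    sum = 5 * sum;
    sum = 5 * sum;
    return sum;
}

// Loop variant: the same seven multiplications, plus a compare-and-branch
// on every iteration, followed by the trailing addition from the question.
int looped(int sum) {
    for (int i = 0; i < 7; i++) {
        sum = 5 * sum;
    }
    sum = sum + 5;
    return sum;
}
```

Compiling this file with g++ -S (no optimization flags, as above) should give assembly comparable to the two listings. Starting from sum = 1, unrolled returns 78125 (5 to the 7th power) and looped returns 78130 because of the extra sum = sum + 5.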