
Why are serializing instructions inherently pipeline-unfriendly?

In another answer [ Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs ] the following was stated:

Time each iteration independently, with something even heavier than RDTSC. e.g. CPUID / RDTSC or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly.

I think it should be the opposite: serialized instructions are very good for the pipeline. For example,

sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;

Assembly generated by g++ main.cpp -S:

addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax

is much better for the pipeline than:

for( int i = 0; i < 7; i++ )
{
    sum = 5 * sum;
}

sum = sum + 5;

Assembly generated by g++ main.cpp -S:

    movl    $0, -4(%rbp)
    movl    $0, -8(%rbp)
.L3:
    cmpl    $6, -8(%rbp)
    jg  .L2
    movl    -4(%rbp), %edx
    movl    %edx, %eax
    sall    $2, %eax
    addl    %edx, %eax
    movl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    jmp .L3
.L2:
    addl    $5, -4(%rbp)
    movl    $0, %eax
    addq    $48, %rsp
    popq    %rbp

Because on each pass through the loop:

  1. The condition if( i < 7 ) has to be evaluated.
  2. With branch prediction, we can assume that the first prediction for the loop above will fail.
  3. The instruction sum = sum + 5 will be discarded.
  4. On the next pass the pipeline will execute sum = 5 * sum,
  5. until the condition if( i < 7 ) fails.
  6. Then sum = 5 * sum will be discarded,
  7. and sum = sum + 5 will finally be executed.
Evandro Coan
    You've misunderstood what "serializing instructions" are. They're a specific classification of CPU instructions that "serialize" the CPU. They cause the CPU to wait until the pipeline is empty before executing them. CPUID is an example of a serializing instruction. None of the instructions you've used in your question are serializing instructions. – Ross Ridge Mar 04 '17 at 23:31

2 Answers


You confused “serialized” with “serializing.” A serializing instruction is one that guarantees a data ordering, i.e. everything before this instruction happens before everything after this instruction.

This is bad news for superscalar and pipelined processors, which usually don't make this guarantee and have to make special accommodations for it, e.g. by flushing the pipeline or by waiting for all execution units to finish.

Incidentally, this is sometimes exactly what you want in a benchmark, as it forces the pipeline into a predictable state with all execution units ready to execute your code; no stale writes from before the benchmark can cause performance deviations.
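To illustrate the benchmarking use, here is a minimal sketch of timing the question's multiply chain with a serializing CPUID placed in front of each RDTSC. It assumes GCC or Clang on x86 (inline asm plus __rdtsc from <x86intrin.h>) and falls back to std::chrono elsewhere. The names serialized_timestamp, multiply_chain, Timed and time_chain are hypothetical helpers, not taken from the question or from any library.

```cpp
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // __rdtsc (GCC/Clang)
#endif

// Hypothetical helper: take a timestamp only after earlier work has retired.
inline std::uint64_t serialized_timestamp() {
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    // CPUID is a serializing instruction: the CPU finishes every earlier
    // instruction before it continues past this point.
    asm volatile("cpuid"
                 : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                 : "a"(0), "c"(0)
                 : "memory");
    return __rdtsc();
#else
    // Assumption: on other targets an ordinary clock stands in for CPUID/RDTSC.
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// The work from the question: seven dependent multiplications by 5.
constexpr unsigned multiply_chain(unsigned sum) {
    for (int i = 0; i < 7; ++i) {
        sum = 5 * sum;
    }
    return sum;
}

struct Timed {
    std::uint64_t ticks;  // elapsed TSC ticks (or clock ticks in the fallback)
    unsigned result;      // value computed by the timed code
};

inline Timed time_chain(unsigned start_value) {
    // The volatile read and write keep the compiler from hoisting the work
    // out of the timed region or folding it away.
    volatile unsigned in = start_value;
    volatile unsigned out = 0;
    const std::uint64_t t0 = serialized_timestamp();
    out = multiply_chain(in);
    const std::uint64_t t1 = serialized_timestamp();
    return {t1 - t0, out};
}
```

Calling time_chain(3u) returns the elapsed ticks together with the computed value. The tick count includes the cost of CPUID itself, which is large compared with seven multiplications; that overhead is exactly why the quoted advice calls this kind of timing pipeline-unfriendly.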

fuz

I think he meant serialization in the sense of a data dependency.

sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;

would be slower than the parallel version:

sum1 = 5 * sum1;
sum2 = 5 * sum2;
sum1 = 5 * sum1;
sum2 = 5 * sum2;
sum1 = 5 * sum1;
sum2 = 5 * sum2;
sum1 = 5 * sum1;
sum = sum2*sum1;

because the CPU has multiple pipelines, and each pipeline can keep several instructions in flight, so many independent accumulators (sum1, sum2, ... sum8) could be issued at the same time.
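As a minimal sketch of the two variants above, the following C++ writes both out as functions so they can be compared. The function names and the starting values of the accumulators (sum1 starts from the input, sum2 from 1) are assumptions, since the snippets above do not show any initialization. Unsigned arithmetic is used so that overflow wraps instead of being undefined.

```cpp
// One dependency chain: every multiplication needs the previous result,
// so the seven multiplications cannot overlap.
constexpr unsigned chained(unsigned sum) {
    sum = 5 * sum;
    sum = 5 * sum;
    sum = 5 * sum;
    sum = 5 * sum;
    sum = 5 * sum;
    sum = 5 * sum;
    sum = 5 * sum;
    return sum;
}

// Two independent chains: the sum1 and sum2 multiplications do not depend
// on each other, so an out-of-order CPU is free to overlap them.
constexpr unsigned two_accumulators(unsigned sum) {
    unsigned sum1 = sum;  // assumption: carries the starting value
    unsigned sum2 = 1;    // assumption: starts at the multiplicative identity
    sum1 = 5 * sum1;
    sum2 = 5 * sum2;
    sum1 = 5 * sum1;
    sum2 = 5 * sum2;
    sum1 = 5 * sum1;
    sum2 = 5 * sum2;
    sum1 = 5 * sum1;
    return sum2 * sum1;   // combine: 5^4 * sum times 5^3
}
```

Both functions multiply the input by 5 seven times in total, so they return the same value; only the length of the longest dependency chain differs (seven multiplications versus four plus the final combine). Whether that shows up as a speed difference depends on the compiler, which may well fold either version into a single multiplication when optimizing.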

If a serializing instruction takes long enough, it leaves the pipeline ready for measurement after N cycles, since no new instruction can start until the serializing instruction has completed.

huseyin tugrul buyukisik