Assembly x86-64 - movupd with -O3

Question

I'm doing an exercise on x86-64 assembly, so I generated assembly code with GCC using the -O3 option.

The original code written in C:

double dotprod(double *a, double *b, unsigned long long n)
{
  double d1 = 0.0;
  double d2 = 0.0;

  for (unsigned long long i = 0; i < n; i += 2) {
    d1 += a[i] * b[i];
    d2 += a[i + 1] * b[i + 1];
  }

  return (d1 + d2);
}

Part of assembly code:

.L4:
    movupd  (%rdi,%rax), %xmm3
    movupd  (%rsi,%rax), %xmm2
    movupd  16(%rdi,%rax), %xmm1
    movlpd  8(%rdi,%rax), %xmm1
    movhpd  16(%rdi,%rax), %xmm3
    movhpd  16(%rsi,%rax), %xmm2
    mulpd   %xmm3, %xmm2
    addsd   %xmm2, %xmm4
    unpckhpd    %xmm2, %xmm2
    addsd   %xmm2, %xmm4
    movupd  16(%rsi,%rax), %xmm2
    movlpd  8(%rsi,%rax), %xmm2
    addq    $32, %rax
    mulpd   %xmm2, %xmm1
    movapd  %xmm1, %xmm2
    unpckhpd    %xmm1, %xmm1
    addsd   %xmm0, %xmm2
    movapd  %xmm1, %xmm0
    addsd   %xmm2, %xmm0
    cmpq    %rdx, %rax
    jne .L4
    movq    %rcx, %rdx
    andq    $-2, %rdx
    leaq    (%rdx,%rdx), %rax
    cmpq    %rcx, %rdx
    je  .L5

I wonder: what is the int we can read in some instructions like movupd 16(%rdi,%rax), %xmm1 (the 16 in this case)?

_"what is the int we can read in some instructions like movupd 16(%rdi,%rax), %xmm1"_ A displacement. See section _3.7.5 Specifying an Offset_ in Intel's manual. — Michael, Dec 15 '20 at 13:16
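To make the displacement concrete, here is a minimal C sketch of the address that an AT&T operand of the form disp(base,index,scale) refers to. The helper names `operand_address` and `movupd_16_rdi_rax` are hypothetical and only for illustration. It assumes, as in the loop above, that %rdi holds the pointer `a` and that %rax is a byte offset that advances by 32 per iteration.

```c
#include <stddef.h>

/* Hypothetical helper (not from the thread): the address that the AT&T
   operand disp(base,index,scale) refers to, i.e. base + index*scale + disp.
   All quantities are in bytes. */
static const double *operand_address(const double *base, size_t index,
                                     size_t scale, ptrdiff_t disp)
{
    const char *bytes = (const char *)base;
    return (const double *)(bytes + index * scale + disp);
}

/* movupd 16(%rdi,%rax), %xmm1 with %rdi = a and %rax = byte_offset:
   no scale is written, so it defaults to 1, and the displacement is 16.
   The load reads the two doubles starting at that address. */
static const double *movupd_16_rdi_rax(const double *a, size_t byte_offset)
{
    return operand_address(a, byte_offset, 1, 16);
}
```

With 8-byte doubles, `movupd_16_rdi_rax(a, 0)` points at `a[2]`, so the 16 is simply a constant byte offset of two doubles added to the computed address.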
It's not loading an `int`, it's loading 2 doubles. That's why GCC used a `pd` instruction, instead of `movdqu` or the more compact `movups` which would have been a better choice here. But it's weird that the next instruction merges into low half of that load with `movlpd`. IDK what the point of doing a 16-byte load in the first place was, vs. `movsd` / `movhpd` if GCC wants to get 2 non-contiguous doubles into XMM1. — Peter Cordes, Dec 15 '20 at 15:19
clang sorts out your manual unrolling back into sane vectorization. GCC makes a total mess. https://godbolt.org/z/s67dG6. This is a nasty missed optimization / anti-optimization. GCC7 and earlier just use scalar, which is probably less bad than what GCC8 and later do (what you've shown). GCC `-ffast-math` gets it right, but simple vectorization already does the operation in source order. — Peter Cordes, Dec 15 '20 at 15:25
I wonder if GCC is basically trying to vectorize `d1` and `d2` separately, instead of mapping those C variables to two halves of a single vector? — Peter Cordes, Dec 15 '20 at 15:32
Reported as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98291 — Peter Cordes, Dec 15 '20 at 15:51
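The idea raised in the comments above, mapping `d1` and `d2` to the two halves of a single vector, can be sketched in C as follows. This is an illustration only, not verified compiler output: it uses the GCC/Clang `vector_size` extension, and the names `v2df` and `dotprod_vec` are hypothetical. Like the original loop, it reads `a[i + 1]`, so it assumes `n` is even.

```c
#include <string.h>

/* GCC/Clang vector extension: two doubles in one 16-byte vector. */
typedef double v2df __attribute__((vector_size(16)));

/* Sketch: same arithmetic as dotprod(), with d1 kept in element 0 and d2 in
   element 1 of a single accumulator. Assumes n is even, as the original
   loop effectively does. */
double dotprod_vec(const double *a, const double *b, unsigned long long n)
{
    v2df acc = {0.0, 0.0};

    for (unsigned long long i = 0; i < n; i += 2) {
        v2df va, vb;
        memcpy(&va, &a[i], sizeof va);   /* unaligned 16-byte load of a[i], a[i+1] */
        memcpy(&vb, &b[i], sizeof vb);   /* unaligned 16-byte load of b[i], b[i+1] */
        acc += va * vb;                  /* element 0: d1 += a[i]*b[i]; element 1: d2 += a[i+1]*b[i+1] */
    }

    return acc[0] + acc[1];              /* d1 + d2, the only horizontal step */
}
```

Because each accumulator element receives the same additions in the same order as `d1` and `d2` in the source, this form does not rely on reassociating floating-point operations. It needs one packed multiply and one packed add per iteration, with the horizontal sum done once after the loop.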
