
Thanks for all the comments so far. I am sorry that I used a poor example in my original question, one to which almost everyone would say: "Oh, you should use memcpy!" But that is not what my question is about.

My question is more generic: how should manual loop unrolling be done? Consider this example instead, which sums all elements of an array:

#include <stdlib.h>

double sum (size_t n, double *x) {
  size_t nr = n & 1;
  double *end = x + (n - nr);
  double sum_x = 0.0;
  for (; x < end; x++) sum_x += *x;
  if (nr) sum_x += *x;
  return sum_x;
  }

The compiler-generated assembly exhibits behaviour similar to what is shown by the array-copying example in my original question:

sum:
  movq %rdi, %rcx
  andl $1, %ecx
  subq %rcx, %rdi
  leaq (%rsi,%rdi,8), %rdx
  cmpq %rdx, %rsi
  jnb .L5
  movq %rsi, %rax
  pxor %xmm0, %xmm0
.L3:
  addsd (%rax), %xmm0
  addq $8, %rax
  cmpq %rax, %rdx
  ja .L3
  movq %rsi, %rax
  notq %rax
  addq %rax, %rdx
  shrq $3, %rdx
  leaq 8(%rsi,%rdx,8), %rsi
.L2:
  testq %rcx, %rcx
  je .L1
  addsd (%rsi), %xmm0
.L1:
  ret
.L5:
  pxor %xmm0, %xmm0
  jmp .L2

However, if I now schedule the "fractional" part ahead of the main loop (as I later worked out in the answer I posted below), the compiler does a much better job.

#include <stdlib.h>

double sum (size_t n, double *x) {
  size_t nr = n & 1;
  double *end = x + n;
  double sum_x = 0.0;
  if (nr) sum_x += *x;
  for (x += nr; x < end; x++) sum_x += *x;
  return sum_x;
  }

sum:
  leaq (%rsi,%rdi,8), %rdx
  pxor %xmm0, %xmm0
  andl $1, %edi
  je .L2
  addsd (%rsi), %xmm0
.L2:
  leaq (%rsi,%rdi,8), %rax
  cmpq %rax, %rdx
  jbe .L1
.L4:
  addsd (%rax), %xmm0
  addq $8, %rax
  cmpq %rax, %rdx
  ja .L4
.L1:
  ret

I have only used the compiler flag -O2, so, as Peter said, the compiler-generated assembly should stay close to the C source. The question is then: why does the compiler do better in the latter case?

This is not really a performance-related question. It is just something I stumbled upon (and can't explain) while checking the compiler's assembly output for code from a C project I have been writing. Thanks again, and thanks to Peter for proposing a better title for the question.


Original question:

The following small C function copies a, a vector of n entries, to b. Manual loop unrolling of depth 2 is applied.

#include <stddef.h>

void foo (ptrdiff_t n, double *a, double *b) {
  ptrdiff_t i = 0;
  ptrdiff_t nr = n & 1;
  n -= nr;                  // `n` is an even integer
  while (i < n) {
    b[i] = a[i];
    b[i + 1] = a[i + 1];
    i += 2;
    }                       // `i = n` when the loop ends
  if (nr) b[i] = a[i];
  }

It gives the following x64 assembly under gcc -O2 (any gcc version 5.4+). However, I find the commented part of the output weird. Why does the compiler generate it at all?

foo:
  movq %rdi, %rcx
  xorl %eax, %eax
  andl $1, %ecx
  subq %rcx, %rdi
  testq %rdi, %rdi
  jle .L11
.L12:
  movsd (%rsi,%rax,8), %xmm0
  movsd %xmm0, (%rdx,%rax,8)
  movsd 8(%rsi,%rax,8), %xmm0
  movsd %xmm0, 8(%rdx,%rax,8)
  addq $2, %rax
  cmpq %rax, %rdi           // `i` in %rax, `n` in %rdi
  jg .L12                   // the loop ends, with `i = n`, BELOW IS WEIRD
  subq $1, %rdi             // n = n - 1;
  shrq %rdi                 // n = n / 2;
  leaq 2(%rdi,%rdi), %rax   // i = 2 * n + 2;  (this is just `i = n`, isn't it?)
.L11:
  testq %rcx, %rcx
  je .L10
  movsd (%rsi,%rax,8), %xmm0
  movsd %xmm0, (%rdx,%rax,8)
.L10:
  ret
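
In C terms, my reading of those three commented instructions is that they merely rebuild the exit value of i from n, even though %rax already holds i = n when the loop exits. A minimal sketch of that arithmetic (my own illustration, not part of the project code):

#include <assert.h>
#include <stddef.h>

ptrdiff_t epilogue_index (ptrdiff_t n) {
  // what `subq $1, %rdi; shrq %rdi; leaq 2(%rdi,%rdi), %rax` computes
  return 2 * ((n - 1) >> 1) + 2;
  }

int main (void) {
  // for every even n > 0 (the only case in which the loop body runs),
  // the recomputed value is simply n again
  for (ptrdiff_t n = 2; n <= 1000; n += 2)
    assert (epilogue_index (n) == n);
  return 0;
  }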

A version using size_t instead of ptrdiff_t gives something similar:

#include <stdlib.h>

void bar (size_t n, double *a, double *b) {
  size_t i = 0;
  size_t nr = n & 1;
  n -= nr;                  // `n` is an even integer
  while (i < n) {
    b[i] = a[i];
    b[i + 1] = a[i + 1];
    i += 2;
    }                       // `i = n` when the loop ends
  if (nr) b[i] = a[i];
  }

bar:
  movq %rdi, %rcx
  andl $1, %ecx
  subq %rcx, %rdi
  je .L20
  xorl %eax, %eax
.L21:
  movsd (%rsi,%rax,8), %xmm0
  movsd %xmm0, (%rdx,%rax,8)
  movsd 8(%rsi,%rax,8), %xmm0
  movsd %xmm0, 8(%rdx,%rax,8)
  addq $2, %rax
  cmpq %rax, %rdi           // `i` in %rax, `n` in %rdi
  ja .L21                   // the loop ends, with `i = n`, BUT BELOW IS WEIRD
  subq $1, %rdi             // n = n - 1;
  andq $-2, %rdi            // n = n & (-2);
  addq $2, %rdi             // n = n + 2;  (this is just `i = n`, isn't it?)
.L20:
  testq %rcx, %rcx
  je .L19
  movsd (%rsi,%rdi,8), %xmm0
  movsd %xmm0, (%rdx,%rdi,8)
.L19:
  ret

And here is another equivalent version,

#include <stdlib.h>

void baz (size_t n, double *a, double *b) {
  size_t nr = n & 1;
  n -= nr;
  double *b_end = b + n;
  while (b < b_end) {
    b[0] = a[0];
    b[1] = a[1];
    a += 2;
    b += 2;
    }                       // `b = b_end` when the loop ends
  if (nr) b[0] = a[0];
  }

but the following assembly looks even odder (though still produced under -O2). Now n, a and b are all copied, and when the loop ends, it takes 5 instructions just to end up with b_copy = 0?!

baz:                        // initially, `n` in %rdi, `a` in %rsi, `b` in %rdx
  movq %rdi, %r8            // n_copy = n;
  andl $1, %r8d             // nr = n_copy & 1;
  subq %r8, %rdi            // n_copy -= nr;
  leaq (%rdx,%rdi,8), %rdi  // b_end = b + n;
  cmpq %rdi, %rdx           // if (b >= b_end) jump to .L31
  jnb .L31
  movq %rdx, %rax           // b_copy = b;
  movq %rsi, %rcx           // a_copy = a;
.L32:
  movsd (%rcx), %xmm0
  addq $16, %rax
  addq $16, %rcx
  movsd %xmm0, -16(%rax)
  movsd -8(%rcx), %xmm0
  movsd %xmm0, -8(%rax)
  cmpq %rax, %rdi           // `b_copy` in %rax, `b_end` in %rdi
  ja .L32                   // the loop ends, with `b_copy = b_end`
  movq %rdx, %rax           // b_copy = b;
  notq %rax                 // b_copy = ~b_copy;
  addq %rax, %rdi           // b_end = b_end + b_copy;
  andq $-16, %rdi           // b_end = b_end & (-16);
  leaq 16(%rdi), %rax       // b_copy = b_end + 16;
  addq %rax, %rsi           // a += b_copy;   (isn't `b_copy` just 0?)
  addq %rax, %rdx           // b += b_copy;
.L31:
  testq %r8, %r8            // if (nr == 0) jump to .L30
  je .L30
  movsd (%rsi), %xmm0       // xmm0 = a[0];
  movsd %xmm0, (%rdx)       // b[0] = xmm0;
.L30:
  ret

Can anyone explain what the compiler has in mind in all three cases?

Zheyuan Li
  • Do you expect to be able to understand assembly code produced by an optimising compiler? Long ago I gave that up as futile. – Yunnosch Jul 04 '18 at 08:21
  • you tend to see weird things happening with compiler optimization, because compiler optimization does weird things to optimize code. These things could include loop rollouts, ordering things for better caching etc. – Nova Ardent Jul 04 '18 at 08:31
  • @NovaArdent: but none of those things are happening here; it's not auto-vectorizing or unrolling (because `-O2` doesn't include `-ftree-vectorize` or `-funroll-loops`, and the memcpy pattern detector can't work because they didn't use `restrict`), so what the asm is doing is pretty close to the C source. The question isn't about scheduling instructions either, just about why `n -= (n&1)` compiles the way it does. – Peter Cordes Jul 04 '18 at 09:10
  • @PeterCordes I never said those specific things were happening in this instance. IIRC though -gX in gcc will enable some of these flags will it not? – Nova Ardent Jul 04 '18 at 09:12
  • @NovaArdent: no, `-g` options don't affect code-gen, only debug metadata. – Peter Cordes Jul 04 '18 at 09:13
  • @PeterCordes Alright, well learning is what this site is about, so I'll leave my comments here so yours make sense, and so the small conversation can be a learning moment – Nova Ardent Jul 04 '18 at 09:15
  • @NovaArdent: yup that's fine. I just get annoyed when people ask interesting detailed compiler / performance questions and the first responses are "it's too complicated for mortals", and your comment was along those lines, too. (The question could use a better summary of what the instructions in question are doing, and a link to Godbolt, though. i.e. a good title might be "why does gcc's code-gen for my unrolled loop epilogue look over-complicated?") – Peter Cordes Jul 04 '18 at 09:20
  • Now with the updated question, the answer is: in the first case, the value of nr has to be remembered until after the loop, and the extra calculation takes place. In the second case, nr is dispensed with permanently as soon as possible (and there is no tail calculation), allowing analysis of a less complex situation and thus likely better assembly. – Gunther Schulz Jul 04 '18 at 10:42

2 Answers


It looks like the compiler can generate neater code if I unroll the loop in the following manner.

#include <stdlib.h>
#include <stddef.h>

void foo (ptrdiff_t n, double *a, double *b) {
  ptrdiff_t i = n & 1;
  if (i) b[0] = a[0];
  while (i < n) {
    b[i] = a[i];
    b[i + 1] = a[i + 1];
    i += 2;
    }
  }

void bar (size_t n, double *a, double *b) {
  size_t i = n & 1;
  if (i) b[0] = a[0];
  while (i < n) {
    b[i] = a[i];
    b[i + 1] = a[i + 1];
    i += 2;
    }
  }

void baz (size_t n, double *a, double *b) {
  size_t nr = n & 1;
  double *b_end = b + n;
  if (nr) b[0] = a[0];
  b += nr;
  while (b < b_end) {
    b[0] = a[0];
    b[1] = a[1];
    a += 2;
    b += 2;
    }
  }

foo:
  movq %rdi, %rax
  andl $1, %eax
  je .L9
  movsd (%rsi), %xmm0
  movsd %xmm0, (%rdx)
  cmpq %rax, %rdi
  jle .L11
.L4:
  movsd (%rsi,%rax,8), %xmm0
  movsd %xmm0, (%rdx,%rax,8)
  movsd 8(%rsi,%rax,8), %xmm0
  movsd %xmm0, 8(%rdx,%rax,8)
  addq $2, %rax
.L9:
  cmpq %rax, %rdi
  jg .L4
.L11:
  ret

bar:
  movq %rdi, %rax
  andl $1, %eax
  je .L20
  movsd (%rsi), %xmm0
  movsd %xmm0, (%rdx)
  cmpq %rax, %rdi
  jbe .L21
.L15:
  movsd (%rsi,%rax,8), %xmm0
  movsd %xmm0, (%rdx,%rax,8)
  movsd 8(%rsi,%rax,8), %xmm0
  movsd %xmm0, 8(%rdx,%rax,8)
  addq $2, %rax
.L20:
  cmpq %rax, %rdi
  ja .L15
.L21:
  ret

baz:
  leaq (%rdx,%rdi,8), %rcx
  andl $1, %edi
  je .L23
  movsd (%rsi), %xmm0
  movsd %xmm0, (%rdx)
.L23:
  leaq (%rdx,%rdi,8), %rax
  cmpq %rax, %rcx
  jbe .L22
.L25:
  movsd (%rsi), %xmm0
  addq $16, %rax
  addq $16, %rsi
  movsd %xmm0, -16(%rax)
  movsd -8(%rsi), %xmm0
  movsd %xmm0, -8(%rax)
  cmpq %rax, %rcx
  ja .L25
.L22:
  ret
Zheyuan Li
  • This is mostly academic / curiosity, right? If you actually want it to run fast, you'd use `restrict` so it can recognize it as `memcpy` and inline an optimized copy, or call the library function. And/or compile with full optimization (`-O3`) instead of only partial optimization (`-O2`). `-O3 -march=native` is good if you want something tuned for the computer it's compiled on, using whatever SSE/AVX versions it supports. – Peter Cordes Jul 04 '18 at 09:16
  • The goal for optimized code is not to be "neat", it is to be *optimized*. – Jongware Jul 04 '18 at 09:16
  • @PeterCordes Yes, just want a reasonable explanation. In fact, this kind of loop is memory-bound, so loop unrolling and SIMD would not be beneficial at all. – Zheyuan Li Jul 04 '18 at 09:19
  • @usr2564301: this code does look better optimized than the code-gen in the question, if the branch predicts well. If not, then maybe it would be best to have the final-iteration check after the loop, where a mispredict can resolve while a bunch of copies are still in flight. – Peter Cordes Jul 04 '18 at 09:22
  • @李哲源: memory is fast these days, especially if any of your data is hot in L3. If it's not, and you're purely tuning this for gigantic copies, you should be using NT stores to bypass the cache instead of triggering RFO for the output. Copying in 8-byte chunks is also needlessly unfriendly for hyperthreading, filling up the pipeline with more uops to slow down the other logical core sharing the same physical core. Smaller chunks mean out-of-order execution doesn't run as far ahead, so you aren't triggering page-walks as far ahead of where you're reading when close to a page boundary. – Peter Cordes Jul 04 '18 at 09:28
  • See also [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) for more about x86 memory bandwidth. – Peter Cordes Jul 04 '18 at 09:29
  • I just tried your `foo` vs. glibc `memcpy` for large copies, on my Skylake i7-6700k (8MiB cache) with dual channel DDR4-2400. For 100000000 bytes (100MB) copied 1k times, your version takes 9.77+-0.5% sec, while memcpy takes 6.68+-0.4% seconds. (The CPU only clocked at ~3.3GHz for memcpy, but 3.7GHz for yours, because I didn't force it to max clock speed like it would if there was real work on other cores.) actual test code: https://godbolt.org/g/DYtKTR. Strangely, for copy sizes around 5M (fits in L3 cache), the difference is much smaller. But for 500kB * 10k iter, it's 0.24 vs. 0.15 – Peter Cordes Jul 04 '18 at 09:57
  • Cool, yeah, I thought it was interesting even though the loop being unrolled in this case was just memcpy. Hopefully your edit makes that clear to more readers. – Peter Cordes Jul 04 '18 at 10:30

If you're asking why the assembly is relatively large, it's because the compiler can't assume what you may know.

For example, if you know the source array will not be modified during the copy, tell the compiler so by adding a const qualifier to the pointed-to source data.

void foo (ptrdiff_t n, double const *a, double *b)

Further, if you know the two memory ranges will never overlap, add a restrict qualifier to each of the two pointers.

void foo (ptrdiff_t n, double const *restrict a, double *restrict b)

Ultimately, if you want the most optimized copy (compiler vendors spend a LOT of time on this), use memcpy for non-overlapping ranges, and memmove for overlapping ranges.
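
For illustration, here is a minimal sketch of what that might look like; the function names are my own examples, and n is assumed to count double elements as in the question:

#include <stddef.h>
#include <string.h>

// const marks the source read-only; restrict promises the ranges are
// disjoint, which lets the compiler treat the whole loop as a memcpy
void copy_loop (ptrdiff_t n, double const *restrict a, double *restrict b) {
  for (ptrdiff_t i = 0; i < n; i++) b[i] = a[i];
  }

// or call the library directly: memcpy for non-overlapping ranges ...
void copy_disjoint (ptrdiff_t n, double const *a, double *b) {
  memcpy (b, a, (size_t) n * sizeof *a);
  }

// ... and memmove when the ranges might overlap
void copy_overlapping (ptrdiff_t n, double const *a, double *b) {
  memmove (b, a, (size_t) n * sizeof *a);
  }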

Gunther Schulz
  • It doesn't auto-vectorize because the OP didn't enable auto-vectorization (`-O2`). The question was about the address calc for the cleanup epilogue outside the loop, not the main loop anyway. Your first sentence is the answer to the question, though, for the unsigned part. – Peter Cordes Jul 04 '18 at 09:04