
I'm trying to figure out the best (maybe AVX?) optimization for this code:

typedef struct {
  float x;
  float y;
} vector;

vector add(vector u, vector v){
  return (vector){u.x+v.x, u.y+v.y};
}

Running `gcc -S code.c` gives quite long assembly code:

    .file   "code.c"
    .text
    .globl  add
    .type   add, @function
add:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movq    %rdi, -8(%rbp)
    movss   16(%rbp), %xmm1
    movss   48(%rbp), %xmm0
    addss   %xmm0, %xmm1
    movss   32(%rbp), %xmm2
    movss   64(%rbp), %xmm0
    addss   %xmm2, %xmm0
    movq    -8(%rbp), %rax
    movss   %xmm1, (%rax)
    movq    -8(%rbp), %rax
    movss   %xmm0, 16(%rax)
    movq    -8(%rbp), %rax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   add, .-add
    .ident  "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609"
    .section    .note.GNU-stack,"",@progbits

while I expected very few instructions for such a simple task. Could someone help me optimize this kind of code, while keeping `float` types?

Thanks.

Fabio
  • Too tired for a full answer, but look at SSE intrinsics. Also use `const vector &u` etc. to help the compiler. – starmole Nov 16 '16 at 09:43
  • Actually that is "very few instructions" - you can remove some of the overhead though by making your function `inline` (see the sketch after these comments). Also `clang` with `-O3` seems to do a better job with this than `gcc`: https://godbolt.org/g/QRNOsL – Paul R Nov 16 '16 at 09:47
  • @PaulR well at this point why not macros? – Fabio Nov 16 '16 at 10:01
  • @Fabio: macros offer no real advantage over inline functions, and have a few disadvantages too. – Paul R Nov 16 '16 at 10:02
  • Using AoS layout is generally a bad idea to begin with – harold Nov 16 '16 at 10:52
  • @PaulR: clang's output isn't "safe" :/ It can raise FP exceptions that it shouldn't if there's garbage in the high halves of the input registers (which the ABI does allow). It needs to zero the high halves before it can use `addps`. It could also slow down if the upper-half results are denormal, which matters even if you don't care about FP exceptions, unless you have DAZ and FTZ enabled. (But without `-ffast-math`, the compiler can't assume either of those.) – Peter Cordes Nov 16 '16 at 17:00
  • @starmole: First of all, this is a C question. Second, passing this struct by const-reference would be strictly worse when it doesn't inline. The x86-64 SysV ABI passes this struct by value packed into an XMM register, so it's all set for a SIMD add. Passing in memory might help dumb compilers, but gcc still does a bad job passing it by pointer (compared to clang which does a good job in that case (https://godbolt.org/g/YbwF6t); loading with MOVQ solves its failure to zero-extend. It should probably use MOVSD though, in case some CPUs care about FP vs. int loads.) – Peter Cordes Nov 16 '16 at 17:06
  • @PeterCordes the link you included in your comment is really useful. I can do online optimization of my code. Thanks! – Fabio Nov 17 '16 at 19:29
  • You can do the same thing locally by looking at the asm output (see [this Q&A](http://stackoverflow.com/questions/38552116/how-to-remove-noise-from-gcc-clang-assembly-output)), but yeah Matt Godbolt's site is pretty damn cool, and has a nicer UI when the code is small enough that you don't need to search for the right symbol after every recompile. – Peter Cordes Nov 17 '16 at 19:51
  • @PeterCordes Yes, I missed that it's C, and the requirement to keep floats in the struct. So what is the right answer? The only thing I can think of then is to turn the function into a #define. What do you suggest? – starmole Nov 22 '16 at 06:56
  • @starmole: Already posted my proposed optimal ABI-compliant implementation [as a comment on Maxim's answer](http://stackoverflow.com/questions/40628195/data-alignment-in-structure-and-avx-optimization?noredirect=1#comment68507860_40630047). Obviously it would be much more efficient when inlined, so the compiler could take care of the upper halves of the registers and use just one ADDPS. Most of the cost of a stand-alone implementation is either unpacking/repacking or zeroing the upper halves! You don't need macros just to inline code these days; a `static inline` function is good. – Peter Cordes Nov 22 '16 at 07:09
  • @PeterCordes It's the right assembly for that case, but how do you get the compiler to emit it? In this case gcc? How do you do it portably? – starmole Nov 22 '16 at 07:24
  • @starmole: you inline it, or use `-flto` so it can be inlined at link-time. Or if you need an efficient stand-alone definition (maybe to pass as a callback to something in another library?), and you can live with extra FP exceptions, you could just write the stand-alone definition entirely in asm. (No point using inline-asm; you don't want the compiler to use an inline-asm definition for any case where it *can* inline.) ... – Peter Cordes Nov 22 '16 at 07:36
  • ... Perhaps you could do something like declare the function as taking `struct` args that have 4 float members so it doesn't think there are any unused vector elements. But IDK if you have to just flat-out lie to the compiler (callers see a prototype that doesn't match the definition), or if the right amount of casting can do anything useful. IDK, it doesn't seem worth worrying about unless you have a specific use-case where it can't inline to one instruction. – Peter Cordes Nov 22 '16 at 07:36
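
A minimal sketch of the `static inline` approach suggested in the comments above; the `sum_components` caller is hypothetical, added only to illustrate what inlining buys:

typedef struct {
  float x;
  float y;
} vector;

/* static inline lets the compiler paste the body into each caller,
   so the struct never needs to be packed/unpacked to match the ABI */
static inline vector add(vector u, vector v) {
  return (vector){u.x + v.x, u.y + v.y};
}

/* hypothetical caller: after inlining at -O2/-O3, u and v stay in
   registers and the call overhead disappears entirely */
float sum_components(vector u, vector v) {
  vector w = add(u, v);
  return w.x + w.y;
}

To inspect the generated code locally, compile with optimization enabled, e.g. `gcc -O3 -S code.c`; note that the long listing in the question is what `gcc -S` emits without any `-O` flag, i.e. unoptimized output.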

1 Answer


gcc can generate better code when you use its vector extensions (documented under "Using Vector Instructions through Built-in Functions" in the gcc manual):

typedef float v2f __attribute__((vector_size(8)));

v2f add(v2f u, v2f v) {
  return u + v;
}

Produces:

add(float __vector(2), float __vector(2)):
    movlps  %xmm0, -32(%rsp)
    movlps  %xmm1, -40(%rsp)
    movss   -32(%rsp), %xmm0
    addss   -40(%rsp), %xmm0
    movss   %xmm0, -56(%rsp)
    movss   -28(%rsp), %xmm0
    addss   -36(%rsp), %xmm0
    movss   %xmm0, -52(%rsp)
    movlps  -56(%rsp), %xmm0
    ret

This is still inefficient, because the additions are performed one element at a time with scalar instructions, with extra trips through the stack to unpack and repack the vectors.


xmm registers are 128 bits wide, so to utilize them fully the code needs to operate on 128-bit units.

In 3D graphics, coordinates are usually 4-element float vectors {x, y, z, w}, which makes xmm registers a perfect fit. E.g.:

typedef float v4f __attribute__((vector_size(16)));

v4f add(v4f u, v4f v) {
    return u + v;
}

That produces the following assembly for function add:

add(float __vector(4), float __vector(4)):
    addps   %xmm1, %xmm0
    ret
Maxim Egorushkin
  • For the application I have in mind, I need just 2D vectors, and the float type is enough. However, your example was enough to understand (and maybe overcome) the issue. Thank you very much! F – Fabio Nov 16 '16 at 10:47
  • I wonder where I can find a low-level instruction reference. For instance, would it be possible with just one instruction to compute the sum of u*v and take the square root, something like a=sqrt(u[0]*v[0]+u[1]*v[1]+...)? (see the sketch after these comments) – Fabio Nov 16 '16 at 11:05
  • What the crap? That asm is garbage. Hard to believe it's actually from gcc 5.4 `-O3`. Both adds are done separately with scalar. 4 separate stores to the stack, and a store-forwarding stall to load the result with MOVLPS after separate MOVSS stores. (And MOVLPS has a false dependency on the destination register; MOVSD would be faster.) – Peter Cordes Nov 16 '16 at 16:53
  • Optimal would be something like `movq %xmm0, %xmm0` / `movq %xmm1, %xmm1` or similar to zero the high halves, then `addps %xmm1, %xmm0` / `ret`. You could write something like that with intrinsics (see the sketch after these comments). – Peter Cordes Nov 16 '16 at 16:54
  • @PeterCordes Yep, it could just ignore the unused packed floats in the xmm registers. – Maxim Egorushkin Nov 16 '16 at 16:55
  • It can't ignore them, it does have to zero them because the ABI allows them to be garbage on input, leading to spurious FP exceptions or slowdowns from denormals. But after zeroing them, then yes it can do what clang unsafely does (see comments on the question). – Peter Cordes Nov 16 '16 at 17:09
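
For completeness, here is one way to write the ABI-safe stand-alone version sketched in the comments above, using SSE intrinsics. Treat it as a sketch under stated assumptions: the `add_safe` name is made up, and whether a given gcc version actually compiles it down to the ideal MOVQ/MOVQ/ADDPS/RET sequence is not guaranteed.

#include <xmmintrin.h>  /* SSE intrinsics */

typedef struct {
  float x;
  float y;
} vector;

vector add_safe(vector u, vector v) {
  /* Load each two-float struct into the low 64 bits of an XMM register;
     _mm_setzero_ps() supplies zeroed upper halves, so the packed add
     cannot raise spurious FP exceptions on ABI-allowed garbage. */
  __m128 a = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)&u);
  __m128 b = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)&v);
  __m128 r = _mm_add_ps(a, b);
  vector out;
  _mm_storel_pi((__m64 *)&out, r);
  return out;
}

As for the dot-product-plus-square-root question in the comments: SSE4.1 added DPPS, which performs the multiplies and the horizontal sum in a single instruction (though it decodes to several µops, so it is not automatically the fastest option), and SQRTSS finishes the job. A sketch, assuming 4-element vectors and compilation with -msse4.1:

#include <smmintrin.h>  /* SSE4.1, for _mm_dp_ps */

/* computes sqrt(u[0]*v[0] + u[1]*v[1] + u[2]*v[2] + u[3]*v[3]);
   the 0xF1 mask means: multiply all four lanes, put the sum in lane 0 */
float dot_sqrt(__m128 u, __m128 v) {
  __m128 d = _mm_dp_ps(u, v, 0xF1);
  return _mm_cvtss_f32(_mm_sqrt_ss(d));
}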