
Using x86-64 gcc 11.2 on godbolt.org, this code...

typedef int v4i __attribute__ ((vector_size (16)));

typedef union {
    v4i v;
} int4;

int4 mul(int4 l, int4 r)
{
    return (int4){.v=l.v * r.v};
}

...produces this assembly (when compiled with -O3 -mavx)...

mul:
        vpmulld xmm0, xmm0, xmm1
        ret

However this code...

typedef int v4i __attribute__ ((vector_size (16)));

typedef union {
    v4i v;
    struct {int x,y,z,w;}; // this line is the change
    int i[4]; // this one too
} int4;

int4 mul(int4 l, int4 r)
{
    return (int4){.v=l.v * r.v};
}

...produces this assembly (when also compiled with -O3 -mavx)...

mul:
        mov     QWORD PTR [rsp-40], rdi
        mov     QWORD PTR [rsp-32], rsi
        vmovdqa xmm1, XMMWORD PTR [rsp-40]
        mov     QWORD PTR [rsp-24], rdx
        mov     QWORD PTR [rsp-16], rcx
        vpmulld xmm0, xmm1, XMMWORD PTR [rsp-24]
        vmovdqa XMMWORD PTR [rsp-40], xmm0
        mov     rax, QWORD PTR [rsp-40]
        mov     rdx, QWORD PTR [rsp-32]
        ret

x86-64 clang 13.0.1 gives similar results.

So my question is: how can I convince gcc (and/or clang) that these two blocks of code should produce the same output?

I've tried __attribute__ ((aligned)), removing either the int i[4]; or the struct, and applying __attribute__ ((packed)) to the struct. I even gave __attribute__ ((transparent_union)) a go. Whatever magic status __attribute__ ((vector_size (16))) bestows is broken by adding anything to the union.

burito
  • Almost a duplicate of [SSE vector wrapper type performance compared to bare \_\_m128](https://stackoverflow.com/q/36833462) but that's talking about 32-bit mode Windows calling conventions, not x86-64 System V. Related: [What is the calling convention for floating-point values in C for x86\_64 in System V?](https://stackoverflow.com/q/57859857) and also [C++ operator\[\] access to elements of SIMD (e.g. AVX) variable](https://stackoverflow.com/q/64282775) – Peter Cordes Mar 17 '22 at 15:49

2 Answers

I should say that I have never used this attribute myself and only checked the GCC documentation just now, but I found something there that I think will help with your problem.

From your code, I assume you want the union so that you can access each int of the vector separately. If that is the only reason, there is no need for int[4] or struct {int x,y,z,w;}; in the union, because vectors can be subscripted like arrays themselves:

#include <stdio.h>

typedef int v4i __attribute__ ((vector_size (16)));

typedef union {
    v4i v;
} int4;

int4 mul(int4 l, int4 r)
{
    int4 ret = (int4){.v=l.v * r.v};
    printf("%i %i %i %i", ret.v[0], ret.v[1], ret.v[2], ret.v[3]);
    return ret;
}

With this, the code is optimized the way you want. In addition, if you need byte-level access, a union with another vector type also works:

typedef int v4i __attribute__ ((vector_size (16)));
typedef unsigned char v4b __attribute__ ((vector_size (16))); // 16 unsigned chars, despite the name

struct i4s{int x,y,z,w;};

typedef union {
    v4i v;
    v4b v2;
} int4;

int4 mul(int4 l, int4 r)
{
    return (int4){.v=l.v * r.v};
}

It seems that the union keeps working this way as long as every member is a vector type. For example, even __m128i works too.
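If named access such as .x or .i[n] is still wanted, one possible arrangement (a sketch only, not something I have benchmarked) is to keep the type that crosses function boundaries vector-only and convert to a richer union inside the function. The names int4_view, view and dot below are hypothetical and do not come from the question:

```c
typedef int v4i __attribute__ ((vector_size (16)));

// Type used in function signatures: contains only the vector member.
typedef union {
    v4i v;
} int4;

// Hypothetical helper type: named and indexed views of the same 16 bytes.
// Intended for use inside functions only, not as a by-value parameter.
typedef union {
    v4i v;
    struct { int x, y, z, w; };
    int i[4];
} int4_view;

// Hypothetical helper: copy the vector into the view union.
static inline int4_view view(int4 a)
{
    return (int4_view){.v = a.v};
}

int4 mul(int4 l, int4 r)
{
    return (int4){.v = l.v * r.v};
}

// Example of named access: sum of the four lanes of l * r.
int dot(int4 l, int4 r)
{
    int4_view p = view(mul(l, r));
    return p.x + p.y + p.z + p.w;
}
```

Here mul keeps the vector-only signature from the first listing in the question, while the x, y, z, w names are only used on a local copy. Whether this fits a larger chain of swizzle unions is a separate question.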

Afshin
  • The code presented is a minimalist example of a much larger chain of similar unions, with as far as is practical, GLSL style swizzling into smaller types. For example, there is a `struct { int2 xy, zw;};` and `struct { int3 xyz; int __w2; };`. Also `__m128i` isn't an option as there is no `_mm128i_mul_epi32()` function. That and wintel isn't the only platform with SIMD capability. It is however a useful yard-stick of compiler capability, features working here will sooner or later work on ARM and friends (if they don't already). – burito Mar 17 '22 at 06:34
  • @burito: Wintel? If you're worried about Windows, look at its calling convention (`__attribute__((ms_abi))` or vectorcall), since it has different rules for union passing than GNU C. (Without vectorcall it will never pass a 16-byte vector in an XMM reg even for a bare `v4i`). Unless you plan to use `__attribute__((sysv_abi))` in your code even when targeting Windows. But really, most of these tiny functions should inline, so it's not necessarily a showstopper. – Peter Cordes Mar 17 '22 at 15:47

It turns out the two are effectively the same. For some reason the second one also populates the xmm registers by way of the stack, but if, for example, one adds a main function...

int main(int argc, char *argv[])
{
    // volatile keyword added so they don't get optimised out   
    volatile int4 x = {.v={1,2,3,4}};
    volatile int4 y = {.v={1,2,3,4}};
    int4 z = mul(x, y);
    
    return z.v[0];
}

...then the function (or single vpmulld instruction in this case) gets inlined, and different, appropriate stack manipulation gets inserted.
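A minimal sketch of relying on that, assuming the helper is small enough for the compiler to inline (typical at -O2 and above, but not guaranteed): mark it static inline and call it from code in the same translation unit. The caller mul_arrays is hypothetical and only there to give the helper a call site:

```c
typedef int v4i __attribute__ ((vector_size (16)));

typedef union {
    v4i v;
    struct { int x, y, z, w; };
    int i[4];
} int4;

// static inline: the body is visible at every call site, so once it is
// inlined no by-value argument passing of the union takes place.
static inline int4 mul(int4 l, int4 r)
{
    return (int4){.v = l.v * r.v};
}

// Hypothetical caller: element-wise product of two arrays of int4.
// Arguments are passed by pointer, so the union never travels by value here.
void mul_arrays(int4 *dst, const int4 *a, const int4 *b, int n)
{
    for (int k = 0; k < n; k++)
        dst[k] = mul(a[k], b[k]);
}
```

Comparing the output for mul_arrays on godbolt.org with and without the extra union members is one way to check whether the extra moves disappear for a given compiler version.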

burito
  • It's not "from the stack", it's from RSI:RDI function args, just happening to use a silly strategy involving the stack. The calling convention passes a 16-byte union in integer regs unless it contains *only* a SIMD vector (type SSE or SSEUPPER in the x86-64 SysV ABI doc). – Peter Cordes Mar 17 '22 at 15:34
  • I think I should rewrite my question as "hey I don't understand what these instructions are here for". Thank you for answering the question I didn't know I was asking, and providing some very helpful links to set me on the path! – burito Mar 17 '22 at 17:15