
Consider this simple function:

#include <math.h>
void ahoj(float *a)
{
    for (int i=0; i<256; i++) a[i] = sin(a[i]);
}

Try it at https://godbolt.org/z/ynQKRb, using the following settings:

-fveclib=SVML -mfpmath=sse -ffast-math -fno-math-errno -O3 -mavx2 -fvectorize

Select x86-64 clang 7.0 (currently the newest). This is the most interesting part of the result:

vmovups ymm0, ymmword ptr [rdi]
vmovups ymm1, ymmword ptr [rdi + 32]
vmovups ymmword ptr [rsp], ymm1 # 32-byte Spill
vmovups ymm1, ymmword ptr [rdi + 64]
vmovups ymmword ptr [rsp + 32], ymm1 # 32-byte Spill
vmovups ymm1, ymmword ptr [rdi + 96]
vmovups ymmword ptr [rsp + 96], ymm1 # 32-byte Spill
call    __svml_sinf8
vmovups ymmword ptr [rsp + 64], ymm0 # 32-byte Spill
vmovups ymm0, ymmword ptr [rsp] # 32-byte Reload
call    __svml_sinf8
vmovups ymmword ptr [rsp], ymm0 # 32-byte Spill
vmovups ymm0, ymmword ptr [rsp + 32] # 32-byte Reload
call    __svml_sinf8
vmovups ymmword ptr [rsp + 32], ymm0 # 32-byte Spill
vmovups ymm0, ymmword ptr [rsp + 96] # 32-byte Reload
call    __svml_sinf8
vmovups ymm1, ymmword ptr [rsp + 64] # 32-byte Reload
vmovups ymmword ptr [rbx], ymm1
vmovups ymm1, ymmword ptr [rsp] # 32-byte Reload
vmovups ymmword ptr [rbx + 32], ymm1
vmovups ymm1, ymmword ptr [rsp + 32] # 32-byte Reload
vmovups ymmword ptr [rbx + 64], ymm1
vmovups ymmword ptr [rbx + 96], ymm0
vmovups ymm0, ymmword ptr [rbx + 128]
vmovups ymm1, ymmword ptr [rbx + 160]
vmovups ymmword ptr [rsp], ymm1 # 32-byte Spill
vmovups ymm1, ymmword ptr [rbx + 192]
vmovups ymmword ptr [rsp + 32], ymm1 # 32-byte Spill
vmovups ymm1, ymmword ptr [rbx + 224]
vmovups ymmword ptr [rsp + 96], ymm1 # 32-byte Spill
call    __svml_sinf8
vmovups ymmword ptr [rsp + 64], ymm0 # 32-byte Spill
vmovups ymm0, ymmword ptr [rsp] # 32-byte Reload
call    __svml_sinf8
vmovups ymmword ptr [rsp], ymm0 # 32-byte Spill
vmovups ymm0, ymmword ptr [rsp + 32] # 32-byte Reload
call    __svml_sinf8
vmovups ymmword ptr [rsp + 32], ymm0 # 32-byte Spill
vmovups ymm0, ymmword ptr [rsp + 96] # 32-byte Reload
call    __svml_sinf8
...

It avoids looping entirely and instead emits straight-line code for all 256 items, i.e. 32 back-to-back calls to __svml_sinf8. Could this really be the optimal solution, considering the code cache? When using -mavx512f, it fully expands even 1024 items :).

Another problem is that, with this option, current clang sometimes generates AVX512 code even when the target is AVX2, making it basically unusable.

Z boson
Vojtěch Melda Meluzín
  • clang might be a bit too aggressive in unrolling sometimes. Have you tried using `-march=skylake` to set *tuning* options for a specific CPU instead of generic? `-mavx2` doesn't affect tuning options, like how much unrolling is worth it. – Peter Cordes Sep 29 '18 at 04:10
  • By far the more important issue is generating AVX512 code that faults when you only specified `-mavx2`. But this isn't a [mcve] for that. Sounds like a separate question, really. Or the more important part of this question, since from your last question you seem to be most interested in getting clang to auto-vectorize `sinf()`, and that's key to doing so in a usable way. Where are the AVX512 instructions? In clang-generated code, or in Intel SVML library functions it calls like you commented on your last question? – Peter Cordes Sep 29 '18 at 04:12
  • Will -march=skylake make it work on all AVX2-capable CPUs? As for the AVX512 problem, I was looking at it, but it seems somehow problematic to track down: it crashed in some complex routine under several layers, and without the debug info I wasn't really able to tell what caused it, and creating a minimal example is virtually impossible. All I know is that this works with MSVC and CLANG with any settings; only when I specify this SVML thingy does it work on an AVX512 machine but not on an AVX2 machine. – Vojtěch Melda Meluzín Sep 29 '18 at 10:26
  • No, AMD Excavator APUs have AVX2 but not BMI2. But all AVX2 CPUs have FMA3, I think, so if you have any floating point code that can be a big speedup. (Some AVX1-only CPUs (like Piledriver) have FMA3 as well.) You might want to use `-mavx2 -mfma -mbmi1 -mpopcnt -mtune=haswell`, or whatever. Or maybe `-march=sandybridge -mavx2 -mfma -mtune=haswell -mbmi1` in case there's anything I'm forgetting other than popcnt. I'm just saying it's worth investigating tuning options. And beware that a simulated x86 might have different combinations of stuff than any real hardware. – Peter Cordes Sep 29 '18 at 10:33
  • I'm a bit confused by this one, since I'm no expert in all these Haswells and Sandy Bridges and stuff :). In any case I need to make sure the code works on ALL AVX2-capable machines, so I'd be a bit worried using all these options. Anyways I tried them just for fun on Godbolt and got "compilation failed", I'm afraid. And just using -mtune didn't really change the way the compiler avoids the loop. It's not a big deal, I was more just checking whether there isn't something I just don't get about CPU architecture that would make this weird stuff actually beneficial :). – Vojtěch Melda Meluzín Sep 29 '18 at 12:02
  • Oops, the option for BMI1 is `-mbmi` not `-mbmi1`. Anyway, for your Mac binaries, you can assume Intel CPUs. There's even a fat-binary option that lets you make baseline and `-march=haswell` versions of your code. On Windows, you do have to consider the possibility of AMD CPUs, but you might still want to *tune* for Intel. `-mavx2 -mfma -mbmi -mpopcnt -mtune=haswell` should take advantage of the useful features that all AVX2 CPUs have; clang won't use PCLMUL or AES on its own. You can and should check the CPUID feature bits for all those features as well as AVX2 before running the function. – Peter Cordes Sep 29 '18 at 12:22
  • @PeterCordes The VIA Isaiah C4650 has AVX2 but not FMA3. Breaks a lot of programs that assume FMA3 in the presence of AVX2. – Mysticial Oct 01 '18 at 15:35
  • Btw, I spoke to one of the VIA architects at Hot Chips about it. And he was pissed that they allowed that to happen. IIRC, he hinted that they should've either turned off the CPUID for AVX2 or microcoded the FMA. – Mysticial Oct 01 '18 at 15:42
  • Really??? Holy crap... fortunately these chips are not really widespread, and I need to support only OSX and Windows and only "normal" computers... But still. – Vojtěch Melda Meluzín Oct 01 '18 at 20:08
  • @Mysticial: Neat, I didn't realize Via was still updating their design. Agner Fog doesn't have data for it. I was assuming the OP had a fallback version in case any of the necessary CPUID feature bits weren't set. (You normally need a fallback because AVX2 is new enough, and not available on even Skylake Pentium/Celeron). Probably better to get bigger speedups on Haswell+ with FMA than to be able to use the AVX2 version on Isaiah chips, unless worst-case speed on Via is critical. But thanks for that fun fact, I didn't know there were any physical CPUs where that was true. – Peter Cordes Oct 01 '18 at 21:45
  • @PeterCordes Assuming what the VIA architect said is true, the next VIA chip is taping out now. And he was talking about how they had to jump through hoops to get the AVX512-VNNI spec. It's already implemented in the hardware, it's just a matter of deciding which CPUID bits to turn on as they verify that their implementation matches what it should be. He wasn't clear if the FMA is going to be microcoded. Anyway, next VIA should be able to run everything, but not necessarily efficiently. – Mysticial Oct 01 '18 at 23:22
  • Can you provide the code that generated AVX512 instructions without AVX512 enabled? – Z boson Oct 22 '18 at 12:21
  • I'm afraid I can't, it's a gigantic thing and I stopped using SVML completely. – Vojtěch Melda Meluzín Oct 23 '18 at 08:44
  • @VojtěchMeldaMeluzín, what are you using instead of SVML now? – Z boson Oct 29 '18 at 08:09
  • Nothing really, but I'm optimizing all vectorizable stuff manually using the stuff from VectorLib. – Vojtěch Melda Meluzín Oct 30 '18 at 09:03
  • Correction to my earlier comment: AMD Excavator has BMI2. Agner Fog's instruction tables omit BMI2 instructions for Excavator, but he missed instructions for other recent AMD CPUs as well. http://users.atw.hu/instlatx64/AuthenticAMD/AuthenticAMD0660F01_K15_Carrizo_InstLatX64.txt has Excavator cores, and has timings for BMI2 instructions including `pext` and `pdep` (very slow before Zen 3.) See also [Do all CPUs that support AVX2 also support BMI2 or popcnt?](https://stackoverflow.com/q/76428057) – Peter Cordes Jun 08 '23 at 02:51
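Spelling out the tuning advice from the comments as an actual compiler invocation might look like the following. This is only a sketch: the flag set follows Peter Cordes's suggestion (note `-mbmi`, not `-mbmi1`), the file names are hypothetical, and the SVML library itself still has to be linked in separately.

```shell
# Target the feature baseline shared by (nearly) all AVX2 CPUs,
# but tune for Haswell rather than for generic:
clang -O3 -ffast-math -fno-math-errno -fveclib=SVML \
      -mavx2 -mfma -mbmi -mpopcnt -mtune=haswell \
      -c ahoj.c -o ahoj_avx2.o
```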

0 Answers