
I have seen a lot of assembly using AVX (all three flavors), and in every case I have seen, the more the code concentrates on one kind of instruction, the better it performs. But take, for example, loading a value into a 32-bit register and then broadcasting it into all lanes for AVX512 or AVX2: isn't it better/faster to just load a register and then permute?

In particular, is there a penalty for mixing x86_64 and the AVX family of instructions? I know there is such a penalty for SSE and AVX mixing in general.

I know I could test this, but I'd rather poke the knowledge of the masses before I take on that mini-project. I am sure someone already knows this.

Peter Cordes
JLV
  • How could you *not* "mix" the instructions? The AVX instructions only work on AVX registers and their operations. Everything else (most importantly control flow but also just about anything else) can not be done using AVX instructions. – Some programmer dude Feb 19 '18 at 07:57
  • The question is whether there is a penalty. In other words, is it worth spending the time to reduce the x86_64 instructions in my critical path or not? – JLV Feb 19 '18 at 08:03
  • Unless you need to move data out of the AVX registers (or into such registers) then there should be no penalty. And as instructions like these are often used in loops, being able to "mix" with control-flow instructions and other non-AVX instructions (to increase/decrease and compare against non-AVX values) is kind of a requirement. – Some programmer dude Feb 19 '18 at 08:36
  • And the only way to make sure is to *measure!* And even then you should not fall into the trap of premature optimization. Unless you want to write it yourself in assembly language, let the higher-level compiler (with optimization enabled) handle it all. Use compiler [intrinsic functions](https://en.wikipedia.org/wiki/Intrinsic_function) if you need AVX specifically and the compiler doesn't generate such code automatically. And if you need high parallelism and SIMD-like instructions, consider using a language specialized for that. – Some programmer dude Feb 19 '18 at 08:39
  • There is no penalty on the level of the SSE/AVX mode switch (which btw is removed in Skylake). Nevertheless you should broadcast-from-memory where possible, it's "free" (no worse than a normal load) while anything else is not. – harold Feb 19 '18 at 09:49
  • Related: [What is the penalty of mixing EVEX and VEX encoded scheme?](https://stackoverflow.com/questions/46080327/what-is-the-penalty-of-mixing-evex-and-vex-encoded-scheme): answer: no penalty. – Peter Cordes Feb 19 '18 at 15:24
  • You had tagged this with `[compiler-optimization]`, but from the text it sounds like you're talking about hand-written asm, not intrinsics. Anyway, go read http://agner.org/optimize/ to learn more about tuning asm for modern CPUs. (Also other performance links in the [x86 tag wiki](https://stackoverflow.com/tags/x86/info).) `vmovd` / `vmovq` aren't free, but they're not "special" or extra expensive, and have nice low latency on Intel CPUs. `vpextrd` / `vpinsrd` between integer and xmm costs a `movd` + a shuffle, so avoid when possible. `vpbroadcastd` is by far the best if data starts in mem – Peter Cordes Feb 19 '18 at 15:32
  • @PeterCordes: Thank you for the references. I am using intrinsics, I thought vpbroadcastd would be the best, but the compiler turns my intrinsic into "move reg32, mem" followed by "vpbroadcastd [reg256 or reg512] reg32"...I don't know how to tell it to just do the broadcast and skip the loading into a x86 or x64 register. – JLV Feb 19 '18 at 19:19
  • AVX512 is still pretty new; sounds like a missed-optimization. That's just a bug that should be fixed in the compiler source, not something you can control with an option. (There aren't CPUs where mem->integer->zmm is better, so there'd be no point having an option to ask for that). If you're using an old compiler version, maybe it already has been fixed. e.g. gcc 6.4 or 7.3 are current. Are you compiling with optimization enabled, though? clang or gcc `-O3 -march=skylake-avx512`? Can you link a test-case on http://gcc.godbolt.org/? – Peter Cordes Feb 19 '18 at 19:30
  • And BTW, only AVX512 has `vpbroadcastd` with a GP-register source. AVX2 has broadcast with an XMM register source, while AVX1 *only* has `vbroadcastss/sd` with a memory source. Notice how the register-source version of [`vbroadcastss`](https://github.com/HJLebbink/asm-dude/wiki/VBROADCAST) is AVX2-only. – Peter Cordes Feb 19 '18 at 19:32
  • The case I am referring to is AVX512, I assumed the same would be true of my cases for AVX2, but I did not check the disassembly in those...had enough trouble dealing with avx512. I am using ICC 2017. And yes, I am compiling with -O3 and several other optimization goodies. – JLV Feb 19 '18 at 20:00
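
For anyone who wants to compare the two broadcast forms discussed in the comments, here is a minimal sketch using AVX2 intrinsics. The function names (`broadcast_from_mem`, `broadcast_via_xmm`, `check_broadcast`) are made up for illustration, and the `target("avx2")` attribute and `__builtin_cpu_supports` are GCC/Clang extensions, so this assumes one of those compilers. Which instructions actually come out depends on the compiler and its version; the comments in the code describe what one would hope for, not a guarantee.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: broadcast a 32-bit value that lives in memory.
   The hope is that an optimizing compiler folds the load into a single
   memory-source vpbroadcastd; that is not guaranteed. */
__attribute__((target("avx2")))
static void broadcast_from_mem(const int32_t *src, int32_t out[8])
{
    __m256i v = _mm256_set1_epi32(*src);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Hypothetical helper: same result, but spelled as an explicit move into
   an XMM register followed by the AVX2 register-source broadcast. */
__attribute__((target("avx2")))
static void broadcast_via_xmm(const int32_t *src, int32_t out[8])
{
    __m128i lo = _mm_cvtsi32_si128(*src);      /* value in lane 0, rest zero */
    __m256i v  = _mm256_broadcastd_epi32(lo);  /* copy lane 0 to all 8 lanes */
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Returns 1 if both variants fill all 8 lanes with `value`,
   0 on a mismatch, and -1 if the CPU has no AVX2. */
int check_broadcast(int32_t value)
{
    int32_t a[8], b[8];
    if (!__builtin_cpu_supports("avx2"))
        return -1;
    broadcast_from_mem(&value, a);
    broadcast_via_xmm(&value, b);
    for (int i = 0; i < 8; ++i)
        if (a[i] != value || b[i] != value)
            return 0;
    return 1;
}
```

Both functions compute the same thing, so the interesting part is the disassembly: build with optimization enabled and check whether the value takes a detour through a general-purpose register before the broadcast.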
