So ymm8, ymm2 and ymm6 here is SRC1. is this true?
Yes, the middle operand is always src1 in a 3-operand instruction in both syntaxes.
- AT&T:
op %src2, %src1, %dest
- Intel:
op dest, src1, src2
I don't want to use Intel syntax
Tough. The only really good documentation I know of for exactly what every instruction does is Intel's instruction-set reference manual, which uses Intel syntax. I used to think AT&T syntax was better, because the $ and % decorators remove ambiguity. I do like that, but otherwise prefer the Intel syntax now. The rules for each are simple enough that you can easily mentally convert, or "think" in whichever one you're reading ATM.
Unless you're actually writing GNU C inline asm, you can just use gcc -masm=intel and objdump -Mintel to get GNU-flavoured asm using Intel mnemonics, operand order, and so on. The assembler directives are still gas style, not NASM. Use http://gcc.godbolt.org/ to get nicely-formatted asm output for your code, with only the essential labels left in.
gcc and clang both have some understanding of what the intrinsics actually do, so internally they translate the intrinsic to some data movement. When it comes time to emit code, they see that said data movement can be done with vinsertf128, so they emit that.
On some CPUs (Intel SnB-family), both instructions have equal performance, but on AMD Bulldozer-family (which only has 128b ALUs), vinsertf128 is much faster than vperm2f128. (source: see Agner Fog's guides, and other links at the x86 tag wiki). They both take 6 bytes to encode, including the immediate, so there's no code-size difference. vinsertf128 is always a better choice than a vperm2f128 that does identical data movement.
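To make "identical data movement" concrete, here is a minimal sketch in plain C that models what the two instructions do to 128-bit lanes. It is not real AVX code: the types lane128 and vec256 and the model_* and same_bits names are made up for illustration, and the immediate decoding follows my reading of the Intel manual entries for vperm2f128 and vinsertf128.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model types, not real intrinsic types. */
typedef struct { uint64_t q[2]; } lane128;   /* one 128-bit lane */
typedef struct { lane128 lane[2]; } vec256;  /* lane[0] = bits 127:0, lane[1] = bits 255:128 */

/* 2-bit selector: 0/1 = src1 low/high lane, 2/3 = src2 low/high lane. */
static lane128 select_lane(vec256 src1, vec256 src2, unsigned sel)
{
    return (sel & 2) ? src2.lane[sel & 1] : src1.lane[sel & 1];
}

/* Model of vperm2f128: imm8[1:0] picks the low lane, imm8[5:4] picks the
   high lane, and imm8[3] / imm8[7] zero the low / high lane instead. */
vec256 model_vperm2f128(vec256 src1, vec256 src2, unsigned imm8)
{
    const lane128 zero = { {0, 0} };
    vec256 dst;
    dst.lane[0] = (imm8 & 0x08) ? zero : select_lane(src1, src2, imm8 & 3);
    dst.lane[1] = (imm8 & 0x80) ? zero : select_lane(src1, src2, (imm8 >> 4) & 3);
    return dst;
}

/* Model of vinsertf128: copy src1, then overwrite the lane chosen by
   imm8[0] with the 128-bit src2. */
vec256 model_vinsertf128(vec256 src1, lane128 src2, unsigned imm8)
{
    vec256 dst = src1;
    dst.lane[imm8 & 1] = src2;
    return dst;
}

/* Bitwise comparison of two modelled 256-bit values. */
int same_bits(vec256 a, vec256 b)
{
    return memcmp(&a, &b, sizeof a) == 0;
}
```

Under this model, vinsertf128 with an immediate of 1 produces the same result as vperm2f128 with 0x20 (low lane from src1, high lane from the low half of src2), and an immediate of 0 matches 0x12. Those are the cases where a compiler is free to pick either instruction.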
gcc and clang don't have a "literal translation of intrinsics to instructions" mode, because it would take extra work to implement. If you care exactly which instructions the compiler uses, that's what inline asm is for.
Keep in mind that -O0 doesn't mean "no optimization". The compiler still has to transform the code through a couple of internal representations before emitting asm.