Why does gcc compile _mm256_permute2f128_ps to a vinsertf128 instruction?

Question

This instruction is part of the assembly output of a C program (gcc -O2). From the result, I understand that ymm6 is source operand 1: all of it is copied to ymm9, and then xmm1 is copied into ymm9[255:128]. I read the Intel manual, but it uses Intel assembly syntax rather than AT&T, and I don't want to use Intel syntax. So ymm8, ymm2 and ymm6 here are SRC1. Is this true?

vshufps     $68,  %ymm0, %ymm8, %ymm6
vshufps     $68,  %ymm4, %ymm2, %ymm1
Vinsertf128 $1,  %xmm1, %ymm6, %ymm9

And the main question is why gcc changed the instruction

row0 = _mm256_permute2f128_ps(__tt0, __tt4, 0x20);

to

Vinsertf128 $1,  %xmm1, %ymm6, %ymm9

and

row4 = _mm256_permute2f128_ps(__tt0, __tt4, 0x31);

to

Vperm2f128  $49, %ymm1, %ymm6, %ymm1

How can I disable this optimization? I tried -O0, but it doesn't work.

score 4 · Answer 1 · edited May 23 '17 at 12:17

So ymm8, ymm2 and ymm6 here are SRC1. Is this true?

Yes, the middle operand is always src1 in a 3-operand instruction in both syntaxes.

AT&T: op %src2, %src1, %dest
Intel: op dest, src1, src2

I don't want to use Intel syntax

Tough. The only really good documentation I know of for exactly what every instruction does is the Intel instruction set reference manual. I used to think AT&T syntax was better, because the $ and % decorators remove ambiguity. I do like that, but otherwise prefer the Intel syntax now. The rules for each are simple enough that you can easily convert mentally, or "think" in whichever one you're reading at the moment.

Unless you're actually writing GNU C inline asm, you can just use gcc -masm=intel and objdump -Mintel to get GNU-flavoured asm using intel mnemonics, operand order, and so on. The assembler directives are still gas style, not NASM. Use http://gcc.godbolt.org/ to get nicely-formatted asm output for code with only the essential labels left in.

gcc and clang both have some understanding of what the intrinsics actually do, so internally they translate the intrinsic to some data movement. When it comes time to emit code, they see that said data movement can be done with vinsertf128, so they emit that.
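To make the "same data movement" point concrete, here is a minimal scalar sketch in plain C. It models a ymm register as eight floats and follows the lane-selection rules in Intel's reference for vperm2f128 and vinsertf128. The type v8f and the helpers model_vperm2f128, model_vinsertf128 and same_bits are hypothetical names invented for this illustration; this is not how gcc represents the operation internally.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a 256-bit register: f[0..3] is the low 128-bit lane,
   f[4..7] is the high lane. (Hypothetical type, for illustration only.) */
typedef struct { float f[8]; } v8f;

/* Pick one 128-bit lane from a 4-bit selector field of imm8:
   bit 3 zeroes the lane, bit 1 picks src2 over src1, bit 0 picks the high half. */
static void select_lane(float out[4], const v8f *src1, const v8f *src2, unsigned sel)
{
    if (sel & 0x8) {
        memset(out, 0, 4 * sizeof(float));
        return;
    }
    const v8f *src = (sel & 0x2) ? src2 : src1;
    memcpy(out, &src->f[(sel & 0x1) ? 4 : 0], 4 * sizeof(float));
}

/* Model of vperm2f128: imm8[3:0] controls the low lane, imm8[7:4] the high lane. */
static v8f model_vperm2f128(v8f src1, v8f src2, uint8_t imm8)
{
    v8f dst;
    select_lane(&dst.f[0], &src1, &src2, imm8 & 0xFu);
    select_lane(&dst.f[4], &src1, &src2, (imm8 >> 4) & 0xFu);
    return dst;
}

/* Model of vinsertf128: copy src1, then overwrite the lane chosen by imm8 bit 0. */
static v8f model_vinsertf128(v8f src1, const float xmm[4], unsigned imm8)
{
    v8f dst = src1;
    memcpy(&dst.f[(imm8 & 1u) ? 4 : 0], xmm, 4 * sizeof(float));
    return dst;
}

/* Bitwise comparison of two modeled registers. */
static int same_bits(v8f x, v8f y)
{
    return memcmp(x.f, y.f, sizeof x.f) == 0;
}

/* Self-check: imm8 = 0x20 is exactly "insert the low lane of src2 into the high lane of src1". */
static void check_0x20_matches_insert(void)
{
    v8f a = {{0, 1, 2, 3, 4, 5, 6, 7}};
    v8f b = {{10, 11, 12, 13, 14, 15, 16, 17}};
    assert(same_bits(model_vperm2f128(a, b, 0x20),
                     model_vinsertf128(a, &b.f[0], 1)));
}
```

With imm8 = 0x31 the low lane of the result comes from the high lane of src1, which a single vinsertf128 cannot express because it only reads the low 128 bits of its second source. That is consistent with gcc keeping vperm2f128 for the 0x31 case in the question.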

On some CPUs (Intel SnB-family), both instructions have equal performance, but on AMD Bulldozer-family (which only has 128b ALUs), vinsertf128 is much faster than vperm2f128. (source: see Agner Fog's guides, and other links at the x86 tag wiki). They both take 6 bytes to encode, including the immediate, so there's no code-size difference. vinsertf128 is always a better choice than a vperm2f128 that does identical data movement.

gcc and clang don't have a "literal translation of intrinsics to instructions" mode, because it would take extra work to implement. If you care exactly which instructions the compiler uses, that's what inline asm is for.
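As a rough sketch of the inline-asm route, the following GNU C function pins the vperm2f128 encoding for the 0x20 case. The function name permute2f128_0x20_forced and the pointer-based interface are assumptions made for this example, and the template is written in AT&T syntax, so it would not assemble under -masm=intel. It assumes x86-64 with gcc or clang and a CPU that supports AVX.

```c
#include <immintrin.h>

/* Hypothetical helper: same result as _mm256_permute2f128_ps(a, b, 0x20),
   but the instruction is spelled out so the compiler cannot swap in
   vinsertf128. The target attribute enables AVX for this function only. */
__attribute__((target("avx")))
void permute2f128_0x20_forced(const float *a, const float *b, float *out)
{
    __m256 va = _mm256_loadu_ps(a);   /* src1 */
    __m256 vb = _mm256_loadu_ps(b);   /* src2 */
    __m256 dst;

    /* AT&T operand order: imm8, src2, src1, dest */
    __asm__("vperm2f128 $0x20, %2, %1, %0"
            : "=x"(dst)
            : "x"(va), "x"(vb));

    _mm256_storeu_ps(out, dst);
}
```

The compiler treats the asm statement as opaque, so it cannot combine it with neighbouring shuffles or choose a cheaper encoding. That is the price of dictating the instruction.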

Keep in mind that -O0 doesn't mean "no optimization". It still has to transform through a couple internal representations before emitting asm.

score 1 · Accepted Answer · answered Apr 04 '16 at 22:57

Examination of the instructions that bind to port 5 in the instruction analysis report shows that the instructions were broadcasts and vpermilps. The broadcasts can only execute on port 5, but replacing them with 128-bit loads followed by vinsertf128 instructions reduces the pressure on port 5 because vinsertf128 can execute on port 0. (From the IACA user guide.)
