Assuming the permutation indices are known at compile time, I am trying to understand the best instruction selection for x86_64.
I understand most of Agner Fog's optimization choices but there is one case I am having trouble understanding.
Given a permutation that can be implemented as either:
_mm256_permutevar8x32_epi32(r, _mm256_set_epi32(/* indices */));
or
__m256i tmp = _mm256_permute4x64_epi64(r, /* some mask */);
return _mm256_shuffle_epi32(tmp, /* another mask */);
I don't see why the first option would ever be better.
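To make "can be implemented as either" concrete, here is a minimal sketch of how I would check whether a compile-time index list fits the second form at all. It is a brute-force scalar model, not how any compiler actually selects instructions, and the names `compose` and `find_permq_shufd` are hypothetical helpers of mine. The convention assumed here is that `idx[i]` is the source element that ends up in destination element `i`, with element 0 listed first.

```c
#include <assert.h>
#include <stdint.h>

/* Model vpermq(imm_q) followed by vpshufd(imm_s) applied to the identity
 * 0..7: out[i] is the source dword index that lands in position i. */
void compose(unsigned imm_q, unsigned imm_s, uint8_t out[8]) {
    assert(imm_q < 256 && imm_s < 256);
    for (int i = 0; i < 8; ++i) {
        /* vpshufd: 2-bit selector for this position, same for both lanes */
        unsigned sel = (imm_s >> (2 * (i & 3))) & 3;
        /* dword position read from the vpermq result (stays in its lane) */
        unsigned k = (unsigned)(i & 4) | sel;
        /* vpermq: source qword chosen for qword slot k / 2 */
        unsigned q = (imm_q >> (2 * (k >> 1))) & 3;
        out[i] = (uint8_t)(2 * q + (k & 1));
    }
}

/* Try every pair of immediates. Returns 1 and stores the first matching
 * pair if idx can be done as vpermq + vpshufd, otherwise returns 0. */
int find_permq_shufd(const uint8_t idx[8], unsigned *imm_q, unsigned *imm_s) {
    for (unsigned q = 0; q < 256; ++q) {
        for (unsigned s = 0; s < 256; ++s) {
            uint8_t got[8];
            int ok = 1;
            compose(q, s, got);
            for (int i = 0; i < 8; ++i)
                ok &= (got[i] == idx[i]);
            if (ok) {
                *imm_q = q;
                *imm_s = s;
                return 1;
            }
        }
    }
    return 0;
}
```

A permutation can have more than one valid pair of immediates, so the pair this search returns need not be the one a particular compiler emits. Permutations that need three different source qwords inside one 128-bit lane (a rotate by one element, for instance) have no such decomposition, and only the first form applies to them.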
Take, for example, the permutation list 7, 6, 5, 4, 3, 2, 1, 0 (reversing the epi32 elements):
__m256i
load_perm(__m256i r) {
    // clang
    // 1 uop vmovaps (y, m) p23
    // 1 uop vpermps (y, y, y) p5
    // gcc
    // 1 uop vmovdqa (y, m) p23
    // 1 uop vpermd (y, y, y) p5
    return _mm256_permutevar8x32_epi32(r, _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7));
}
__m256i
perm_shuf(__m256i r) {
    // clang
    // 1 uop vmovaps (y, m) p23
    // 1 uop vpermps (y, y, y) p5
    // gcc
    // 1 uop vpermq (y, y, i) p5
    // 1 uop vpshufd (y, y, i) p5
    __m256i tmp = _mm256_permute4x64_epi64(r, 0x4e);
    return _mm256_shuffle_epi32(tmp, 0x1b);
}
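As a sanity check that the two functions really compute the same reversal, here is a small scalar sketch of the three instructions as I understand their documented semantics. The `model_*` functions and `both_forms_reverse` are hypothetical names, and this models only the data movement, not timing or ports. It does not use the intrinsics, so it builds without AVX2.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* vpermd: each destination dword picks any source dword by index. */
void model_vpermd(uint32_t dst[8], const uint32_t src[8], const uint32_t idx[8]) {
    for (int i = 0; i < 8; ++i)
        dst[i] = src[idx[i] & 7];
}

/* vpermq: qword j of dst is qword ((imm >> 2j) & 3) of src. */
void model_vpermq(uint32_t dst[8], const uint32_t src[8], unsigned imm) {
    assert(imm < 256);
    for (int j = 0; j < 4; ++j) {
        unsigned q = (imm >> (2 * j)) & 3;
        dst[2 * j] = src[2 * q];
        dst[2 * j + 1] = src[2 * q + 1];
    }
}

/* vpshufd: within each 128-bit lane, dword i takes dword ((imm >> 2i) & 3)
 * of the same lane; both lanes use the same immediate. */
void model_vpshufd(uint32_t dst[8], const uint32_t src[8], unsigned imm) {
    assert(imm < 256);
    for (int i = 0; i < 8; ++i) {
        unsigned sel = (imm >> (2 * (i & 3))) & 3;
        dst[i] = src[(unsigned)(i & 4) | sel];
    }
}

/* Returns 1 if load_perm and perm_shuf, as modelled, both reverse r. */
int both_forms_reverse(void) {
    const uint32_t r[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    /* _mm256_set_epi32(0, 1, ..., 7) lists elements from highest to lowest,
     * so element 0 of the index vector is 7. */
    const uint32_t idx[8] = {7, 6, 5, 4, 3, 2, 1, 0};
    uint32_t a[8], tmp[8], b[8];

    model_vpermd(a, r, idx);      /* load_perm */
    model_vpermq(tmp, r, 0x4e);   /* perm_shuf: swap the 128-bit halves */
    model_vpshufd(b, tmp, 0x1b);  /* perm_shuf: reverse within each lane */

    for (int i = 0; i < 8; ++i)
        if (a[i] != r[7 - i])
            return 0;
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling `both_forms_reverse()` from a `main` returns 1 under this model: 0x4e swaps the two 128-bit halves and 0x1b reverses the four dwords inside each half.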
Both options require 2 uops, and given that there is a dependency between the two instructions, I don't think the ports really matter. The only difference I see is that the first option adds an extra 32 bytes of .rodata.
Can anyone help me understand why Clang (and, I guess, Agner Fog) prefers the first option to the second?
Here is a Godbolt link with the compilation output for Skylake.