
Assuming you have a list of permutation indices known at compile time, I am trying to understand the best instruction selection for x86_64.

I understand most of Agner Fog's optimization choices, but there is one case I am having trouble with.

Given a permutation that can be implemented either as:

    _mm256_permutevar8x32_epi32(r, _mm256_set_epi32(/* indices */));

or

    __m256i tmp = _mm256_permute4x64_epi64(r, /* some mask */);
    return _mm256_shuffle_epi32(tmp, /* another mask */);

I don't see why the first option would ever be better.

Take the example of the permutation list 7, 6, 5, 4, 3, 2, 1, 0 (reversing the epi32 elements):

    __m256i
    load_perm(__m256i r) {
        // clang
        // 1 uop vmovaps (y, m) p23
        // 1 uop vpermps (y, y, y) p5

        // gcc
        // 1 uop vmovdqa (y, m) p23
        // 1 uop vpermd (y, y, y) p5
        return _mm256_permutevar8x32_epi32(r, _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7));
    }

    __m256i
    perm_shuf(__m256i r) {
        // clang
        // 1 uop vmovaps (y, m) p23
        // 1 uop vpermps (y, y, y) p5

        // gcc
        // 1 uop vpermq (y, y, i) p5
        // 1 uop vpshufd (y, y, i) p5
        __m256i tmp = _mm256_permute4x64_epi64(r, 0x4e);
        return _mm256_shuffle_epi32(tmp, 0x1b);
    }

Both options require 2 uops, and given that there is a dependency between the two instructions, I don't think the ports really matter. The only difference I see, then, is that the first option adds an extra 32 bytes of .rodata.

Can anyone help me understand why Clang (and I guess Agner Fog) prefer the first option to the second?

Here is a Godbolt link with the compilation output for Skylake.

Noah

1 Answer


For load_perm, clang seems to like to turn things into ps form. This saves code-size for legacy-SSE encoding (where SSE1 instructions have fewer prefixes). But not with VEX encodings, so there's no upside. Just clang's shuffle optimizer apparently not knowing or caring to preserve the integer vs. FP domain distinction. Which I think is fine for shuffles on current CPUs.

For perm_shuf, this is definitely clang's shuffle optimizer doing its job. Other compilers are less good at treating shuffle intrinsics the same way as they treat + and * operators: as ways to specify the desired result without necessarily specifying how you get there. e.g. x * y doesn't have to compile to imul for x86, and the choice can depend on surrounding code.
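
For instance, a constant multiply usually doesn't compile to `imul` at all. A tiny standalone example of that principle (my illustration, not from the original post):

    // gcc/clang at -O2 compile this to a single lea eax, [rdi + rdi*8],
    // not imul: the operator specifies the result, not the instruction.
    int mul9(int x) { return x * 9; }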

Most SIMD code runs in loops, so it's not a bad assumption that a shuffle constant will stay hot in cache and get used multiple times. Especially if this inlines and the shuffle vector can get hoisted. But even if not, it can be worth loading a constant. One shuffle is better than 2 for latency of the critical path from m input to return value, as well as port-5 uops on Intel CPUs (typically limited to 1 shuffle per clock from Haswell onward, until Ice Lake.)
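
For illustration, a minimal sketch of the hoisted case (the `reverse_all` loop is a hypothetical example, not code from the question): the load of the index vector is loop-invariant, so it executes once, leaving one shuffle uop per iteration.

    #include <immintrin.h>
    #include <stddef.h>

    // Hypothetical: reverse the dwords of every vector in an array.
    // The vmovdqa/vmovaps load of idx from .rodata gets hoisted out of
    // the loop, so each iteration costs just 1 shuffle uop (vpermd, p5).
    void reverse_all(__m256i *v, size_t n) {
        const __m256i idx = _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        for (size_t i = 0; i < n; i++)
            v[i] = _mm256_permutevar8x32_epi32(v[i], idx);
    }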

BTW, `m` is a really poor choice of variable name: it arrives in a register, and you're using `m` in your comments to talk about memory constants.

Peter Cordes
  • Good call, changed ```m``` to ```r```. But is the load not also on the critical path? According to Agner Fog's instruction tables, ```vmovaps``` has the same latency as ```vpshufd``` and ```vpermq```, so I don't see how the 32 bytes of .rodata (and executable bloat) are made up for. Also, what do you mean by "shuffle vector can get hoisted"? Do you mean reusing the same register it was loaded into multiple times? – Noah Nov 05 '20 at 02:28
  • @Noah: No, the load address is effectively an immediate constant, available as part of decoding the instruction. (RIP-relative or 32-bit absolute depending on mode.) The load uop can already be executed before `v` (the data being shuffled) is ready, so only the `vpermd` latency is part of the critical path *from `v` to result*. Of course if an I-cache miss or something happened when calling this function, the load couldn't have started ahead of time, and/or the vector data load could miss in cache. – Peter Cordes Nov 05 '20 at 02:33
  • @Noah: Obviously it's much better if we're talking about a loop where the constant load can get hoisted. `vmovaps` from `.rodata` outside the loop, `vpermps` inside the loop. Then you only have 1 total uop for the shuffle, and any risk of cache miss is amortized over the number of loop iterations. https://en.wikipedia.org/wiki/Loop-invariant_code_motion#Benefits And BTW, if you're only loading once, you can compress the shuffle constant to 8 bytes by loading it with `vpmovzxbd`. – Peter Cordes Nov 05 '20 at 02:35
  • Ah, I see. As a side note, I am wondering why Agner Fog chooses to do ```_mm256_shufflelo_epi16``` and ```_mm256_shufflehi_epi16``` before ```_mm256_shuffle_epi8``` in his vector library [here](https://github.com/vectorclass/version2/blob/master/vectori256.h#L4102). Any ideas? e.g. he prioritizes the case where you hit both over a single ```shuffle_epi8```. – Noah Nov 05 '20 at 02:42
  • @Noah: IDK, that's probably a worse choice in many common cases, unless register pressure is a problem for keeping an extra shuffle constant alive throughout a loop. Clang can in theory inform its choices based on surrounding code, while VCL's template metaprogramming can't. Also, avoiding memory constants is nice, and older CPUs used to have better shuffle throughput, so it's not totally unreasonable to make that choice when widening the 128-bit template logic to 256-bit. (`_mm256_shufflelo_epi16` requires AVX2, so Intel Haswell, which means 1/clock throughput, unlike `pshufl/hw` on SnB.) – Peter Cordes Nov 05 '20 at 02:50
  • So is it fair to say that if it's on a hot code path ```vpmov``` + ```perm``` is better, and if it's on a cold code path ```perm``` + ```shuffle``` is better? – Noah Nov 05 '20 at 04:37
  • @Noah: Yeah, that's pretty reasonable. Although you might count a loop as "hot" even if it doesn't actually run often, especially if the iteration count is high the few times it does get entered, so you're amortizing the load over lots of uses. If code is truly cold, you often shouldn't vectorize it in the first place, or do so in a more compact way (like 128-bit vectors) if that's only a few % slower for the tiny fraction of time it spends running and it makes the binary smaller. – Peter Cordes Nov 05 '20 at 04:46
  • One last question (that has strayed far from the original): how do you weigh something like ```vpmov``` + ```vpermt2w``` vs ```vpblendw``` + ```vpshufd``` (blend_epi16 -> shuffle_epi32), given that ```vpermt2w``` has the same number of uops as ```vpblendw``` + ```vpshufd``` but really high latency? – Noah Nov 05 '20 at 05:31
  • @Noah: `vpermt2w` is 3 uops on SKX (p015 + 2p5) and Ice Lake, vs. 2 total for your 2nd way so that's clearly better. Also, if AVX512 is an option, you can blend on any port, avoiding p5-only `vpblendw`. e.g. use `vmovdqu16` with merge-masking or [`vpblendmw`](https://www.felixcloutier.com/x86/vpblendmb:vpblendmw), with a blend bitmap in a `k` register. (Unfortunately that takes extra instructions to set up, so only worth it if you can hoist the `k` setup out of a loop, instead of using an immediate blend). – Peter Cordes Nov 05 '20 at 05:44
  • Depending on your shuffle, can you do a `vpermw ymm0{k1}, ymm1, ymm2` to blend into the destination? Unfortunately `vpermw` is still 2 uops on Ice Lake, despite `vpermb` being 1. Also, `vpmov` what? `vpmovzxbw ymm, [RIP + constant]`? That's 2 uops not micro-fused on Intel CPUs. – Peter Cordes Nov 05 '20 at 05:45
  • Sorry for the continued extra questions, but if you have time: what about something like ```vpermq``` + ```vprolq``` (assuming it's possible for the given permutation) vs ```vpmovd``` + ```vpermd```? And one last thing: assuming the next instruction has a dependency on the result of the permutation, would there be any difference between the orders of the ```vpermq``` and ```vprolq``` (or whatever shuffle instruction was paired with ```vpermq```)? – Noah Nov 05 '20 at 17:26
  • @Noah: `vpmovd` isn't an instruction. Are you talking about a `vmovdqa` or `vpmovzxbd` load of a shuffle control vector? Re: order: the only difference I can think of is a chance of bypass-forwarding delay if the input is coming from an FP instruction, or output is going to an FP instruction. See Intel's optimization manual for Skylake bypass latency between ports. I think between SIMD-integer instructions you'd be fine. (See also [Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says](https://stackoverflow.com/q/64116679)) – Peter Cordes Nov 05 '20 at 21:24
  • Sorry, ```vpmovd``` -> ```vmovdqa```. So ```vpermq``` + ```vprolq``` vs ```vmovdqa``` + ```vpermd```. – Noah Nov 05 '20 at 21:29
  • @Noah: oh, so `vprolq` IIRC can run on ports other than port 5. It has better throughput than 2 shuffles, and only slightly worse latency than one `vpermd`. You'd still use `vpermd` inside most loops, of course. Check https://uops.info/ and consider the use-case. – Peter Cordes Nov 05 '20 at 21:32
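
For illustration, the `vpmovzxbd` constant-compression trick from the comments might look like this (a sketch; the `rev_idx` array and `reverse_dwords` wrapper are hypothetical names):

    #include <immintrin.h>
    #include <stdint.h>

    // 8 bytes of .rodata instead of 32: the dword indices stored as bytes.
    static const uint8_t rev_idx[8] = {7, 6, 5, 4, 3, 2, 1, 0};

    __m256i reverse_dwords(__m256i r) {
        // Compilers can fold this into vpmovzxbd ymm, qword [rip + rev_idx]:
        // load 8 bytes, zero-extend each to a 32-bit lane (2 uops on Intel,
        // not micro-fused, per the comments above).
        __m256i idx = _mm256_cvtepu8_epi32(_mm_loadl_epi64((const __m128i *)rev_idx));
        return _mm256_permutevar8x32_epi32(r, idx);
    }

And a sketch of the mask-register blend (`vpblendmw`) suggested above, which can run on any vector ALU port instead of competing for p5 like immediate `vpblendw` (requires AVX512BW+VL; the 0xAAAA pattern and function name are illustrative):

    #include <immintrin.h>

    // Blend 16-bit elements under a k mask. Setting up k costs extra
    // instructions (mov + kmovw), so this wins when it can be hoisted
    // out of a loop instead of using an immediate blend.
    __m256i blend_odd_words(__m256i a, __m256i b) {
        __mmask16 k = 0xAAAA;                      // illustrative: take odd words from b
        return _mm256_mask_blend_epi16(k, a, b);   // vpblendmw
    }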