
I had seen this great answer on image conversion using __m128i, and thought I'd try AVX2 to see if I could get it any faster. The task is converting an input RGB image to RGBA (note the other question is BGRA, but that's not really a big difference...).

I can include more code if desired, but this stuff gets quite verbose and I'm stuck on something seemingly very simple. Suppose for this code that everything is 32-byte aligned, compiled with -mavx2, etc.

Given an RGB input uint8_t *source and an RGBA output uint8_t *destination, it goes something like this (this snippet only fills a quarter of the image, in stripes, since this is vector land).

#include <immintrin.h>
__m256i *src = (__m256i *) source;
__m256i *dest = (__m256i *) destination;

// for this particular image
unsigned width = 640;
unsigned height = 480;
unsigned unroll_N = (width * height) / 32; // 32 pixels per iteration: 96 bytes of RGB in, 128 bytes of RGBA out
for(unsigned idx = 0; idx < unroll_N; ++idx) {
    // Load first portion and fill all of dest[0]
    __m256i src_0 = src[0];
    __m256i tmp_0 = _mm256_shuffle_epi8(src_0,
        _mm256_set_epi8(
            0x80, 23, 22, 21,// A07 B07 G07 R07
            0x80, 20, 19, 18,// A06 B06 G06 R06
            0x80, 17, 16, 15,// A05 B05 G05 R05
            0x80, 14, 13, 12,// A04 B04 G04 R04
            0x80, 11, 10,  9,// A03 B03 G03 R03
            0x80,  8,  7,  6,// A02 B02 G02 R02
            0x80,  5,  4,  3,// A01 B01 G01 R01
            0x80,  2,  1,  0 // A00 B00 G00 R00
        )
    );

    dest[0] = tmp_0;

    // move the input / output pointers forward
    src  += 3;
    dest += 4;
}// end for

This doesn't actually work, though: stripes show up in each "quarter".

  • My understanding is that 0x80 in the mask should produce 0x00 in the output.
    • It doesn't really matter what value ends up there (it's the alpha channel; in the real code it gets OR'd with 0xff like the linked answer; a quick sketch follows this list).
  • It somehow seems to be related to rows 04 to 07: if I make them all 0x80, leaving just 00-03, the inconsistencies go away.
    • But of course, then I'm not copying everything I need to.
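For reference, the alpha fill mentioned above is just an OR after the shuffle, roughly like this (a sketch of the linked answer's approach, not the exact real code):

__m256i alpha = _mm256_set1_epi32(0xff000000); // 0xff lands in byte 3 (the A byte) of each RGBA dword
dest[0] = _mm256_or_si256(tmp_0, alpha);       // instead of dest[0] = tmp_0 above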

What am I missing here? Like is it possible I ran out of registers or something? I'd be very surprised by that...

[Image: output with both parts of the shuffle]

Using this instead:

_mm256_set_epi8(
    // 0x80, 23, 22, 21,// A07 B07 G07 R07
    // 0x80, 20, 19, 18,// A06 B06 G06 R06
    // 0x80, 17, 16, 15,// A05 B05 G05 R05
    // 0x80, 14, 13, 12,// A04 B04 G04 R04
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 11, 10,  9,// A03 B03 G03 R03
    0x80,  8,  7,  6,// A02 B02 G02 R02
    0x80,  5,  4,  3,// A01 B01 G01 R01
    0x80,  2,  1,  0 // A00 B00 G00 R00
)

[Image: output using the above shuffle instead]

svenevs
  • You do `src += 3` but you process only *one* thing per iteration, that's 2/3rds just gone – harold Oct 05 '17 at 09:45
  • Yeah, I've omitted the code that does everything else for the sake of brevity. That's what "one quarter" was supposed to mean x0 – svenevs Oct 05 '17 at 09:47
  • OK, not very clear. Anyway, `_mm256_shuffle_epi8` is not a generalization of `_mm_shuffle_epi8`, it acts like two `_mm_shuffle_epi8`'s side-by-side. So putting indexes like 16 and up is not useful. – harold Oct 05 '17 at 09:51
  • Ah! I see, yes that seems to be the real problem here. I added images but the rest of the code worked under the same (false) assumption...I will have to re-think this one then. Thanks @harold! – svenevs Oct 05 '17 at 10:01
  • Sometimes 256b vectors just aren't a win, especially if you need more than a `vpermq` at the end to correct for in-lane behaviour. AVX still helps vs. SSE4.2 because of 3-operand instructions reducing front-end bottlenecks. (Avoids a lot of MOVDQA instructions). Intel Haswell and later (i.e. Intel AVX2 CPUs) only have 1 shuffle port, but can run 2 loads and 1 store per clock, so you often bottleneck on shuffle throughput for stuff like this. Shifts or unaligned loads to replace shuffles can sometimes help. (See Intel's optimization manual, and https://stackoverflow.com/tags/x86/info) – Peter Cordes Oct 05 '17 at 10:07
  • Ugh, I'm starting to get that now. I've been benchmarking this along the way, the original `__m128i` post is only ~20 microseconds slower than my (really wrong) `__m256i` approach. Thanks for the feedback. I'll leave the question open for a day to see if anybody has an idea of a good fix, but if not I'll just ask @harold to make his comment an answer ;) – svenevs Oct 05 '17 at 10:17
  • OK how's this: load a 128b piece, then `vinserti128` the corresponding piece from the next iteration (inserting from memory does not count as a shuffle) and effectively use the SSSE3 version of the loop but with two iterations at once. Probably bottlenecked by stores though.. – harold Oct 05 '17 at 10:29 (see the sketch after these comments)
  • Good thought, though your suspicions were correct (I believe) in terms of saturation. I didn't quite get a full version working, but performance was degraded after only getting half the rewrite through. Perhaps I was wasteful with how I approached it, but I'm just going to stick with the 128bit version. Please make your comment about `_mm256_shuffle_epi8` an answer so I can accept it, as that was the root problem with my code. Thank you both for your thoughts and suggestions, this has been an informative experience! – svenevs Oct 05 '17 at 22:34
  • Related: [What is the inverse of "\_mm256\_cvtepi16\_epi32"](https://stackoverflow.com/q/49721807) re: how to work around the inconvenient design for unpack / repack. – Peter Cordes Jul 29 '23 at 19:18
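Here is a rough sketch of harold's vinserti128 idea from the comment above (my own, untested; the function name and structure are made up for illustration). Each iteration builds 8 RGBA pixels from two 16-byte loads placed 12 bytes apart, so the vpshufb never has to cross a 128-bit lane. Note the last iteration reads 4 bytes past the end of the RGB buffer, the same caveat as the SSSE3 answer.

#include <immintrin.h>
#include <stdint.h>

// Sketch only: assumes width * height is a multiple of 8 (true for 640x480)
// and that reading 4 bytes past the end of `source` is acceptable.
static void rgb_to_rgba_avx2(const uint8_t *source, uint8_t *destination,
                             unsigned width, unsigned height)
{
    // Same mask in both 128-bit lanes: expand each 3-byte RGB group to RGB0.
    const __m256i expand = _mm256_setr_epi8(
        0, 1, 2, -1,  3, 4, 5, -1,  6, 7, 8, -1,  9, 10, 11, -1,
        0, 1, 2, -1,  3, 4, 5, -1,  6, 7, 8, -1,  9, 10, 11, -1);
    const __m256i alpha = _mm256_set1_epi32(0xff000000); // A = 0xff in each dword

    unsigned pixels = width * height;
    for (unsigned i = 0; i < pixels; i += 8) {
        // Low lane: pixels i..i+3 (12 of the 16 loaded bytes are used).
        __m128i lo = _mm_loadu_si128((const __m128i *)(source + 3 * i));
        // High lane: pixels i+4..i+7, inserted straight from memory
        // (vinserti128 with a memory source is a load, not a shuffle uop).
        __m256i v = _mm256_inserti128_si256(_mm256_castsi128_si256(lo),
            _mm_loadu_si128((const __m128i *)(source + 3 * i + 12)), 1);
        v = _mm256_shuffle_epi8(v, expand);   // in-lane RGB -> RGB0
        v = _mm256_or_si256(v, alpha);        // fill alpha with 0xff
        _mm256_storeu_si256((__m256i *)(destination + 4 * i), v);
    }
}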

1 Answer


_mm256_shuffle_epi8 works like two _mm_shuffle_epi8's side by side, instead of like a more useful (but probably higher-latency) full-width shuffle that could put any byte anywhere. Here's a diagram from www.officedaytime.com/simd512e:

[Diagram: vpshufb]
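A small standalone demo of that in-lane behaviour (my addition, not part of the original answer): within each 128-bit half, vpshufb uses only bits 3:0 of each control byte as the index (bit 7 still zeroes the byte), so an index like 23 selects byte 23 & 15 = 7 of whichever lane it sits in, never byte 23 of the whole register.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    unsigned char bytes[32];
    for (int i = 0; i < 32; ++i) bytes[i] = (unsigned char)i; // byte n holds value n

    __m256i v   = _mm256_loadu_si256((const __m256i *)bytes);
    __m256i idx = _mm256_set1_epi8(23);       // every output byte asks for "byte 23"
    __m256i r   = _mm256_shuffle_epi8(v, idx);

    unsigned char out[32];
    _mm256_storeu_si256((__m256i *)out, r);
    printf("%d %d\n", out[0], out[16]);       // prints "7 23": 23 & 15 = 7 within each lane
    return 0;
}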

AVX512VBMI has new byte-granularity shuffles such as vpermb that can cross lanes, but current processors don't support that instruction set extension yet.

harold
  • `vpermb` and `vpermt2b` will probably be worse throughput as well as latency :(. `vpermt2w` on skylake-avx512 is 3 uops (p0 + 2p5), 7c latency 2c throughput. https://github.com/InstLatx64/InstLatx64 for spreadsheets from IACA. So in-lane + `vpermq` is at least as fast, and will get the job done in some cases. (e.g. for `packsswb`. It turns out `vpmovwb` is also 2 uops for p5, even though it's a 1-input shuffle that produces a half-width result. All the `vpmov` shuffles are 2 uops except `vpmovqd`, the only one with a dword destination element size and truncation instead of saturation.) – Peter Cordes Oct 06 '17 at 09:47
  • `vpermt2d/q` are single-uop 3c latency on SKX, though. In general it's only the lane-crossing small-element shuffles (including `vpermw`) that are more uops. Maybe in the far future some CPUs will implement even the most complex as single-uop, or at least have more shuffle ports, but it will probably sometimes be worth it to use more instructions but fewer uops. I realized recently that a shuffle with merge-masking gives you a limited form of 2-input shuffle, which is pretty damn cool (unless it takes an extra ALU uop to merge). I haven't yet found any great use-cases, though. – Peter Cordes Oct 06 '17 at 09:49