
I had seen this great answer on image conversion using __m128i, and thought I'd try AVX2 to see if I could get it any faster. The task is converting an input RGB image to RGBA (note the other question is BGRA, but that's not really a big difference...).

I can include more code if desired, but this stuff gets quite verbose and I'm stuck on something seemingly very simple. Suppose for this code that everything is 32-byte aligned, compiled with -mavx2, etc.

Given an RGB input uint8_t *source and an RGBA output uint8_t *destination, it goes something like this (this snippet only fills a quarter of the image, in stripes, since this is vector land).

#include <immintrin.h>
__m256i *src = (__m256i *) source;
__m256i *dest = (__m256i *) destination;

// for this particular image
unsigned width = 640;
unsigned height = 480;
unsigned unroll_N = (width * height) / 32; // 32 pixels per iteration: 96 bytes of RGB in, 128 bytes of RGBA out
for(unsigned idx = 0; idx < unroll_N; ++idx) {
    // Load first portion and fill all of dest[0]
    __m256i src_0 = src[0];
    __m256i tmp_0 = _mm256_shuffle_epi8(src_0,
        _mm256_set_epi8(
            0x80, 23, 22, 21,// A07 B07 G07 R07
            0x80, 20, 19, 18,// A06 B06 G06 R06
            0x80, 17, 16, 15,// A05 B05 G05 R05
            0x80, 14, 13, 12,// A04 B04 G04 R04
            0x80, 11, 10,  9,// A03 B03 G03 R03
            0x80,  8,  7,  6,// A02 B02 G02 R02
            0x80,  5,  4,  3,// A01 B01 G01 R01
            0x80,  2,  1,  0 // A00 B00 G00 R00
        )
    );

    dest[0] = tmp_0;

    // move the input / output pointers forward
    src  += 3;
    dest += 4;
}// end for

This doesn't actually work, though: stripes show up in each "quarter".

  • My understanding is that 0x80 in the mask should produce 0x00 in the output.
    • It doesn't really matter what value ends up there (it's the alpha channel; in the real code it gets OR'd with 0xff like the linked answer; a quick sketch follows this list).
  • It somehow seems to be related to rows 04 to 07: if I make them all 0x80, leaving just 00-03, the inconsistencies go away.
    • But of course, then I'm not copying everything I need to.
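For reference, the alpha fill mentioned above is just an OR after the shuffle, roughly like this (a sketch of the linked answer's approach, not the exact real code):

__m256i alpha = _mm256_set1_epi32(0xff000000); // 0xff lands in byte 3 (the A byte) of each RGBA dword
dest[0] = _mm256_or_si256(tmp_0, alpha);       // instead of dest[0] = tmp_0 above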

What am I missing here? Like is it possible I ran out of registers or something? I'd be very surprised by that...

[Image: output with both parts of the shuffle]

Using this instead:

_mm256_set_epi8(
    // 0x80, 23, 22, 21,// A07 B07 G07 R07
    // 0x80, 20, 19, 18,// A06 B06 G06 R06
    // 0x80, 17, 16, 15,// A05 B05 G05 R05
    // 0x80, 14, 13, 12,// A04 B04 G04 R04
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 11, 10,  9,// A03 B03 G03 R03
    0x80,  8,  7,  6,// A02 B02 G02 R02
    0x80,  5,  4,  3,// A01 B01 G01 R01
    0x80,  2,  1,  0 // A00 B00 G00 R00
)

[Image: output using the above shuffle instead]

svenevs
  • You do `src += 3` but you process only *one* thing per iteration, that's 2/3rds just gone – harold Oct 05 '17 at 09:45
  • Yeah, I've omitted the code that does everything else for the sake of brevity. That's what "one quarter" was supposed to mean x0 – svenevs Oct 05 '17 at 09:47
  • OK, not very clear. Anyway, `_mm256_shuffle_epi8` is not a generalization of `_mm_shuffle_epi8`, it acts like two `_mm_shuffle_epi8`'s side-by-side. So putting indexes like 16 and up is not useful. – harold Oct 05 '17 at 09:51
  • Ah! I see, yes that seems to be the real problem here. I added images but the rest of the code worked under the same (false) assumption...I will have to re-think this one then. Thanks @harold! – svenevs Oct 05 '17 at 10:01
  • Sometimes 256b vectors just aren't a win, especially if you need more than a `vpermq` at the end to correct for in-lane behaviour. AVX still helps vs. SSE4.2 because of 3-operand instructions reducing front-end bottlenecks. (Avoids a lot of MOVDQA instructions). Intel Haswell and later (i.e. Intel AVX2 CPUs) only have 1 shuffle port, but can run 2 loads and 1 store per clock, so you often bottleneck on shuffle throughput for stuff like this. Shifts or unaligned loads to replace shuffles can sometimes help. (See Intel's optimization manual, and https://stackoverflow.com/tags/x86/info) – Peter Cordes Oct 05 '17 at 10:07
  • Ugh, I'm starting to get that now. I've been benchmarking this along the way, the original `__m128i` post is only ~20 microseconds slower than my (really wrong) `__m256i` approach. Thanks for the feedback. I'll leave the question open for a day to see if anybody has an idea of a good fix, but if not I'll just ask @harold to make his comment an answer ;) – svenevs Oct 05 '17 at 10:17
  • OK how's this: load a 128b piece, then `vinserti128` the corresponding piece from the next iteration (inserting from memory does not count as a shuffle) and effectively use the SSSE3 version of the loop but with two iterations at once. Probably bottlenecked by stores though.. – harold Oct 05 '17 at 10:29 (see the sketch after these comments)
  • Good thought, though your suspicions were correct (I believe) in terms of saturation. I didn't quite get a full version working, but performance was degraded after only getting half the rewrite through. Perhaps I was wasteful with how I approached it, but I'm just going to stick with the 128bit version. Please make your comment about `_mm256_shuffle_epi8` an answer so I can accept it, as that was the root problem with my code. Thank you both for your thoughts and suggestions, this has been an informative experience! – svenevs Oct 05 '17 at 22:34
  • Related: [What is the inverse of "\_mm256\_cvtepi16\_epi32"](https://stackoverflow.com/q/49721807) re: how to work around the inconvenient design for unpack / repack. – Peter Cordes Jul 29 '23 at 19:18
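Here is a rough sketch of harold's vinserti128 idea from the comment above (my own, untested; the function name and structure are made up for illustration). Each iteration builds 8 RGBA pixels from two 16-byte loads placed 12 bytes apart, so the vpshufb never has to cross a 128-bit lane. Note the last iteration reads 4 bytes past the end of the RGB buffer, the same caveat as the SSSE3 answer.

#include <immintrin.h>
#include <stdint.h>

// Sketch only: assumes width * height is a multiple of 8 (true for 640x480)
// and that reading 4 bytes past the end of `source` is acceptable.
static void rgb_to_rgba_avx2(const uint8_t *source, uint8_t *destination,
                             unsigned width, unsigned height)
{
    // Same mask in both 128-bit lanes: expand each 3-byte RGB group to RGB0.
    const __m256i expand = _mm256_setr_epi8(
        0, 1, 2, -1,  3, 4, 5, -1,  6, 7, 8, -1,  9, 10, 11, -1,
        0, 1, 2, -1,  3, 4, 5, -1,  6, 7, 8, -1,  9, 10, 11, -1);
    const __m256i alpha = _mm256_set1_epi32(0xff000000); // A = 0xff in each dword

    unsigned pixels = width * height;
    for (unsigned i = 0; i < pixels; i += 8) {
        // Low lane: pixels i..i+3 (12 of the 16 loaded bytes are used).
        __m128i lo = _mm_loadu_si128((const __m128i *)(source + 3 * i));
        // High lane: pixels i+4..i+7, inserted straight from memory
        // (vinserti128 with a memory source is a load, not a shuffle uop).
        __m256i v = _mm256_inserti128_si256(_mm256_castsi128_si256(lo),
            _mm_loadu_si128((const __m128i *)(source + 3 * i + 12)), 1);
        v = _mm256_shuffle_epi8(v, expand);   // in-lane RGB -> RGB0
        v = _mm256_or_si256(v, alpha);        // fill alpha with 0xff
        _mm256_storeu_si256((__m256i *)(destination + 4 * i), v);
    }
}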

1 Answer


_mm256_shuffle_epi8 works like two _mm_shuffle_epi8's side by side, instead of like a more useful (but probably higher-latency) full-width shuffle that could put any byte anywhere. Here's a diagram from www.officedaytime.com/simd512e:

[Diagram: vpshufb]
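A small standalone demo of that in-lane behaviour (my addition, not part of the original answer): within each 128-bit half, vpshufb uses only bits 3:0 of each control byte as the index (bit 7 still zeroes the byte), so an index like 23 selects byte 23 & 15 = 7 of whichever lane it sits in, never byte 23 of the whole register.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    unsigned char bytes[32];
    for (int i = 0; i < 32; ++i) bytes[i] = (unsigned char)i; // byte n holds value n

    __m256i v   = _mm256_loadu_si256((const __m256i *)bytes);
    __m256i idx = _mm256_set1_epi8(23);       // every output byte asks for "byte 23"
    __m256i r   = _mm256_shuffle_epi8(v, idx);

    unsigned char out[32];
    _mm256_storeu_si256((__m256i *)out, r);
    printf("%d %d\n", out[0], out[16]);       // prints "7 23": 23 & 15 = 7 within each lane
    return 0;
}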

AVX512VBMI has new byte-granularity shuffles such as vpermb that can cross lanes, but current processors don't support that instruction set extension yet.

harold
  • `vpermb` and `vpermt2b` will probably be worse throughput as well as latency :(. `vpermt2w` on skylake-avx512 is 3 uops (p0 + 2p5), 7c latency 2c throughput. https://github.com/InstLatx64/InstLatx64 for spreadsheets from IACA. So in-lane + `vpermq` is at least as fast, and will get the job done in some cases. (e.g. for `packsswb`. It turns out `vpmovwb` is also 2 uops for p5, even though it's a 1-input shuffle that produces a half-width result. All the `vpmov` shuffles are 2 uops except `vpmovqd`, the only one with a dword destination element size and truncation instead of saturation.) – Peter Cordes Oct 06 '17 at 09:47
  • `vpermt2d/q` are single-uop 3c latency on SKX, though. In general it's only the lane-crossing small-element shuffles (including `vpermw`) that are more uops. Maybe in the far future some CPUs will implement even the most complex as single-uop, or at least have more shuffle ports, but it will probably sometimes be worth it to use more instructions but fewer uops. I realized recently that a shuffle with merge-masking gives you a limited form of 2-input shuffle, which is pretty damn cool (unless it takes an extra ALU uop to merge). I haven't yet found any great use-cases, though. – Peter Cordes Oct 06 '17 at 09:49