I previously asked this question: Windows C++ fast RGBA32 DX texture to RGB24 buffer.

This question is about a more general form of the problem behind one possible solution to that question.

Stream 1 is composed of 4-byte units:

B0.0 B0.1 B0.2 B0.3 B1.0 B1.1 B1.2 B1.3 ...

Stream 2 is composed of 3-byte units:

B0.0 B0.1 B0.2 B1.0 B1.1 B1.2 ...

What is the most efficient way to convert between them, in both directions?

(1) Some assembly questions suggest that in many simple cases like this the compiler can do a better job. How should the C code be written so that the compiler can optimize it best?

(1.1) Expansion (unrolling): process multiple units per loop iteration.

(1.2) Is it better to operate directly on bytes, or on larger words using logical operations?

(2) x86 assembly: is there a vector instruction that can do this kind of pack and unpack on multiple units at once?
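For question (1), the two scalar variants being compared might look roughly like the sketch below. The function names (`pack4to3_bytes`, `pack4to3_dword`, `unpack3to4_dword`) and the `fill` parameter are hypothetical and chosen only for illustration. The dword versions use `memcpy` for the unaligned 4-byte accesses and let each 4-byte store overlap the next unit by one byte; the last unit is handled separately so nothing is read or written past the end of either buffer. Whether a given compiler vectorizes either form should be checked in its generated code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise pack: copy the first 3 bytes of each 4-byte unit. */
void pack4to3_bytes(uint8_t *dst, const uint8_t *src, size_t units)
{
    for (size_t i = 0; i < units; i++) {
        dst[3 * i + 0] = src[4 * i + 0];
        dst[3 * i + 1] = src[4 * i + 1];
        dst[3 * i + 2] = src[4 * i + 2];
    }
}

/* Dword pack: one 4-byte load and one overlapping 4-byte store per unit.
 * The extra 4th byte stored is overwritten by the next unit's first byte. */
void pack4to3_dword(uint8_t *dst, const uint8_t *src, size_t units)
{
    if (units == 0)
        return;
    size_t i = 0;
    for (; i + 1 < units; i++) {
        uint32_t v;
        memcpy(&v, src + 4 * i, 4);   /* unaligned-safe load  */
        memcpy(dst + 3 * i, &v, 4);   /* overlapping store    */
    }
    memcpy(dst + 3 * i, src + 4 * i, 3);  /* last unit: exactly 3 bytes */
}

/* Dword unpack: read 4 bytes (one byte into the next unit), replace the
 * 4th byte with `fill` (for example an alpha value), store 4 bytes. */
void unpack3to4_dword(uint8_t *dst, const uint8_t *src, size_t units,
                      uint8_t fill)
{
    if (units == 0)
        return;
    size_t i = 0;
    for (; i + 1 < units; i++) {
        uint8_t tmp[4];
        memcpy(tmp, src + 3 * i, 4);  /* over-reads 1 byte, still in bounds */
        tmp[3] = fill;
        memcpy(dst + 4 * i, tmp, 4);
    }
    memcpy(dst + 4 * i, src + 3 * i, 3);  /* last unit: read exactly 3 bytes */
    dst[4 * i + 3] = fill;
}
```

Because the 4-byte load and store both go through `memcpy`, the byte order within a unit is preserved regardless of the machine's endianness.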

jw_
  • Yes, you can do this with SSSE3 `pshufb`, `_mm_shuffle_epi8`, reading 16 bytes and writing 12 bytes (actually 16 unaligned bytes, overlapping the previous vector by 4 in the output stream). Doing it without SSSE3 is probably not worth it; for that you'd want scalar dword loads and overlapping dword stores. (Not 3x byte loads/stores!) – Peter Cordes Apr 19 '20 at 00:56
  • IDK if any of the big 4 compilers will auto-vectorize this for you, whether you write it as `uint32_t` loads and overlapping 3-byte stores with `memcpy`, or as separate byte loads and stores. – Peter Cordes Apr 19 '20 at 00:57
  • Looks like gcc will: https://godbolt.org/z/JS_s3W – Nate Eldredge Apr 19 '20 at 01:46
  • Found [Fast method to copy memory with translation - ARGB to BGR](https://stackoverflow.com/a/6804399) which packs (and reverses). You can change the shuffle vector to *just* pack. (The other direction is [Fast vectorized conversion from RGB to BGRA](https://stackoverflow.com/q/7194452).) – Peter Cordes Apr 19 '20 at 02:05
  • @NateEldredge: Oh wow, instead of just doing overlapping stores, GCC is shuffling data between vectors so it can do full 16-byte stores into the destination. Interesting. That's likely not profitable, probably bottlenecking on shuffle port bandwidth, using 6 `pshufb` for 3x16 bytes of output. That's 8 bytes of output per pshufb vs. 12 bytes per pshufb the normal overlap way. So it's not terrible and might keep up with memory if you miss in L3. – Peter Cordes Apr 19 '20 at 02:11
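The overlapping-store `pshufb` approach described in the comments could be sketched as follows. This is only an illustration under stated assumptions: the function names `pack4to3_ssse3` and `unpack3to4_ssse3` and the `fill` parameter are hypothetical, the vector path is compiled only when SSSE3 is enabled (for example with `-mssse3`), and a scalar loop handles the tail and any build without SSSE3. Each vector iteration converts 4 units with one shuffle; the loop stops early enough that the full 16-byte access on the 3-byte stream stays inside the buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pshufb) */
#endif

/* Pack 4-byte units into 3-byte units, dropping byte 3 of each unit. */
void pack4to3_ssse3(uint8_t *dst, const uint8_t *src, size_t units)
{
    size_t i = 0;
#if defined(__SSSE3__)
    /* Move bytes 0-2, 4-6, 8-10, 12-14 to the low 12 bytes; -1 zeroes a byte. */
    const __m128i shuf = _mm_setr_epi8(0, 1, 2, 4, 5, 6, 8, 9, 10,
                                       12, 13, 14, -1, -1, -1, -1);
    /* The 16-byte store at dst + 3*i must fit: 3*i + 16 <= 3*units. */
    for (; i + 6 <= units; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + 4 * i));
        v = _mm_shuffle_epi8(v, shuf);
        /* The top 4 bytes stored here are overwritten by later stores. */
        _mm_storeu_si128((__m128i *)(dst + 3 * i), v);
    }
#endif
    for (; i < units; i++)
        memcpy(dst + 3 * i, src + 4 * i, 3);
}

/* Unpack 3-byte units into 4-byte units, setting byte 3 of each to `fill`. */
void unpack3to4_ssse3(uint8_t *dst, const uint8_t *src, size_t units,
                      uint8_t fill)
{
    size_t i = 0;
#if defined(__SSSE3__)
    const __m128i shuf = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1,
                                       6, 7, 8, -1, 9, 10, 11, -1);
    /* x86 is little-endian, so byte 3 of each dword is its top byte. */
    const __m128i fillv = _mm_set1_epi32((int)((uint32_t)fill << 24));
    /* The 16-byte load at src + 3*i must fit: 3*i + 16 <= 3*units. */
    for (; i + 6 <= units; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + 3 * i));
        v = _mm_or_si128(_mm_shuffle_epi8(v, shuf), fillv);
        _mm_storeu_si128((__m128i *)(dst + 4 * i), v);
    }
#endif
    for (; i < units; i++) {
        memcpy(dst + 4 * i, src + 3 * i, 3);
        dst[4 * i + 3] = fill;
    }
}
```

This is the 12-bytes-of-output-per-shuffle pattern from the comments; an unrolled version could use several independent load/shuffle/store chains per iteration.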
