
For my RGB-to-grey conversion:

Y = (77*R + 150*G + 29*B) >> 8;

I know SIMD (NEON, SSE2) can do something like this:

foreach 8 elements:
{P0,P1,P2,P3,P4,P5,P6,P7} = 77*{R0,R1,R2,R3,R4,R5,R6,R7}
{Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7} = 150*{G0,G1,G2,G3,G4,G5,G6,G7}
{S0,S1,S2,S3,S4,S5,S6,S7} = 29*{B0,B1,B2,B3,B4,B5,B6,B7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {P0,P1,P2,P3,P4,P5,P6,P7} + {Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} + {S0,S1,S2,S3,S4,S5,S6,S7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} >> 8
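For reference, here is a minimal scalar sketch of the same computation. The function names and the planar layout (separate R, G and B arrays) are assumptions made for illustration only. Since 77 + 150 + 29 = 256, the weighted sum never exceeds 255 * 256 = 65280, so it fits in an unsigned 16-bit lane.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for Y = (77*R + 150*G + 29*B) >> 8.
   The weights sum to 256, so the sum is at most 65280 (fits in 16 bits). */
uint8_t rgb_to_grey(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)((77u * r + 150u * g + 29u * b) >> 8);
}

/* Convert n pixels stored as separate R, G, B planes (assumed layout). */
void rgb_planes_to_grey(const uint8_t *r, const uint8_t *g,
                        const uint8_t *b, uint8_t *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = rgb_to_grey(r[i], g[i], b[i]);
}
```

A SIMD version should produce exactly the same bytes as this loop, which makes it a convenient baseline for testing.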

However, a multiply instruction takes at least 2 clock cycles. Since R, G and B are all in [0, 255], we could use three lookup tables (arrays of length 256) to store the partial results 77*R (call it X), 150*G (call it Y) and 29*B (call it Z). So I'm looking for instructions that can do the following:

foreach 8 elements:
{P0,P1,P2,P3,P4,P5,P6,P7} = {X[R0],X[R1],X[R2],X[R3],X[R4],X[R5],X[R6],X[R7]}
{Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7} = {Y[G0],Y[G1],Y[G2],Y[G3],Y[G4],Y[G5],Y[G6],Y[G7]}
{S0,S1,S2,S3,S4,S5,S6,S7} = {Z[B0],Z[B1],Z[B2],Z[B3],Z[B4],Z[B5],Z[B6],Z[B7]}
{D0,D1,D2,D3,D4,D5,D6,D7} = {P0,P1,P2,P3,P4,P5,P6,P7} + {Q0,Q1,Q2,Q3,Q4,Q5,Q6,Q7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} + {S0,S1,S2,S3,S4,S5,S6,S7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} >> 8
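In scalar C the lookup-table idea would look roughly like the sketch below. The names init_tables and grey_lut are made up for illustration, and the planar input layout is an assumption. Each table holds 256 16-bit entries, i.e. 512 bytes per table.

```c
#include <stddef.h>
#include <stdint.h>

/* Three 256-entry tables of 16-bit partial products (512 bytes each). */
uint16_t X[256], Y[256], Z[256];

void init_tables(void)
{
    for (int i = 0; i < 256; i++) {
        X[i] = (uint16_t)(77 * i);   /* 77 * R  */
        Y[i] = (uint16_t)(150 * i);  /* 150 * G */
        Z[i] = (uint16_t)(29 * i);   /* 29 * B  */
    }
}

/* One table lookup per channel, then add and shift, one pixel at a time. */
void grey_lut(const uint8_t *r, const uint8_t *g, const uint8_t *b,
              uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)((X[r[i]] + Y[g[i]] + Z[b[i]]) >> 8);
}
```

What I am after is a vector instruction that performs the eight X[...] lookups of one line in a single step.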

Any good suggestions?

Dabo

1 Answer


There are no byte or word gather instructions in AVX2 / AVX512, and no gathers at all in NEON. The DWORD gathers that do exist are much slower than a multiply! For example, vpgatherdd ymm,[reg + scale*ymm], ymm has a throughput of one per 5 cycles on Skylake, according to Agner Fog's instruction tables.

You can use shuffles as a parallel table-lookup. But your table for each lookup is 256 16-bit words, which is 512 bytes. AVX512 has some shuffles that select from the concatenation of 2 registers, but that's "only" 2x 64 bytes, and the byte or word element-size versions of those (e.g. AVX512BW vpermi2w) are multiple uops on current CPUs. They are still fantastically powerful compared to vpshufb, though.

So using a shuffle as a LUT won't work in your case, but it does work very well for some cases, e.g. for popcount you can split bytes into 4-bit nibbles and use vpshufb to do 32 lookups in parallel from a 16-element table of bytes.
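To make the popcount case concrete, here is a scalar model of the nibble lookup. This is only a sketch: the loop body is what a byte shuffle like vpshufb would do for every byte lane at once, with the 16-entry table held in a vector register. The function and table names are hypothetical.

```c
#include <stdint.h>

/* Bit counts of the 16 possible nibble values. In the SIMD version this
   16-byte table is the register that the byte shuffle selects from. */
static const uint8_t nibble_popcount[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Per-byte popcount of a 16-byte block: split each byte into two 4-bit
   indices, look both up in the table, and add the two results. */
void popcount_bytes16(const uint8_t in[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t lo = in[i] & 0x0F;  /* low nibble index  */
        uint8_t hi = in[i] >> 4;    /* high nibble index */
        out[i] = (uint8_t)(nibble_popcount[lo] + nibble_popcount[hi]);
    }
}
```

This works because the table has only 16 one-byte entries, so it fits in a single vector register. The 256-entry, 16-bit tables from the question do not.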

Normally for SIMD you want to replace table lookups with computation, because computation is much more SIMD friendly.


Suck it up and use pmullw / _mm_mullo_epi16. You have instruction-level parallelism, and Skylake has 2 per clock throughput for 16-bit SIMD multiply (but 5 cycle latency). For image processing, normally throughput matters more than latency, as long as you keep the latency within reason so out-of-order execution can hide it.
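A minimal sketch of the multiply approach with SSE2 intrinsics might look like the following. It assumes planar input (separate R, G and B arrays); interleaved RGB would need a deinterleave step first, which is not shown. The function name grey_mullo is made up, and the code falls back to the plain scalar loop when SSE2 is not available at compile time.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Y = (77*R + 150*G + 29*B) >> 8 for n pixels, planar input (assumed). */
void grey_mullo(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                uint8_t *y, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    const __m128i wr = _mm_set1_epi16(77);
    const __m128i wg = _mm_set1_epi16(150);
    const __m128i wb = _mm_set1_epi16(29);
    for (; i + 8 <= n; i += 8) {
        /* Load 8 bytes per channel and zero-extend to 8 x 16-bit lanes. */
        __m128i vr = _mm_unpacklo_epi8(
            _mm_loadl_epi64((const __m128i *)(r + i)), zero);
        __m128i vg = _mm_unpacklo_epi8(
            _mm_loadl_epi64((const __m128i *)(g + i)), zero);
        __m128i vb = _mm_unpacklo_epi8(
            _mm_loadl_epi64((const __m128i *)(b + i)), zero);
        /* pmullw keeps the low 16 bits of each product. The full sum is
           at most 65280, so no lane overflows. */
        __m128i sum = _mm_add_epi16(
            _mm_add_epi16(_mm_mullo_epi16(vr, wr), _mm_mullo_epi16(vg, wg)),
            _mm_mullo_epi16(vb, wb));
        sum = _mm_srli_epi16(sum, 8);  /* logical shift: lanes are unsigned */
        /* Results are <= 255, so the saturating pack changes nothing. */
        _mm_storel_epi64((__m128i *)(y + i), _mm_packus_epi16(sum, zero));
    }
#endif
    /* Scalar loop for the tail (or for everything without SSE2). */
    for (; i < n; i++)
        y[i] = (uint8_t)((77u * r[i] + 150u * g[i] + 29u * b[i]) >> 8);
}
```

The three multiplies in each iteration are independent of each other, and so are successive iterations, so out-of-order execution can overlap them and hide the multiply latency.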

If your multipliers ever have few enough 1 bits in their binary representation, you could consider using shift/add instead of an actual multiply, e.g. B * 29 = B * 32 - B * 2 - B, or (B<<5) - (B<<1) - B. (The parentheses matter in C, where - binds tighter than <<.) That many instructions probably cost more throughput than a single multiply, though. If you could do it with just 2 terms, it might be worth it. (But then again, still maybe not, depending on the CPU. Total instruction throughput and vector ALU bottlenecks are a big deal.)
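As a quick scalar sanity check of that decomposition (a sketch only, with a hypothetical function name): 29 = 32 - 2 - 1, so two shifts and two subtracts replace the multiply. In a vector version the same steps would be word shifts (psllw) and word subtracts (psubw).

```c
#include <stdint.h>

/* 29*b computed as (b << 5) - (b << 1) - b, i.e. 32b - 2b - b.
   The result is at most 29 * 255 = 7395, so it fits in 16 bits. */
uint16_t mul29_shift(uint8_t b)
{
    uint16_t v = b;
    return (uint16_t)((v << 5) - (v << 1) - v);
}
```

That is four operations in place of one multiply, which is why it is unlikely to win here.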

Peter Cordes