Is 32-bit image processing faster than 24-bit image processing when SIMD instructions are used?

Question

I had a look at the SSE and MMX instruction sets and found no instructions aimed at 3-channel image processing. Of course, many operations, such as averaging two images, can use the same instructions regardless of the channel count. But for operations like unshuffling (deinterleaving) the channels or mixing different channels by a linear transformation, it seems a lot easier to use 32-bit images.

What are the performance characteristics of typical image-processing tasks with 24-bit vs. 32-bit images?

score 4 · Accepted Answer · answered Aug 10 '12 at 14:57

24 bits/pixel is faster if your images are large and the operations are simple (such as alpha blending).

Very often the operations in image processing are quite simple, but you execute millions of them. So the time spent moving data between main memory and the CPU can easily dominate the performance of an algorithm.

Therefore 24 bit/pixel images can offer an advantage over 32 bit/pixel images, because there is 25% less data to move around.
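To make the 25% figure concrete, here is a minimal sketch in C. The helper name `image_bytes` is made up for illustration, and it assumes tightly packed rows with no padding, which real image buffers do not always have.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: size in bytes of a tightly packed image
 * (assumption: no row padding). */
size_t image_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    assert(bytes_per_pixel == 3 || bytes_per_pixel == 4);
    return width * height * bytes_per_pixel;
}
```

For a 1920x1080 frame this gives 6,220,800 bytes at 24 bits/pixel versus 8,294,400 bytes at 32 bits/pixel, so every full pass over the image reads or writes about 2 MB less.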

Writing image-processing code that performs well with 24 bit/pixel is a pain though. The SSE instructions don't really fit the data: a 16-byte register holds 5 1/3 pixels, so you have to shuffle bytes around, and then you have to deal with all the different alignments, because pixel boundaries fall at a different offset in each successive register.

If the images you are working with are small and fit in the L1 or L2 cache, things are different and the CPU time will dominate the performance. In these cases 32 bit/pixel is faster.

Actually if I remember correctly you can do patching (I think this is called buffering), and do the work on top of patches or blocks of a given size that will fit the L1 or L2 cache. Then you move your working block along the matrix. If the flop count is high then the fast cached access will offset the cost of copying the patch. — SkyWalker, Aug 12 '12 at 21:21
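The blocking idea from the comment above could look roughly like the following C sketch. Everything in it is an assumption made for illustration: the tile size (8 KiB, guessed to fit in L1), the names `process_tiled` and `work_on_tile`, and the placeholder per-byte operation. A real kernel would do much more work per tile, which is what pays for the copies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed tile size: 256 bytes x 32 rows = 8 KiB, meant to fit in L1. */
enum { TILE_W = 256, TILE_H = 32 };

/* Placeholder for the expensive per-tile work (here: invert each byte). */
static void work_on_tile(uint8_t *tile, size_t w, size_t h)
{
    for (size_t i = 0; i < w * h; ++i)
        tile[i] = (uint8_t)(255 - tile[i]);
}

/* Walk the image tile by tile: copy a tile into a small local buffer,
 * work on it there, then copy it back. row_bytes is the payload width
 * in bytes (3 or 4 bytes per pixel), stride the distance between rows. */
void process_tiled(uint8_t *img, size_t row_bytes, size_t rows, size_t stride)
{
    assert(img && stride >= row_bytes);
    uint8_t tile[TILE_W * TILE_H];

    for (size_t y0 = 0; y0 < rows; y0 += TILE_H) {
        size_t th = rows - y0 < TILE_H ? rows - y0 : TILE_H;
        for (size_t x0 = 0; x0 < row_bytes; x0 += TILE_W) {
            size_t tw = row_bytes - x0 < TILE_W ? row_bytes - x0 : TILE_W;

            for (size_t y = 0; y < th; ++y)   /* copy in */
                memcpy(tile + y * tw, img + (y0 + y) * stride + x0, tw);

            work_on_tile(tile, tw, th);

            for (size_t y = 0; y < th; ++y)   /* copy out */
                memcpy(img + (y0 + y) * stride + x0, tile + y * tw, tw);
        }
    }
}
```

Because the tile is addressed in bytes, the same loop serves 24-bit and 32-bit images. Only the code inside `work_on_tile` would need to know the pixel layout.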

score 3 · Answer 2 · answered Aug 10 '12 at 14:54

On newer x86 CPUs with PSHUFB (the SSSE3 instruction behind the _mm_shuffle_epi8 intrinsic), splitting the channels can be done in a few cycles, and it can be cheaper than incurring additional memory accesses due to extending the pixel width to 32 bits. On older x86 CPUs without PSHUFB it requires many shuffle or unpack instructions, and 32-bit pixels are much more efficient.
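As a rough illustration of the PSHUFB approach, the C sketch below regroups one 16-byte register of packed RGB data with a single shuffle. The function name `regroup16_rgb` and the mask layout are my own choices, not something from the answer. A scalar fallback is included so the code also builds where SSSE3 is not enabled at compile time. Note that this handles only one register: a full deinterleave of 16 pixels (48 bytes) needs three such shuffles plus some blending to combine the pieces.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>
#endif

/* Source byte index for each output byte: R bytes first, then G, then B.
 * 16 bytes of RGBRGB... hold R0..R5, G0..G4 and B0..B4. */
static const uint8_t kRegroupMask[16] = {
    0, 3, 6, 9, 12, 15,   /* R0..R5 */
    1, 4, 7, 10, 13,      /* G0..G4 */
    2, 5, 8, 11, 14       /* B0..B4 */
};

/* Regroup 16 bytes of interleaved RGB into R..., G..., B... order. */
void regroup16_rgb(const uint8_t *src, uint8_t *dst)
{
    assert(src && dst);
#if defined(__SSSE3__)
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    __m128i m = _mm_loadu_si128((const __m128i *)kRegroupMask);
    _mm_storeu_si128((__m128i *)dst, _mm_shuffle_epi8(v, m)); /* PSHUFB */
#else
    for (int i = 0; i < 16; ++i)   /* scalar model of the same shuffle */
        dst[i] = src[kRegroupMask[i]];
#endif
}
```

The unaligned load and store are deliberate: with 3-byte pixels, most 16-byte chunks of a row do not start on a 16-byte boundary.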

On ARM CPUs with NEON, splitting the channels can be done for free by the load-store unit (using the de-interleaving VLD3 load). On ARM CPUs without NEON, splitting the channels can be done with ARMv6 SIMD instructions at the cost of about 3 instructions per pixel.
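For the NEON case, a minimal C sketch might look like this. It uses the `vld3q_u8` intrinsic, which loads 48 bytes and de-interleaves them into three 16-byte registers. The function name `split16_rgb` is hypothetical, and the scalar branch is there only so the code builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Split 16 packed RGB pixels (48 bytes) into three 16-byte planes. */
void split16_rgb(const uint8_t *rgb, uint8_t *r, uint8_t *g, uint8_t *b)
{
    assert(rgb && r && g && b);
#if defined(__ARM_NEON)
    uint8x16x3_t px = vld3q_u8(rgb);   /* de-interleaving load */
    vst1q_u8(r, px.val[0]);
    vst1q_u8(g, px.val[1]);
    vst1q_u8(b, px.val[2]);
#else
    for (size_t i = 0; i < 16; ++i) {  /* scalar equivalent */
        r[i] = rgb[3 * i + 0];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
#endif
}
```

Here the channel split is part of the load itself, so no separate shuffle step appears in the code.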

I overlooked the PSHUFB instruction. Good hint. Thanks. — Ralph Tandetzky, Aug 11 '12 at 11:03
