I previously asked the question "Windows C++ fast RGBA32 DX texture to RGB24 buffer".
This question is about a more general form of one possible solution to it.
Stream 1 is composed of 4-byte units:
B0.0 B0.1 B0.2 B0.3 B1.0 B1.1 B1.2 B1.3 ...
Stream 2 is composed of 3-byte units (the fourth byte of each unit is dropped):
B0.0 B0.1 B0.2 B1.0 B1.1 B1.2 ...
What is the most efficient way to convert between them, in both directions?
(1) Answers to some assembly questions suggest that in many simple cases like this the compiler does a better job than hand-written assembly. How should the C code be written so that the compiler can optimize it best?
(1.1) Expansion (loop unrolling): should multiple units be processed per loop iteration?
(1.2) Is it better to operate directly on bytes, or to load larger words and combine them with logical operations (shifts and masks)?
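As a rough illustration of the two options in (1.2), here is a minimal sketch, not a tuned implementation. The function names are hypothetical, the word-wise variant assumes a little-endian host (as on x86), and `fill` is an assumed value for the restored fourth byte. Which variant the compiler optimizes better would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Byte-wise pack: copy the first 3 bytes of every 4-byte unit.
void pack4to3_bytes(const std::uint8_t* src, std::uint8_t* dst, std::size_t units) {
    assert(units == 0 || (src != nullptr && dst != nullptr));
    for (std::size_t i = 0; i < units; ++i) {
        dst[3 * i + 0] = src[4 * i + 0];
        dst[3 * i + 1] = src[4 * i + 1];
        dst[3 * i + 2] = src[4 * i + 2];
    }
}

// Byte-wise unpack: expand every 3-byte unit to 4 bytes, writing `fill` as byte 3.
void unpack3to4_bytes(const std::uint8_t* src, std::uint8_t* dst, std::size_t units,
                      std::uint8_t fill) {
    assert(units == 0 || (src != nullptr && dst != nullptr));
    for (std::size_t i = 0; i < units; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = fill;
    }
}

// Word-wise pack: 4 units (16 bytes) in, three 32-bit words (12 bytes) out.
// ASSUMPTION: little-endian host. memcpy keeps the loads/stores alignment-safe.
void pack4to3_words(const std::uint8_t* src, std::uint8_t* dst, std::size_t units) {
    assert(units == 0 || (src != nullptr && dst != nullptr));
    std::size_t i = 0;
    for (; i + 4 <= units; i += 4) {
        std::uint32_t a, b, c, d;
        std::memcpy(&a, src + 4 * i + 0, 4);
        std::memcpy(&b, src + 4 * i + 4, 4);
        std::memcpy(&c, src + 4 * i + 8, 4);
        std::memcpy(&d, src + 4 * i + 12, 4);
        std::uint32_t w0 = (a & 0x00FFFFFFu) | (b << 24);          // a0 a1 a2 b0
        std::uint32_t w1 = ((b >> 8) & 0x0000FFFFu) | (c << 16);   // b1 b2 c0 c1
        std::uint32_t w2 = ((c >> 16) & 0x000000FFu) | (d << 8);   // c2 d0 d1 d2
        std::memcpy(dst + 3 * i + 0, &w0, 4);
        std::memcpy(dst + 3 * i + 4, &w1, 4);
        std::memcpy(dst + 3 * i + 8, &w2, 4);
    }
    // Remaining 0..3 units go through the byte-wise path.
    pack4to3_bytes(src + 4 * i, dst + 3 * i, units - i);
}
```

Both pack variants should produce identical output on a little-endian machine, so one can be used to check the other before benchmarking.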
(2) x86 assembly: is there a vector instruction that can do this kind of pack and unpack on multiple units at once?
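One candidate for (2) is the SSSE3 byte shuffle `pshufb`, exposed as the intrinsic `_mm_shuffle_epi8`, which permutes the 16 bytes of an XMM register according to a mask. The sketch below is one possible use of it under stated assumptions, not a benchmarked solution: the function names and the `SSSE3_FN` macro are hypothetical, the CPU is assumed to support SSSE3, and `fill` is an assumed value for the restored fourth byte. Each iteration handles 4 units, and the vector loop stops early so that the full 16-byte loads and stores stay inside the 3-byte stream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

// Lets GCC/Clang compile these functions without -mssse3; not needed on MSVC.
#if defined(__GNUC__) || defined(__clang__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

// Pack 4-byte units into 3-byte units, 4 units per vector iteration.
SSSE3_FN void pack4to3_ssse3(const std::uint8_t* src, std::uint8_t* dst, std::size_t units) {
    assert(units == 0 || (src != nullptr && dst != nullptr));
    // Gather bytes 0-2 of each unit into the low 12 bytes; -1 zeroes a lane.
    const __m128i mask = _mm_setr_epi8(0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14,
                                       -1, -1, -1, -1);
    std::size_t i = 0;
    // The 16-byte store writes 4 bytes past the 12 useful ones, so require
    // 6 remaining units; the extra bytes are overwritten by later writes.
    for (; i + 6 <= units; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 4 * i));
        v = _mm_shuffle_epi8(v, mask);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 3 * i), v);
    }
    for (; i < units; ++i) {  // scalar tail
        dst[3 * i + 0] = src[4 * i + 0];
        dst[3 * i + 1] = src[4 * i + 1];
        dst[3 * i + 2] = src[4 * i + 2];
    }
}

// Unpack 3-byte units into 4-byte units, writing `fill` as byte 3 of each unit.
SSSE3_FN void unpack3to4_ssse3(const std::uint8_t* src, std::uint8_t* dst, std::size_t units,
                               std::uint8_t fill) {
    assert(units == 0 || (src != nullptr && dst != nullptr));
    const __m128i mask = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1, 6, 7, 8, -1,
                                       9, 10, 11, -1);
    const __m128i fillv =
        _mm_set1_epi32(static_cast<int>(static_cast<std::uint32_t>(fill) << 24));
    std::size_t i = 0;
    // The 16-byte load reads 4 bytes past the 12 consumed, so require
    // 6 remaining units to stay inside the source buffer.
    for (; i + 6 <= units; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 3 * i));
        v = _mm_or_si128(_mm_shuffle_epi8(v, mask), fillv);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 4 * i), v);
    }
    for (; i < units; ++i) {  // scalar tail
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = fill;
    }
}
```

Whether this beats what the compiler generates from plain C for question (1) would need to be measured on the target machine.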