I need to combine two arrays into a third in chunks of four. Specifically, for input arrays
A0, A1, A2, A3, A4, A5, A6, A7 ...
B0, B1, B2, B3, B4, B5, B6, B7 ...
the output should be
A0, A1, A2, A3, B0, B1, B2, B3, A4, A5, A6, A7, B4, B5, B6, B7, ...
In a sense, this is the reverse of the question "Fastest de-interleave operation in C?"
For some extra fun, the two buffers hold elements of different widths: A contains 8-bit elements and B contains 16-bit elements. I have written some code to do this, but profiling indicates that a lot of time is spent in the interleaving loop, so I am looking for ways to speed it up. SIMD intrinsics are not an option, because my target CPU (LEON) does not provide them. My CPU has a word length of 16 bits.
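The required layout can be pinned down with a straightforward reference version. This is only a sketch for checking results, not the optimised loop; the name `interleave_ref` and its parameters are illustrative assumptions, and it assumes the element count is a multiple of four.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference (unoptimised) interleave; hypothetical helper for checking only.
 * n is the number of elements in each input and must be a multiple of 4.
 * out must have room for 2*n elements. */
static void interleave_ref(const int8_t *a, const int16_t *b,
                           int16_t *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i++) {
        size_t base = 8 * (i / 4);   /* start of the 8-element output group */
        size_t lane = i % 4;         /* position inside the chunk of four */
        out[base + lane]     = a[i]; /* int8_t value is widened to int16_t */
        out[base + 4 + lane] = b[i];
    }
}
```

Any faster variant should produce exactly the same output buffer as this version for the same inputs.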
I have tried different ways of doing the loop and this is the fastest version I have so far:
#include <stdint.h>

#define BUFSZ 1024

int8_t  A[BUFSZ];              // 1st buffer (8-bit elements)
int16_t B[BUFSZ];              // 2nd buffer (16-bit elements)
int16_t interleaved[2*BUFSZ];  // the two buffers combined

void interleave(void)
{
    register int i;
    int8_t  *pA = A;
    int16_t *pB = B;
    int16_t *pinterleaved = interleaved;

    // Each iteration copies four elements of A (widened to 16 bits),
    // followed by four elements of B.
    for (i = BUFSZ/4; i-- > 0; pinterleaved += 8, pA += 4, pB += 4) {
        pinterleaved[0] = pA[0]; pinterleaved[1] = pA[1]; pinterleaved[2] = pA[2]; pinterleaved[3] = pA[3];
        pinterleaved[4] = pB[0]; pinterleaved[5] = pB[1]; pinterleaved[6] = pB[2]; pinterleaved[7] = pB[3];
    }
}
Any ideas for a faster implementation?