
I need to combine two arrays into a third in chunks of four. Specifically, for input arrays

    A0, A1, A2, A3, A4, A5, A6, A7 ...
    B0, B1, B2, B3, B4, B5, B6, B7 ...

the output should be

    A0 A1 A2 A3   B0 B1 B2 B3   A4 A5 A6 A7   B4 B5 B6 B7, ...,

In a sense, this is the reverse of the de-interleave question asked in "Fastest de-interleave operation in C?".

For some extra fun, the two buffers contain elements that are eight and sixteen bits wide, respectively. I have written some code to do this, but profiling shows that a lot of time is spent in it, so I am looking for ways to speed it up. SIMD intrinsics are not an option because my target CPU (LEON) does not provide them. My CPU has a word length of 16 bits.
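For checking any faster variant against a known-good result, the mapping described above can be written as plain index arithmetic. This is only a minimal sketch: the name `interleave_ref` and its parameters are hypothetical, and it assumes the element count is a multiple of four.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference (unoptimized) interleave in chunks of four.
   Element k of a goes to out[(k/4)*8 + k%4],
   element k of b goes to out[(k/4)*8 + 4 + k%4].
   Assumes n is a multiple of 4 and out holds 2*n elements. */
void interleave_ref(const int8_t *a, const int16_t *b, int16_t *out, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        size_t base = (k / 4) * 8 + (k % 4);
        out[base]     = a[k];  /* int8_t is sign-extended to int16_t */
        out[base + 4] = b[k];
    }
}
```

It is not meant to be fast; its only purpose is to produce the expected output so that an optimized loop can be compared against it element by element.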

I have tried different ways of doing the loop and this is the fastest version I have so far:

    #include <stdint.h>

    #define BUFSZ 1024

    int8_t  A[BUFSZ];             // 1st buffer (8-bit elements)
    int16_t B[BUFSZ];             // 2nd buffer (16-bit elements)
    int16_t interleaved[2*BUFSZ]; // the two buffers combined

    void interleave(void)
    {
        register int i;
        int8_t  *pA;
        int16_t *pB, *pinterleaved;

        pinterleaved=interleaved;
        // Each iteration writes four elements of A (widened to 16 bits), then four of B.
        for(i=BUFSZ/4, pA=A, pB=B; i-->0; pinterleaved+=8, pA+=4, pB+=4){
            pinterleaved[0]=pA[0]; pinterleaved[1]=pA[1]; pinterleaved[2]=pA[2]; pinterleaved[3]=pA[3];
            pinterleaved[4]=pB[0]; pinterleaved[5]=pB[1]; pinterleaved[6]=pB[2]; pinterleaved[7]=pB[3];
        }
    }

Any ideas for a faster implementation?

MikeLima
  • Looks pretty tight to me. Have you tried writing 32 bits at a time instead of 16? - I mean, have `interleaved` be an array of `int32_t`. I seem to recall that certain CPU models are slower with 16-bit memory writes - I don't have experience with yours specifically. – 500 - Internal Server Error Sep 20 '19 at 07:31
  • Post the assembly that the compiler generates for this code. – EOF Sep 20 '19 at 07:36
  • I forgot to mention that I have a 16-bit CPU, so doing it with 32 bits should be slower, although I have not tested it. – MikeLima Sep 20 '19 at 07:36
  • Is the decrementing `i` part of a performance improvement? – Support Ukraine Sep 20 '19 at 09:08
  • It's a bit off topic, but instead of optimizing this task, did you check that the task is really necessary? Can't you just use A and B directly in the rest of your program instead of copying everything into `interleaved`? I don't see the added value of this copy, especially since the two arrays seem to contain heterogeneous data. When optimizing code, you have to look at the big picture and not necessarily focus on just one function. – Guillaume Petitjean Sep 20 '19 at 09:17
  • @Guillaume Petitjean: unfortunately the copying is necessary, as the combined data are meant to be sent via DMA to a hardware accelerator – MikeLima Sep 20 '19 at 09:29
  • @4386427: Sort of, although counting upwards is not much different in terms of speed. – MikeLima Sep 20 '19 at 09:32
  • As @EOF said, it would be good to look at the assembly to find out if something can be optimized. – Guillaume Petitjean Sep 20 '19 at 09:38
  • It's unlikely you can optimize this C code given the shortcomings of the CPU, unless the assembly shows something particularly egregious. – Veedrac Sep 20 '19 at 11:21
  • @PSkocik: Why do you believe the OP’s LEON processor has SIMD instructions? – Eric Postpischil Sep 20 '19 at 13:39
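
The 32-bit-write idea from the first comment could be sketched roughly as follows. This is an untested-on-target illustration, not a recommendation: `pack2` and `interleave_words` are hypothetical names, the destination is assumed to be declared as an array of `uint32_t`, and the packing assumes a big-endian target so that the first 16-bit value of each pair ends up at the lower address. Whether it is faster than 16-bit stores depends on the CPU and has to be measured.

```c
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word, first value in the high half.
   ASSUMPTION: big-endian target, so the high half is stored at the lower address. */
static uint32_t pack2(int16_t first, int16_t second)
{
    return ((uint32_t)(uint16_t)first << 16) | (uint16_t)second;
}

/* Interleave in chunks of four using one 32-bit store per pair of elements.
   Assumes n is a multiple of 4 and out32 holds n words (= 2*n 16-bit elements). */
void interleave_words(const int8_t *a, const int16_t *b, uint32_t *out32, unsigned n)
{
    unsigned k;
    for (k = 0; k < n; k += 4, out32 += 4) {
        out32[0] = pack2(a[k],     a[k + 1]);  /* A values are sign-extended first */
        out32[1] = pack2(a[k + 2], a[k + 3]);
        out32[2] = pack2(b[k],     b[k + 1]);
        out32[3] = pack2(b[k + 2], b[k + 3]);
    }
}
```

On a little-endian machine the two halves of each word would have to be swapped in `pack2` to get the same byte layout in memory.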

1 Answer


Optimizing performance is often a very system-specific task, so my observation may not hold on your system.

Anyway, FWIW, on my system I see a performance improvement when I replace the last four assignments (those using pB) with a memcpy.

I replaced:

    pinterleaved[4]=pB[0]; pinterleaved[5]=pB[1]; pinterleaved[6]=pB[2]; pinterleaved[7]=pB[3];

with

    memcpy(pinterleaved + 4, pB, 4 * sizeof *pB);

and measured a performance improvement of more than 25%.
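
For reference, here is a self-contained sketch of the whole loop with that replacement applied. The function name `interleave_memcpy` and its parameters are illustrative and not taken from the question; `BUFSZ` is the question's buffer size. Whether `memcpy` helps depends on the compiler and target (some compilers expand a small fixed-size copy inline, others emit a library call), so it should be measured on the actual CPU.

```c
#include <stdint.h>
#include <string.h>

#define BUFSZ 1024

/* Variant of the question's loop: the four 8-bit elements are widened
   one by one, the four 16-bit elements are copied as a single block.
   pinterleaved must hold 2*BUFSZ elements. */
void interleave_memcpy(const int8_t *pA, const int16_t *pB, int16_t *pinterleaved)
{
    int i;
    for (i = BUFSZ / 4; i-- > 0; pinterleaved += 8, pA += 4, pB += 4) {
        pinterleaved[0] = pA[0];
        pinterleaved[1] = pA[1];
        pinterleaved[2] = pA[2];
        pinterleaved[3] = pA[3];
        memcpy(pinterleaved + 4, pB, 4 * sizeof *pB);
    }
}
```

The A half cannot be handled the same way, because each 8-bit element has to be widened to 16 bits rather than copied byte for byte.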

Support Ukraine
  • This may be a bit crazy, but if the OP benefits from `memcpy`, I would also try a temp buffer of eight `int8_t`, fill the appropriate positions with the 8-bit values, and `memcpy` it to the destination: `int8_t T[8]; T[1] = pA[0]; T[3] = pA[1]; T[5] = pA[2]; T[7] = pA[3]; memcpy(pinterleaved, T, 8);` – Vlad Feinstein Oct 03 '19 at 23:59