0

I have a few places in my code that could really use some speeding up. When I try to use CM4 SIMD instructions, the result is always slower than the scalar version. For example, this is an alpha blending function I'm using a lot; it's not really slow, but it serves as an example:

for (int y=0; y<h; y++) {
    i=y*w;
    for (int x=0; x<w; x++) {
        uint spix = *srcp++;
        uint dpix = dstp[i+x];
        uint r=(alpha*R565(spix)+(256-alpha)*R565(dpix))>>8;
        uint g=(alpha*G565(spix)+(256-alpha)*G565(dpix))>>8;
        uint b=(alpha*B565(spix)+(256-alpha)*B565(dpix))>>8;
        dstp[i+x]= RGB565(r, g, b);
    }
}

R565, G565, B565 and RGB565 are macros that extract and pack RGB565 components, respectively; please ignore them.

Now I tried using __SMUAD to see if anything changes; the result was slower (or the same speed as the original code), and I even tried loop unrolling, with no luck:

uint v0, vr, vg, vb;
v0 = (alpha<<16)|(256-alpha);
for (int y=0; y<h; y++) {
    i=y*w;
    for (int x=0; x<w; x++) {
        spix = *srcp++;
        dpix = dstp[i+x];

        uint vr = R565(spix)<<16 | R565(dpix);
        uint vg = G565(spix)<<16 | G565(dpix);
        uint vb = B565(spix)<<16 | B565(dpix);

        uint r = __SMUAD(v0, vr)>>8;
        uint g = __SMUAD(v0, vg)>>8;
        uint b = __SMUAD(v0, vb)>>8;

        dstp[i+x]= RGB565(r, g, b);
    }
}

I know this has been asked before, but given the architectural differences, and the fact that none of the answers really solve my problem, I'm asking again. Thanks!

Update

Scalar disassembly:

    Disassembly of section .text.blend:

00000000 <blend>:
   0:   e92d 0ff0   stmdb   sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
   4:   6846        ldr r6, [r0, #4]
   6:   68c4        ldr r4, [r0, #12]
   8:   b086        sub sp, #24
   a:   199e        adds    r6, r3, r6
   c:   9601        str r6, [sp, #4]
   e:   9200        str r2, [sp, #0]
  10:   68ca        ldr r2, [r1, #12]
  12:   f89d 5038   ldrb.w  r5, [sp, #56]   ; 0x38
  16:   9204        str r2, [sp, #16]
  18:   9a01        ldr r2, [sp, #4]
  1a:   426e        negs    r6, r5
  1c:   4293        cmp r3, r2
  1e:   b2f6        uxtb    r6, r6
  20:   da5b        bge.n   da <blend+0xda>
  22:   8809        ldrh    r1, [r1, #0]
  24:   6802        ldr r2, [r0, #0]
  26:   9102        str r1, [sp, #8]
  28:   fb03 fb01   mul.w   fp, r3, r1
  2c:   9900        ldr r1, [sp, #0]
  2e:   4411        add r1, r2
  30:   9103        str r1, [sp, #12]
  32:   0052        lsls    r2, r2, #1
  34:   9205        str r2, [sp, #20]
  36:   9903        ldr r1, [sp, #12]
  38:   9a00        ldr r2, [sp, #0]
  3a:   428a        cmp r2, r1
  3c:   fa1f fb8b   uxth.w  fp, fp
  40:   da49        bge.n   d6 <blend+0xd6>
  42:   4610        mov r0, r2
  44:   4458        add r0, fp
  46:   f100 4000   add.w   r0, r0, #2147483648 ; 0x80000000
  4a:   9a04        ldr r2, [sp, #16]
  4c:   f8dd a014   ldr.w   sl, [sp, #20]
  50:   3801        subs    r0, #1
  52:   eb02 0040   add.w   r0, r2, r0, lsl #1
  56:   44a2        add sl, r4
  58:   f834 1b02   ldrh.w  r1, [r4], #2
  5c:   8842        ldrh    r2, [r0, #2]
  5e:   f3c1 07c4   ubfx    r7, r1, #3, #5
  62:   f3c2 09c4   ubfx    r9, r2, #3, #5
  66:   f001 0c07   and.w   ip, r1, #7
  6a:   f3c1 2804   ubfx    r8, r1, #8, #5
  6e:   fb07 f705   mul.w   r7, r7, r5
  72:   0b49        lsrs    r1, r1, #13
  74:   fb06 7709   mla r7, r6, r9, r7
  78:   ea41 01cc   orr.w   r1, r1, ip, lsl #3
  7c:   f3c2 2904   ubfx    r9, r2, #8, #5
  80:   f002 0c07   and.w   ip, r2, #7
  84:   fb08 f805   mul.w   r8, r8, r5
  88:   0b52        lsrs    r2, r2, #13
  8a:   fb01 f105   mul.w   r1, r1, r5
  8e:   097f        lsrs    r7, r7, #5
  90:   fb06 8809   mla r8, r6, r9, r8
  94:   ea42 02cc   orr.w   r2, r2, ip, lsl #3
  98:   fb06 1202   mla r2, r6, r2, r1
  9c:   f007 07f8   and.w   r7, r7, #248    ; 0xf8
  a0:   f408 58f8   and.w   r8, r8, #7936   ; 0x1f00
  a4:   0a12        lsrs    r2, r2, #8
  a6:   ea48 0107   orr.w   r1, r8, r7
  aa:   ea41 3142   orr.w   r1, r1, r2, lsl #13
  ae:   f3c2 02c2   ubfx    r2, r2, #3, #3
  b2:   430a        orrs    r2, r1
  b4:   4554        cmp r4, sl
  b6:   f820 2f02   strh.w  r2, [r0, #2]!
  ba:   d1cd        bne.n   58 <blend+0x58>
  bc:   9902        ldr r1, [sp, #8]
  be:   448b        add fp, r1
  c0:   9901        ldr r1, [sp, #4]
  c2:   3301        adds    r3, #1
  c4:   428b        cmp r3, r1
  c6:   fa1f fb8b   uxth.w  fp, fp
  ca:   d006        beq.n   da <blend+0xda>
  cc:   9a00        ldr r2, [sp, #0]
  ce:   9903        ldr r1, [sp, #12]
  d0:   428a        cmp r2, r1
  d2:   4654        mov r4, sl
  d4:   dbb5        blt.n   42 <blend+0x42>
  d6:   46a2        mov sl, r4
  d8:   e7f0        b.n bc <blend+0xbc>
  da:   b006        add sp, #24
  dc:   e8bd 0ff0   ldmia.w sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
  e0:   4770        bx  lr
  e2:   bf00        nop

SIMD disassembly:

    Disassembly of section .text.blend:

00000000 <blend>:
   0:   e92d 0ff0   stmdb   sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
   4:   6846        ldr r6, [r0, #4]
   6:   68c4        ldr r4, [r0, #12]
   8:   b086        sub sp, #24
   a:   199e        adds    r6, r3, r6
   c:   9601        str r6, [sp, #4]
   e:   9200        str r2, [sp, #0]
  10:   68ca        ldr r2, [r1, #12]
  12:   f89d 5038   ldrb.w  r5, [sp, #56]   ; 0x38
  16:   9204        str r2, [sp, #16]
  18:   9a01        ldr r2, [sp, #4]
  1a:   f5c5 7680   rsb r6, r5, #256    ; 0x100
  1e:   4293        cmp r3, r2
  20:   ea46 4505   orr.w   r5, r6, r5, lsl #16
  24:   da5d        bge.n   e2 <blend+0xe2>
  26:   8809        ldrh    r1, [r1, #0]
  28:   6802        ldr r2, [r0, #0]
  2a:   9102        str r1, [sp, #8]
  2c:   fb03 fb01   mul.w   fp, r3, r1
  30:   9900        ldr r1, [sp, #0]
  32:   4411        add r1, r2
  34:   9103        str r1, [sp, #12]
  36:   0052        lsls    r2, r2, #1
  38:   9205        str r2, [sp, #20]
  3a:   9903        ldr r1, [sp, #12]
  3c:   9a00        ldr r2, [sp, #0]
  3e:   428a        cmp r2, r1
  40:   fa1f fb8b   uxth.w  fp, fp
  44:   da4b        bge.n   de <blend+0xde>
  46:   4610        mov r0, r2
  48:   4458        add r0, fp
  4a:   f100 4000   add.w   r0, r0, #2147483648 ; 0x80000000
  4e:   9a04        ldr r2, [sp, #16]
  50:   f8dd a014   ldr.w   sl, [sp, #20]
  54:   3801        subs    r0, #1
  56:   eb02 0040   add.w   r0, r2, r0, lsl #1
  5a:   44a2        add sl, r4
  5c:   f834 2b02   ldrh.w  r2, [r4], #2
  60:   8841        ldrh    r1, [r0, #2]
  62:   f3c2 07c4   ubfx    r7, r2, #3, #5
  66:   f3c1 06c4   ubfx    r6, r1, #3, #5
  6a:   ea46 4707   orr.w   r7, r6, r7, lsl #16
  6e:   fb25 f707   smuad   r7, r5, r7
  72:   f001 0907   and.w   r9, r1, #7
  76:   ea4f 3c51   mov.w   ip, r1, lsr #13
  7a:   f002 0607   and.w   r6, r2, #7
  7e:   ea4f 3852   mov.w   r8, r2, lsr #13
  82:   ea4c 0cc9   orr.w   ip, ip, r9, lsl #3
  86:   ea48 06c6   orr.w   r6, r8, r6, lsl #3
  8a:   ea4c 4606   orr.w   r6, ip, r6, lsl #16
  8e:   fb25 f606   smuad   r6, r5, r6
  92:   f3c1 2104   ubfx    r1, r1, #8, #5
  96:   f3c2 2204   ubfx    r2, r2, #8, #5
  9a:   ea41 4202   orr.w   r2, r1, r2, lsl #16
  9e:   fb25 f202   smuad   r2, r5, r2
  a2:   f3c6 260f   ubfx    r6, r6, #8, #16
  a6:   097f        lsrs    r7, r7, #5
  a8:   f3c6 01c2   ubfx    r1, r6, #3, #3
  ac:   f007 07f8   and.w   r7, r7, #248    ; 0xf8
  b0:   430f        orrs    r7, r1
  b2:   f402 52f8   and.w   r2, r2, #7936   ; 0x1f00
  b6:   ea47 3646   orr.w   r6, r7, r6, lsl #13
  ba:   4316        orrs    r6, r2
  bc:   4554        cmp r4, sl
  be:   f820 6f02   strh.w  r6, [r0, #2]!
  c2:   d1cb        bne.n   5c <blend+0x5c>
  c4:   9902        ldr r1, [sp, #8]
  c6:   448b        add fp, r1
  c8:   9901        ldr r1, [sp, #4]
  ca:   3301        adds    r3, #1
  cc:   428b        cmp r3, r1
  ce:   fa1f fb8b   uxth.w  fp, fp
  d2:   d006        beq.n   e2 <blend+0xe2>
  d4:   9a00        ldr r2, [sp, #0]
  d6:   9903        ldr r1, [sp, #12]
  d8:   428a        cmp r2, r1
  da:   4654        mov r4, sl
  dc:   dbb3        blt.n   46 <blend+0x46>
  de:   46a2        mov sl, r4
  e0:   e7f0        b.n c4 <blend+0xc4>
  e2:   b006        add sp, #24
  e4:   e8bd 0ff0   ldmia.w sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
  e8:   4770        bx  lr
  ea:   bf00        nop
iabdalkader
  • 17,009
  • 4
  • 47
  • 74

3 Answers

3

If you want to truly optimize a function, you should check the assembler output of the compiler. You'll then learn how it is transforming your code, and then you can learn how to write code to help the compiler produce better output, or write the necessary assembler.

One of the easy wins you'd hit on quickly on your alpha blending loop is that division is slow.

Instead of x / 100, use

x * 65536 / 65536 / 100

-> x * (65536 / 100) / 65536

-> x * 655.36 >> 16

-> x * 656 >> 16

An even better alternative would be to use alpha values between 0 -> 256 so that you can just bitshift the result without even needing to do this trick.

One reason why smuad might not be giving any benefit is that you're having to move data into a format specifically for this command.

I'm not sure whether you'll be able to do better in general, but I thought I'd point out a way to avoid the division in your sample routine. Also, if you inspect the assembly, you may discover that there is code generation that you don't expect that can be eliminated.

jtlim
  • 3,861
  • 1
  • 16
  • 14
  • Thank you for the hint, I will use 256, but this function is just an example, my goal is to find out why almost all SIMD instructions I've tried so far are slower, and whether I'm doing something wrong or not. – iabdalkader Aug 21 '14 at 16:55
  • 1
    I'd recommend most of all to inspect the assembly.. then you know what the CPU is being told to do.. and often, you might find the compiler isn't as smart as you'd hope :) – jtlim Aug 21 '14 at 16:57
  • The problem with the loop above is the cost of the multiplies is almost certainly being dwarfed by the cost of the divisions, so any improvements you make will not be obviously apparent. – jtlim Aug 21 '14 at 16:58
  • replaced the div with >>8 it didn't make much difference, saved a few us in the scalar version ... – iabdalkader Aug 21 '14 at 17:15
  • I'm surprised it didn't make much difference (What % difference is a few us?); but once again, the assembler code is the key. – jtlim Aug 21 '14 at 17:18
  • about 100-200 us, disassembly added. – iabdalkader Aug 21 '14 at 17:20
  • same for the SIMD version, I guess this was expected because they both used division before. – iabdalkader Aug 21 '14 at 17:22
  • Good enough compilers are much better than programmers at doing static analysis of such arithmetic. So it is not good advice to make code less readable. – auselen Aug 21 '14 at 17:48
  • It is good advice to ask to inspect the assembly for such small code. – auselen Aug 21 '14 at 17:48
3

Community Wiki Answer

The major change to accommodate SIMD type instruction is to transform the loading. The SMUAD instruction can be looked at as a 'C' instruction like,

/* a,b are register vectors/arrays of 16 bits */
SMUAD = a[0] * b[0] + a[1] * b[1];

It is very easy to transform these. Instead of,

    u16 *srcp, dstp;
    uint spix = *srcp++;
    uint dpix = dstp[i+x];

Use the full bus and get 32bits at a time,

    uint *srcp, *dstp;  /* These are two 16 bit values. */
    uint spix = *srcp++;
    uint dpix = dstp[i+x];
    /* scale `dpix` and `spix` by alpha */
    spix /= (alpha << 16 | alpha);     /* Precompute, reduce strength, etc. */
    dpix /= (1-alpha << 16 | 1-alpha); /* Precompute, reduce strength, etc. */
    /* Hint, you can use SMUAD here? or maybe not.  
       You could if you scale at the same time*/

It looks like SMUL is a good fit for the alpha scaling; you don't want to add the two halves.

Now, spix and dpix contain two pixels. The vr synthetic is not needed. You may do two operations at one time.

    uint rb = (dpix + spix) & ~GMASK;  /* GMASK is x6xx6x bits. */
    uint g = (dpix + spix) & GMASK;
    /* Maybe you don't care about overflow?  
       A dual 16bit add helps, if the M4 has it? */

    dstp[i+x]= rb | g;  /* write 32bits or two pixels at a time. */

Mainly just making better use of the BUS by loading 32bits at a time will definitely speed up your routine. Standard 32 bit integer math may work most of the time if you are careful of the ranges and don't overflow the lower 16bit value in to the upper one.

For blitter code, Bit blog and Bit hacks are useful for extraction and manipulation of RGB565 values; whether SIMD or straight Thumb2 code.

Mainly, it is never a simple re-compile to use SIMD. It can be weeks of work to transform an algorithm. If done properly, SIMD speed ups are significant when the algorithm is not memory bandwidth bound and don't involve many conditionals.

artless noise
  • 21,212
  • 6
  • 68
  • 105
2

Now with the disassembly posted: You'll see that both the scalar and the SIMD version have 29 instructions, and the SIMD version actually takes more code space. (Scalar is 0x58 -> 0xba for the inner loop, vs SIMD which is 0x5c -> 0xc2.)

You can see a lot of instructions are used getting the data into the right format for both loops.. maybe you can improve the performance more by working on the RGB bit unpacking/repacking rather than the alpha blend calculation!

Edit: You may also want to consider processing pairs of pixels at a time.

jtlim
  • 3,861
  • 1
  • 16
  • 14
  • but unpacking/packing takes place in both versions, so it cancels out, with SIMDs I expected some improvement, or at least the same performance, each instruction replaces 2 multiplications and an addition. Also tried unrolling as mentioned. – iabdalkader Aug 21 '14 at 17:39
  • The inner loop has the same number of instructions, so the 2 multiplications and an addition are replaced by *more bit moving and* a smuad instruction. If you can avoid/solve the bit moving, then you might get benefit from the SIMD. – jtlim Aug 21 '14 at 17:42
  • I see what you mean now, and inspecting the assembly did help a lot, I'm torn between this answer and the other one, they both helped me a lot, but your comments made it clear, so accepting yours and upvoting everyone :) – iabdalkader Aug 21 '14 at 17:57
  • I'm not sure what your RGB macros are, but I tried compiling the code with my own RGB macros, and the inner loop was just 21 instructions for ARM – jtlim Aug 21 '14 at 18:02