Optimizing ARM Cortex M3 code

Question

I have a C function that copies a framebuffer to FSMC RAM.

The function drags the frame rate of the game loop down to 10 FPS. How should I analyze the disassembled function? Should I count the cycles of each instruction? I want to know where the CPU spends its time. I'm sure the algorithm is also a problem, because it's O(N^2).

The C Function is:

void LCD_Flip()
{

    u8  i,j;


    LCD_SetCursor(0x00, 0x0000);
    LCD_WriteRegister(0x0050,0x00);//GRAM horizontal start position
    LCD_WriteRegister(0x0051,239);//GRAM horizontal end position
    LCD_WriteRegister(0x0052,0);//Vertical GRAM Start position
    LCD_WriteRegister(0x0053,319);//Vertical GRAM end position
    LCD_WriteIndex(0x0022);

    for(j=0;j<fbHeight;j++)
    {
        for(i=0;i<240;i++)
        {
            u16 color = frameBuffer[i+j*fbWidth];
            LCD_WriteData(color);

        }
    }

}

Disassembled function:

08000fd0 <LCD_Flip>:
 8000fd0:   b580        push    {r7, lr}
 8000fd2:   b082        sub sp, #8
 8000fd4:   af00        add r7, sp, #0
 8000fd6:   2000        movs    r0, #0
 8000fd8:   2100        movs    r1, #0
 8000fda:   f7ff fde9   bl  8000bb0 <LCD_SetCursor>
 8000fde:   2050        movs    r0, #80 ; 0x50
 8000fe0:   2100        movs    r1, #0
 8000fe2:   f7ff feb5   bl  8000d50 <LCD_WriteRegister>
 8000fe6:   2051        movs    r0, #81 ; 0x51
 8000fe8:   21ef        movs    r1, #239    ; 0xef
 8000fea:   f7ff feb1   bl  8000d50 <LCD_WriteRegister>
 8000fee:   2052        movs    r0, #82 ; 0x52
 8000ff0:   2100        movs    r1, #0
 8000ff2:   f7ff fead   bl  8000d50 <LCD_WriteRegister>
 8000ff6:   2053        movs    r0, #83 ; 0x53
 8000ff8:   f240 113f   movw    r1, #319    ; 0x13f
 8000ffc:   f7ff fea8   bl  8000d50 <LCD_WriteRegister>
 8001000:   2022        movs    r0, #34 ; 0x22
 8001002:   f7ff fe87   bl  8000d14 <LCD_WriteIndex>
 8001006:   2300        movs    r3, #0
 8001008:   71bb        strb    r3, [r7, #6]
 800100a:   e01b        b.n 8001044 <LCD_Flip+0x74>
 800100c:   2300        movs    r3, #0
 800100e:   71fb        strb    r3, [r7, #7]
 8001010:   e012        b.n 8001038 <LCD_Flip+0x68>
 8001012:   79f9        ldrb    r1, [r7, #7]
 8001014:   79ba        ldrb    r2, [r7, #6]
 8001016:   4613        mov r3, r2
 8001018:   011b        lsls    r3, r3, #4
 800101a:   1a9b        subs    r3, r3, r2
 800101c:   011b        lsls    r3, r3, #4
 800101e:   1a9b        subs    r3, r3, r2
 8001020:   18ca        adds    r2, r1, r3
 8001022:   4b0b        ldr r3, [pc, #44]   ; (8001050 <LCD_Flip+0x80>)
 8001024:   f833 3012   ldrh.w  r3, [r3, r2, lsl #1]
 8001028:   80bb        strh    r3, [r7, #4]
 800102a:   88bb        ldrh    r3, [r7, #4]
 800102c:   4618        mov r0, r3
 800102e:   f7ff fe7f   bl  8000d30 <LCD_WriteData>
 8001032:   79fb        ldrb    r3, [r7, #7]
 8001034:   3301        adds    r3, #1
 8001036:   71fb        strb    r3, [r7, #7]
 8001038:   79fb        ldrb    r3, [r7, #7]
 800103a:   2bef        cmp r3, #239    ; 0xef
 800103c:   d9e9        bls.n   8001012 <LCD_Flip+0x42>
 800103e:   79bb        ldrb    r3, [r7, #6]
 8001040:   3301        adds    r3, #1
 8001042:   71bb        strb    r3, [r7, #6]
 8001044:   79bb        ldrb    r3, [r7, #6]
 8001046:   2b63        cmp r3, #99 ; 0x63
 8001048:   d9e0        bls.n   800100c <LCD_Flip+0x3c>
 800104a:   3708        adds    r7, #8
 800104c:   46bd        mov sp, r7
 800104e:   bd80        pop {r7, pc}

Are you trying to copy it to RAM? The function looks like you are printing the buffer to LCD. — Étienne, Apr 23 '14 at 19:26
@Étienne Yea, actually that's what I'm doing through the FSMC controller. — andre_lamothe, Apr 23 '14 at 19:40
I sent you an e-mail. Can you not use DMA to speed up the copies then? — Étienne, Apr 23 '14 at 19:48
@Étienne The DMA is an option, but the problem is according to an STM32 LCD interface application note, the performance won't be that much. http://www.st.com/st-web-ui/static/active/en/resource/technical/document/application_note/CD00201397.pdf — andre_lamothe, Apr 23 '14 at 19:50

francek · Accepted Answer · 2019-07-02T08:06:00.587

This does not exactly answer your question, but I see you are aiming for fast execution of the loops.

Here are some tips from the book 'ARM System Developer's Guide: Designing and Optimizing System Software (The Morgan Kaufmann Series in Computer Architecture and Design)' http://www.amazon.com/ARM-System-Developers-Guide-Architecture/dp/1558608745

Chapter 5 contains a section named 'C looping structures'. Here is the summary of that section:

Writing Loops Efficiently

Use loops that count down to zero. Then the compiler does not need to allocate a register to hold the termination value, and the comparison with zero is free.
Use unsigned loop counters by default and the continuation condition i!=0 rather than i>0. This will ensure that the loop overhead is only two instructions.
Use do-while loops rather than for loops when you know the loop will iterate at least once. This saves the compiler checking to see if the loop count is zero.
Unroll important loops to reduce the loop overhead. Do not overunroll. If the loop overhead is small as a proportion of the total, then unrolling will increase code size and hurt the performance of the cache.
Try to arrange that the number of elements in arrays are multiples of four or eight. You can then unroll loops easily by two, four, or eight times without worrying about the leftover array elements.

Based on the summary, your inner loop might look as below.

unsigned int k = j * fbWidth;  // index of the first pixel of row j
unsigned int n = 240/4;        // Use unsigned loop counters by default
                               // and the continuation condition n!=0

do
{
    // Unroll important loops to reduce the loop overhead
    LCD_WriteData( (u16)frameBuffer[k++] );
    LCD_WriteData( (u16)frameBuffer[k++] );
    LCD_WriteData( (u16)frameBuffer[k++] );
    LCD_WriteData( (u16)frameBuffer[k++] );
}
while ( --n != 0 );  // Use do-while loops rather than for
                     // loops when you know the loop will
                     // iterate at least once
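The book's last tip mentions leftover elements when the count is not a multiple of the unroll factor. A minimal sketch of one way to handle them, assuming a plain memory destination instead of the LCD (the name copy_unrolled4 is hypothetical, not from the book or the question):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n 16-bit values, four per iteration, then finish the remainder.
 * Hypothetical helper for illustration only. */
void copy_unrolled4(uint16_t *dst, const uint16_t *src, size_t n)
{
    size_t blocks = n / 4;  /* full groups of four */
    size_t tail   = n % 4;  /* 0..3 leftover elements */

    while (blocks != 0) {   /* count down to zero */
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;
        src += 4;
        --blocks;
    }
    while (tail != 0) {     /* leftover elements, one at a time */
        *dst++ = *src++;
        --tail;
    }
}
```

With a row width of 240 the tail loop never runs, which is why the book suggests keeping array sizes at multiples of four or eight.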

You might also want to experiment with pragmas, e.g.:

#pragma Otime

http://www.keil.com/support/man/docs/armcc/armcc_chr1359124989673.htm

#pragma unroll(n)

http://www.keil.com/support/man/docs/armcc/armcc_chr1359124992247.htm

And since it is a Cortex-M3, check whether the MCU hardware lets you arrange code and data to take advantage of its Harvard architecture (I saw a 30% speed increase).

see here my other answer

Not everything may be applicable in your application (filling a buffer in reverse order, for example). I just wanted to draw your attention to the book and to possible points for optimization.

Do ARM hardware prefetchers work just as well when you traverse arrays in descending order? On some x86 uarches, looping upwards in memory (starting with lower addresses) can be somewhat faster. You can still count towards zero by using `for (i=-size ; i != 0 ; ++i) { sum += arr[size + i]; }` (i.e. index from the end of the array with negative indices that count up towards zero.) — Peter Cordes, Sep 27 '17 at 22:28
I don't know if ARM is faster in descending order, but it's an interesting point. (btw isn't your loop traversing the array in ascending order?) — francek, Nov 03 '17 at 22:09
In my loop, `i` increases, so the address of `arr[size+i]` goes in ascending order. I'm not sure I understand your question. — Peter Cordes, Nov 03 '17 at 22:18
Yes, exactly. `i` indexes relative to `(arr+size)`, which is the end of the array. So in asm, you'd have `arr+size` in a register and a counter in another register counting upward towards zero, using a 2-register indexed addressing mode. I guess in C I should have written `(arr+size)[i]`. — Peter Cordes, Nov 03 '17 at 22:31
When the loop is executed for the 1st time `i = -size` and the array is indexed from its beginning : `arr[size + (-size)] = arr[0]`.I thought you meant this loop `for (i = 1 ; i == size; ++i) { sum += arr[size - i]; }`. — francek, Nov 03 '17 at 22:54
Yes, exactly. You reach the beginning of the array by indexing from the end. So you loop with `adds r0, #4` / `bnz` (if I have the ARM syntax / mnemonics right), without needing a `cmp` instruction. If complex addressing modes have any downsides (like they do on x86), it can be better to increment a pointer and compare against a pointer-to-the-end. (like C++ iterator style, from `.begin()` to `.end()`) — Peter Cordes, Nov 03 '17 at 23:03
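The idiom discussed in these comments, written out as a small sketch: the index is negative and counts up to zero, while the addresses still ascend. The function name sum_ascending and the element type are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum an array by indexing from its end with a negative index that
 * counts up towards zero. Memory is still read in ascending order. */
uint32_t sum_ascending(const uint16_t *arr, size_t size)
{
    const uint16_t *end = arr + size;  /* one past the last element */
    uint32_t sum = 0;

    for (ptrdiff_t i = -(ptrdiff_t)size; i != 0; ++i) {
        sum += end[i];                 /* end[-size] is arr[0] */
    }
    return sum;
}
```

Whether a compiler turns this into a loop without a separate compare instruction depends on the target and the optimization level, so it is worth checking the disassembly.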

score 3 · Answer 2 · answered Apr 23 '14 at 19:36

You should start by compiling the C code with speed optimizations enabled. The disassembled code you provide appears to be storing the i and j counters on the stack, which adds 3 load/store operations to the inner loop. You might also want to inline LCD_WriteData in the inner loop.

On the other hand, if you are really writing to the LCD in the inner loop then the performance may be limited by that interface.

The maximum update rate at which I can write to the LCD using the FSMC memory controller is 30Hz. I'm interested in knowing how many cycles that function takes. Should I count the instruction cycles manually? – andre_lamothe Apr 23 '14 at 19:42
Counting instructions is a start, but you have to be aware that the number of clocks needed to execute a branch will vary on how long it takes to refill the pipeline. Also, loads and stores have some quirks that can affect timing. If `LCD_WriteData` does any pushes or pops you will need to consider the number of registers pushed or popped. – Apr 23 '14 at 19:50
so the nested loop which is O(N^2) from your point of view, doesn't eat that much from the CPU timing? – andre_lamothe Apr 23 '14 at 19:54
Of course it does. I thought your question was about how to _measure_ or _estimate_ the delay, and counting instruction cycles is one way to do that. If you just want to make the code faster, concentrate on the inner loop. Start by letting the compiler do as much as it can and then work from there. – Apr 23 '14 at 20:10
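A rough sketch of the estimate these comments describe: multiply the pixel count by an assumed cost per inner-loop iteration. The disassembly compares against 239 and 99, so one frame is 240 x 100 pixels; the cycles-per-pixel figure and the clock frequency below are placeholders, not measured values, and FSMC wait states are not included.

```c
#include <stdint.h>

/* Estimated time for one frame in microseconds.
 * cycles_per_pixel and cpu_hz are assumptions supplied by the caller. */
uint64_t estimate_frame_us(uint32_t pixels, uint32_t cycles_per_pixel,
                           uint32_t cpu_hz)
{
    uint64_t cycles = (uint64_t)pixels * cycles_per_pixel;
    return cycles * 1000000u / cpu_hz;
}
```

For example, estimate_frame_us(240 * 100, 40, 72000000) models a hypothetical 40 cycles per pixel on a 72 MHz core. Comparing such an estimate with the observed frame time gives a hint whether the loop or the LCD interface dominates.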

score 1 · Answer 3 · answered May 02 '14 at 18:57

Just to purely reduce the number of looped operations, you could do something like so. I did make some assumptions which may not be accurate: You had a loop that went from i=0:239, and I am assuming that fbWidth is the same as 240. If this isn't true then the loop would have to be more complicated.

void LCD_Flip()
{
    u32 i, limit = (u32)fbHeight * fbWidth;  // total pixel count; u32 avoids overflow for larger buffers
    // We will use a precalculated limit and one single loop

    LCD_SetCursor(0x00, 0x0000);
    LCD_WriteRegister(0x0050,0x00);//GRAM horizontal start position
    LCD_WriteRegister(0x0051,239);//GRAM horizontal end position
    LCD_WriteRegister(0x0052,0);//Vertical GRAM Start position
    LCD_WriteRegister(0x0053,319);//Vertical GRAM end position
    LCD_WriteIndex(0x0022);

    // Single loop from 0:limit-1 takes care of having to do an
    // x,y conversion each iteration.
    for(i=0;i<limit;i++)
    {
        u16 color = frameBuffer[i];
        LCD_WriteData(color);
    }
}

This strips out the two loops in favor of a single for loop with only one conditional test per iteration. On top of that, the indexing into frameBuffer is now linear, so we don't need to multiply out the width to go from x,y to linear storage. Your loop iterations won't have been reduced (i.e. it is still O(N) with N = height*width), but the number of instructions should have been reduced.

As @Joe Hass noted in his answer, this may not actually help at all if you are really limited by the LCD interface. Depending on which STM32 you're using, the FSMC may not be particularly fast, and I can't imagine the LCD controller would be very fast either.
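If fbWidth is not 240, the single loop above no longer applies. One possible variant, sketched under assumptions: the LCD data register is passed in as a pointer (on the target it would be the FSMC-mapped address), and the names blit_rows, lcd_data and stride are hypothetical. The multiply happens once per row instead of once per pixel.

```c
#include <stddef.h>
#include <stdint.h>

/* Write `height` rows of `width` pixels to a data register, where
 * consecutive rows in the buffer are `stride` elements apart.
 * Returns the number of pixels written. */
unsigned blit_rows(volatile uint16_t *lcd_data, const uint16_t *fb,
                   unsigned width, unsigned height, unsigned stride)
{
    unsigned written = 0;

    for (unsigned row = 0; row < height; ++row) {
        const uint16_t *p = fb + (size_t)row * stride;  /* start of row */
        for (unsigned n = width; n != 0; --n) {
            *lcd_data = *p++;  /* volatile store, one per pixel */
            ++written;
        }
    }
    return written;
}
```

This only trims loop overhead; if the FSMC timing is the bottleneck, the frame rate will not change much.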

I have converted the complexity to be linear, but I'm still getting a very slow FPS when writing directly to the FSMC. I'm not sure whether it's a clock problem or something else. — andre_lamothe, May 04 '14 at 13:39
I wouldn't expect a huge increase from what I posted. I'd check your FSMC configuration and see if it can be clocked faster. I've never used it to drive an LCD, so I'm not certain how fast the controllers tend to go. I don't know if it helps, but there are some STM32 variants that have an LCD controller built in, rather than using the FSMC. — rjp, May 05 '14 at 13:32

score 0 · Answer 4 · answered Jun 15 '22 at 21:33

I think it is possible to use the FSMC together with DMA. Alternatively, drop the LCD_WriteData() call and access the LCD data register directly.

To write data to the LCD: *(__IO unsigned char *)(LCD_DAT_ADDR)=d;

To read data from LCD: d=*(__IO unsigned char *)(LCD_DAT_ADDR);

Kauno Medis
