
I am doing C programming on ARM where both memory footprint and speed are tightly constrained. I am using the GSL-2.1 library, which implements almost all of its functions in double, but my hardware has no floating-point unit, so all floating-point work is done in software. This inflates the code size and slows down execution. My processor has 180 KB of SRAM and 1 MB of flash. I want to improve both speed and memory footprint, so I looked into the IDE compiler settings and found the following options.

[screenshot: IDE compiler optimization settings]

I have read some threads about GCC optimization levels, but several of these settings are still not clear to me. Can you please explain each setting with respect to GCC for ARM Cortex-M processors?

Update: I randomly checked/unchecked some of the boxes and did not see any difference in code size.
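
(For reference, checkboxes like these usually map to standard GCC options; which options a particular IDE exposes is an assumption here, so the list below is only indicative of the size/speed-related flags involved.)

-Os                    /* optimize for size; -O2 or -O3 trade size for speed */
-ffunction-sections    /* place each function in its own linker section */
-fdata-sections        /* place each data object in its own linker section */
-Wl,--gc-sections      /* let the linker discard unreferenced sections */
-flto                  /* link-time (whole-program) optimization */
-mfloat-abi=soft       /* software floating point, for cores without an FPU */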

Rheatey Bash
  • What is the question? You should decide on the tradeoff you are ready to make between the speed and the size. You can't have both at the same time. – Eugene Sh. Jun 22 '16 at 19:23
  • Why is this tagged C++ when you are clearly just talking about C? – Jesper Juhl Jun 22 '16 at 19:30
  • There are some additional files which are written in c++. – Rheatey Bash Jun 22 '16 at 19:50
  • @Eugene it is not always a question of tradeoff. There must be some way that improves both – Rheatey Bash Jun 22 '16 at 19:53
  • @Stark Of course. But to some extent. You can't have both *maximally* optimized. You can have them *somewhat* optimized, this is why you have different optimization levels, otherwise it will just optimize everything even without asking you, as there wouldn't be any downsides. – Eugene Sh. Jun 22 '16 at 20:03
  • Take a `for` loop as an example. To conserve space the fewest number of statements are in the statement block. However, each iteration causes the processor to transfer execution which costs extra performance points. In the case of optimizing for speed, the `for` loop is *unrolled* so more statements are executed per iteration, reducing the number of execution transfers. – Thomas Matthews Jun 22 '16 at 20:05
  • There are some optimizations which can reduce size and improve speed. For example, replacing `pow(2.0, x)` with `1 << x`, will reduce the size of the executable (by not calling a function or not needing the function) and increase the execution speed (because there is only 1 efficient instruction to execute). – Thomas Matthews Jun 22 '16 at 20:07
  • You may be able to optimize more than the compiler can. For example, instead of floating point, consider using integers and performing [Fixed Point Math](https://en.wikipedia.org/wiki/Fixed-point_arithmetic). – Thomas Matthews Jun 22 '16 at 20:10
  • @Thomas Thanks, that will definitely help to gain some speed. I will modify the for loops. And can you tell me about the other options, such as placing functions and data items in their own sections? – Rheatey Bash Jun 22 '16 at 20:10
  • @Thomas of course fixed point is the best option, but as I have already described, I am using the GSL library, which is mostly written with float and double data types. For fixed point I would have to convert it. – Rheatey Bash Jun 22 '16 at 20:12
  • Research "Data Driven design" and "Data Cache Optimization". A simple trick is to declare stuff as "static const" so that the compiler can access the data directly rather than making a local copy. – Thomas Matthews Jun 22 '16 at 20:13
  • On one of my embedded systems, we used a Fixed Point denominator of 4096 (due to ADC counts). This was the "internal representation". The Fixed Point would be converted during output to User or input From User. Our ARM7 loved the integral data instructions. I also moved data into a structure for a better data cache line layout and that sped things up. – Thomas Matthews Jun 22 '16 at 20:16 (see the fixed-point sketch after these comments)
  • Other than the GSL library, I can convert all other functions to fixed point. Is there any way to convert GSL into a fixed-point library? – Rheatey Bash Jun 22 '16 at 20:17
  • I find myself many times battling the compiler on optimization tricks. Print the assembly language generated by the compiler. Measure with an oscilloscope. Many times you'll find that the compiler's optimal code is slightly more efficient than your implementation. A few times, I've noticed the compiler emitted code that doesn't work; so I simplify the high level language. – Thomas Matthews Jun 22 '16 at 20:18
  • If you are interested in speed, use a profiler to tell you where things are "slow". If code size is more of a concern, then you need to look at your .map file to see where all your space is being used and work from there. – Michael Dorgan Jun 22 '16 at 23:07
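
A minimal sketch of the fixed-point approach suggested by Thomas Matthews above; the Q16.16 format, type name and helpers are illustrative assumptions, not anything provided by GSL:

#include <stdint.h>

typedef int32_t fix16;                        /* Q16.16: 16 integer bits, 16 fraction bits */
#define FIX16_ONE ((fix16)1 << 16)

static inline fix16 fix16_from_int(int x)       { return (fix16)(x * 65536); }  /* valid for |x| < 32768 */
static inline fix16 fix16_mul(fix16 a, fix16 b) { return (fix16)(((int64_t)a * b) >> 16); }
static inline fix16 fix16_div(fix16 a, fix16 b) { return (fix16)(((int64_t)a * 65536) / b); }

/* Example: scale an ADC reading by 1.25 without any software floating point. */
fix16 scale_reading(int adc_counts)
{
    fix16 gain = FIX16_ONE + (FIX16_ONE >> 2);  /* 1.25 in Q16.16 */
    return fix16_mul(fix16_from_int(adc_counts), gain);
}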

1 Answer


When using gcc to write code for embedded systems, it is important to note that, unlike many embedded-systems compilers, gcc does not promise that storage written as one type can be read as another, nor that integer overflow behaves in an even somewhat predictable fashion. Code which relies upon such behaviors is apt to break when compiled with gcc unless it is compiled with the -fno-strict-aliasing and -fwrapv flags. For example, while the authors of the C Standard would have expected that the function

// Assume usmall is an unsigned value half the size of unsigned int
unsigned multiply(usmall x, usmall y) { return x*y; }

should be safe on two's-complement hardware platforms with silent wraparound on overflow, they didn't require compilers to implement it that way (I think they probably didn't expect that anyone writing a compiler for such a platform would be so obtuse as to do otherwise unless emulating some other platform). Because x and y are promoted to signed int before the multiplication, a large product is signed overflow, which is undefined behavior; when compiled with gcc, that function may therefore have unexpected side effects.
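
One common defence, sketched here under the same assumption about usmall (my addition, not part of the original answer), is to force the multiplication into unsigned arithmetic, where wraparound is well defined:

#include <stdint.h>

typedef uint16_t usmall;   /* assumed: half the width of unsigned int */

/* Converting to unsigned before multiplying keeps the arithmetic out of
 * signed-overflow territory, so gcc cannot treat overflow as impossible. */
unsigned multiply_safe(usmall x, usmall y)
{
    return (unsigned)x * (unsigned)y;
}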

Likewise, on many compilers, given e.g.

struct widget_header {uint16_t length; uint8_t type_id;};
struct acme_widget {uint16_t length; uint8_t type_id; uint8_t dat[5];};
struct beta_widget {uint16_t length; uint8_t type_id; uint32_t foo;};

a pointer to any of those types could be cast to a struct widget_header pointer; code could then look at the type_id field and cast to a more specific type. Such techniques will not always work with gcc, however; even if a union declaration containing all three types is in scope, gcc will assume that an access to a field of one of those types cannot possibly modify the corresponding field of any other.
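
One strict-aliasing-safe pattern (a sketch of mine, not from the answer) is to embed the header as the first member of each widget and reach the shared fields through it; note that this can change padding and layout compared with the flat structs above:

#include <stdint.h>

struct widget_header { uint16_t length; uint8_t type_id; };
struct acme_widget   { struct widget_header hdr; uint8_t  dat[5]; };
struct beta_widget   { struct widget_header hdr; uint32_t foo; };

/* C allows converting a pointer to a struct into a pointer to its first
 * member, so dispatching on the embedded header stays within the rules. */
uint8_t widget_type(const void *w)
{
    return ((const struct widget_header *)w)->type_id;
}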

A more concrete example to show how gcc treats aliasing:

    struct s1 { int x; };
    struct s2 { int x; };
    union { struct s1 v1; struct s2 v2; } u;

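    /* Read the int member through a "struct s1" view of p. */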
    static int peek(void *p)
    {
      struct s1 *p1 = (struct s1*)p;
      return *(int*)&p1->x;
    }

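    /* Write 5 to the int member through a "struct s2" view of p. */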
    static void setToFive(void *p)
    {
      struct s2 *p2 = (struct s2*)p;
      *(int*)(&p2->x) = 5;
    }

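    /* Under strict aliasing, gcc assumes the write in setToFive (via struct s2)
       cannot change the value peek reads (via struct s1). */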
    static int test1a(void *p, void *q)
    {
      struct s1 *p1 = (struct s1*)p;
      if (peek(p)!=23) return 0;
      setToFive(q);
      return peek(p);
    }

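    /* Both arguments point into the same union object u. */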
    int test1(void)
    {
      struct s2 v2 = {23};
      u.v2 = v2;
      return test1a(&u.v1, &u.v2);
    }

ARM gcc 4.8.2 generates

test1:
        movw    r3, #:lower16:u
        movt    r3, #:upper16:u
        movs    r2, #5
        movs    r0, #23
        str     r2, [r3]
        bx      lr

which stores a 5 into "u" and then returns 23: gcc assumes that the second call to peek will return the same value as the first one, notwithstanding all of the pointer typecasts, which should give a pretty clear indication to the compiler that something might possibly be aliased somewhere.
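
For completeness, here is a sketch (mine, reusing the declarations from the example above) of the workaround gcc itself documents: access the data through the union object, where type punning is explicitly allowed. Alternatively, building with -fno-strict-aliasing (and -fwrapv for the overflow case), as mentioned at the top of this answer, keeps the original code working.

/* Variant of test1 that reads and writes through the union members directly;
 * gcc then sees that both accesses refer to the same storage. */
int test2(void)
{
    struct s2 v2 = {23};
    u.v2 = v2;
    if (u.v1.x != 23) return 0;
    u.v2.x = 5;
    return u.v1.x;   /* re-read from memory: returns 5, not 23 */
}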

supercat