ultra small float values on stm32f4

Question

Here is some simple code for stm32f4:

void main(void)
{
    float sum = 1.0;
    uint32_t cnt = 0;

    while(1)
    {
        for( cnt = 0; cnt < 1000; cnt++ )
            sum += 2.0e-08;

        printfUsart("%f\r\n", 
                            sum
                            );
    }
}

The value of the variable sum never changes. If I add a larger value in the loop instead, sum += 2.0e-07;, it does increase. I use the "gcc-arm-none-eabi-4_9-2014q4" compiler with these compile and linker flags:

PROCESSOR = -mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16

So, how do I work with ultra small float values? I need this to implement Matlab-generated code in stm32f4 firmware for some filtering functions.

it's [`int main(void)`, not `void main()`](http://stackoverflow.com/questions/204476/what-should-main-return-in-c-and-c) — phuclv, Mar 05 '15 at 17:14
I think void main() is okay, because we have an embedded micro-controller. There is no need of returning a value, because main has no calling thread or OS. — Lui, Mar 06 '15 at 08:31
With floats, definitely use Kahan sum, but also consider multiplying everything by 10^6 or 10^8 during (and before) the loop as well...and then dividing afterwards. That's typically how similar issues are resolved in fixed point. — bunkerdive, Mar 07 '15 at 12:20

Pascal Cuoq · Answer 1 · 2015-03-05T17:28:54.690

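The IEEE 754 binary32 floating-point format has 24 bits of precision, which amounts to approximately 7 decimal digits (the correspondence is not exact because binary is not decimal).

This is not enough to distinguish 1 and 1.00000002. The binary32 value immediately above 1.0f is exactly 1.00000011920928955078125. An increment of 2.0e-08 is less than half of that gap, so every addition rounds back to 1.0f, whereas 2.0e-07 is large enough to reach the next representable value.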
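To see the rounding on a PC before touching the target, a minimal sketch along these lines may help. The helper names `gap_above_one` and `is_absorbed_at_one` are made up for illustration, and it assumes `float` is IEEE 754 binary32 on the machine you compile for:

```c
#include <float.h>
#include <stdbool.h>

/* Gap between 1.0f and the next representable float.
 * FLT_EPSILON is defined by the C standard as exactly this difference. */
float gap_above_one(void)
{
    return FLT_EPSILON;
}

/* Returns true if adding `increment` to 1.0f leaves the value unchanged.
 * The volatile store forces the result to be rounded to float. */
bool is_absorbed_at_one(float increment)
{
    volatile float sum = 1.0f;
    sum += increment;
    return sum == 1.0f;
}
```

With these, `is_absorbed_at_one(2.0e-08f)` should be true and `is_absorbed_at_one(2.0e-07f)` should be false, which matches what the question observes.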

Your available options are

to use the type double for the variable sum, assuming that double is mapped to IEEE 754 binary64 with its 53 bits of precision, or

to improve accuracy by using a better summation algorithm. The most famous is Kahan's:

void main(void)
{
  float sum = 1.0;
  uint32_t cnt = 0;
  float c = 0;

  while(1)
  {
    for( cnt = 0; cnt < 1000; cnt++ )
    {
        float y = 2.0e-08f - c;
        float t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }

    printfUsart("%f\r\n", sum);
  }
}
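The two options can be compared side by side on a host machine with a sketch like the following. The function names `sum_naive`, `sum_double` and `sum_kahan` are hypothetical, and the code assumes it is built without `-ffast-math` (which may let the compiler simplify the compensation term away) and that `float` arithmetic is evaluated in binary32:

```c
#include <stddef.h>

/* Plain float accumulation: each addition is rounded to binary32. */
float sum_naive(float start, float increment, size_t n)
{
    float sum = start;
    for (size_t i = 0; i < n; i++)
        sum += increment;
    return sum;
}

/* Option 1: accumulate in double (assumed to be binary64). */
double sum_double(float start, float increment, size_t n)
{
    double sum = start;
    for (size_t i = 0; i < n; i++)
        sum += (double)increment;
    return sum;
}

/* Option 2: Kahan (compensated) summation in float. */
float sum_kahan(float start, float increment, size_t n)
{
    float sum = start;
    float c = 0.0f;                 /* running compensation for lost low-order bits */
    for (size_t i = 0; i < n; i++) {
        float y = increment - c;    /* apply the correction from the previous step */
        float t = sum + y;          /* low-order bits of y may be lost here */
        c = (t - sum) - y;          /* recover what was lost (with opposite sign) */
        sum = t;
    }
    return sum;
}
```

Called with `(1.0f, 2.0e-08f, 1000)`, `sum_naive` stays at exactly 1.0f, while the other two should land close to 1.00002. One thing to keep in mind for the target: the `fpv4-sp-d16` FPU is single-precision only, so `double` arithmetic is done in software there and will be noticeably slower than the Kahan loop in `float`.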

score 0 · Answer 2 · answered Mar 06 '15 at 08:25

Another possibility is to optimize your filtering functions with respect to the coefficients used.

Here is a manually optimized version of your example:

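void main(void)
{
    float sum = 1.0f;
    uint32_t cnt = 0, temp = 0;

    while(1)
    {
        /* Count the increments exactly, as an integer number of 1e-08 units. */
        temp = 0;
        for( cnt = 0; cnt < 1000; cnt++ )
            temp += 2;
        /* Scale once and add the result to the float sum. */
        sum = sum + temp * 1.0e-08f;

        printfUsart("%f\r\n", 
                            sum
                            );
    }
}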

A disadvantage is that you have to do this optimization manually and it is not generic, but it can save a lot of computing time because there are fewer floating-point operations.
