Bit shifting for fixed point arithmetic on float numbers in C

Question

I wrote the following test code to check fixed point arithmetic and bit shifting.

#include <stdio.h>

int main(void){
    float x = 2;
    float y = 3;
    float z = 1;
    unsigned int * px = (unsigned int *) (& x);
    unsigned int * py = (unsigned int *) (& y);
    unsigned int * pz = (unsigned int *) (& z);
    *px <<= 1;
    *py <<= 1;
    *pz <<= 1;
    *pz =*px + *py;
    *px >>= 1;
    *py >>= 1;
    *pz >>= 1;
    printf("%f %f %f\n",x,y,z);
    return 0;
}

The result is 2.000000 3.000000 0.000000

Why is the last number 0? I was expecting to see 5.000000.

I want to use some kind of fixed point arithmetic to avoid floating point numbers in an image processing application. What is the best/easiest/most efficient way to turn my floating point arrays into integers? Is the above "tricking the compiler" a robust workaround? Any suggestions?

When you shifted *px and the others left by one bit, you shifted out only the sign bit, not the exponent. See the [bit format of an IEEE float](http://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Float_example.svg/500px-Float_example.svg.png) – osgx, May 30 '12 at 15:52

osgx · Answer 1 · 2012-05-30T15:43:18.523

3

If you want to use fixed point, don't use the types 'float' or 'double', because they have internal structure. Floats and doubles have a specific bit for the sign, some bits for the exponent and some for the mantissa (take a look at the color image here), so they are inherently floating point.

You should either program fixed point by hand storing data in integer type, or use some fixed-point library (or language extension).

There is a description of the fixed-point extensions implemented in GCC: http://gcc.gnu.org/onlinedocs/gcc/Fixed_002dPoint.html

There is also a macro-based manual implementation of fixed point for C: http://www.eetimes.com/discussion/other/4024639/Fixed-point-math-in-C

edited May 30 '12 at 15:43

answered May 30 '12 at 15:36

osgx

Unfortunately this is not an option. I have an application that runs on an ARM processor using floats, and I have to send things to a DSP for processing. The DSP doesn't have a floating point unit, so before sending the data I have to turn them into fixed point. No floating point extensions are ported. – user1410966 May 30 '12 at 15:45
@user1410966, you can do the floating point calculation on the ARM, but before you send the data to the DSP you should _manually_ convert floating point into fixed point. Which fixed-point format can be used on the DSP? – osgx May 30 '12 at 15:47
There is a variant of converting: [link](http://stackoverflow.com/a/187823/196561) quote: "`double f = 1.2345; int n; n=(int)(f*65536);`" (if 16:16 fixed point format is needed). – osgx May 30 '12 at 15:49
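The quoted one-liner can be expanded into a small helper. The following is only a sketch, assuming the DSP accepts signed 16:16 values in a 32-bit integer; the names `q16_t`, `float_to_q16` and `q16_to_float` are made up for illustration. Unlike the plain cast, it rounds to nearest and clamps values that would not fit.

```c
#include <stdint.h>

#define Q16_FRAC_BITS 16
#define Q16_ONE (1 << Q16_FRAC_BITS)   /* 65536 represents 1.0 */

typedef int32_t q16_t;                 /* hypothetical name for a 16:16 value */

/* Convert a float to 16:16 fixed point.
   Rounds to nearest, clamps out-of-range input, maps NaN to 0. */
q16_t float_to_q16(float f)
{
    double scaled;

    if (f != f)                        /* NaN compares unequal to itself */
        return 0;

    scaled = (double)f * Q16_ONE;
    if (scaled >= (double)INT32_MAX)
        return INT32_MAX;
    if (scaled <= (double)INT32_MIN)
        return INT32_MIN;

    /* add or subtract one half so that the truncating cast rounds */
    return (q16_t)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
}

/* Convert a 16:16 fixed-point value back to float. */
float q16_to_float(q16_t q)
{
    return (float)q / Q16_ONE;
}
```

With these helpers, `float_to_q16(1.2345f)` yields 80904 and `q16_to_float(98304)` yields 1.5. An array of floats would be converted element by element on the ARM before it is sent to the DSP.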

score 2 · Answer 2 · answered May 30 '12 at 15:58

What you are doing is cruelty to the numbers.

First, you assign values to float variables. How they are stored is system dependent, but normally the IEEE 754 format is used. So your variables internally look like

x = 2.0 = 1 * 2^1   : sign = 0, mantissa = 1,   exponent = 1 -> 0 10000000 00000000000000000000000 = 0x40000000
y = 3.0 = 1.5 * 2^1 : sign = 0, mantissa = 1.5, exponent = 1 -> 0 10000000 10000000000000000000000 = 0x40400000
z = 1.0 = 1 * 2^0   : sign = 0, mantissa = 1,   exponent = 0 -> 0 01111111 00000000000000000000000 = 0x3F800000

If you do bit shifting operations on these numbers, you mix up the borders between sign, exponent and mantissa, and so anything can, may and will happen.

In your case:

your 2.0 becomes 0x80000000, resulting in -0.0,
your 3.0 becomes 0x80800000, resulting in -1.1754943508222875e-38,
your 1.0 becomes 0x7F000000, resulting in 1.7014118346046923e+38.

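The latter you lose by overwriting z with *px + *py. Because px and py point to unsigned int, this is an integer addition of the bit patterns, not a float addition: with a 32-bit unsigned int, 0x80000000 + 0x80800000 does not fit and wraps around to 0x00800000. After >>ing it by 1 you get 0x00400000, a denormal number of about 5.9e-39, which is printed as 0.000000. x and y are simply shifted back to 0x40000000 and 0x40400000, so they print as 2.0 and 3.0 again.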

What stays is that you cannot do bit shifting on floats and expect a reliable result.
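To check the bit patterns yourself, something like the following sketch may help. It assumes a 32-bit IEEE 754 float; the helper names `float_bits`, `bits_to_float` and `question_z` are invented here. It uses memcpy instead of the pointer casts from the question, because reading a float through an unsigned int pointer is not something the C standard guarantees to work.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the bytes of a float into a 32-bit integer. */
uint32_t float_bits(float f)
{
    uint32_t u;
    assert(sizeof f == sizeof u);   /* assumption: float is 32 bits wide */
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Copy a 32-bit pattern back into a float. */
float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Repeat the steps the question applies to z, on the raw bit patterns. */
float question_z(float x, float y)
{
    uint32_t bx = float_bits(x) << 1;   /* 2.0f: 0x40000000 -> 0x80000000 */
    uint32_t by = float_bits(y) << 1;   /* 3.0f: 0x40400000 -> 0x80800000 */
    uint32_t bz = bx + by;              /* integer add, wraps to 0x00800000 */

    bz >>= 1;                           /* 0x00400000, a denormal */
    return bits_to_float(bz);
}
```

For the inputs 2.0f and 3.0f, `question_z` returns the pattern 0x00400000, which is far too small to show up with "%f".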

I would consider converting them to integer or other fixed-point on the ARM and sending them over the line as they are.

Perfect! That's exactly the information I needed to start with. Which is the safest way to turn them into integers while losing as little precision as possible? – user1410966, May 30 '12 at 16:09
Take the largest number you can think of (e.g. 10) and go to the next greater power of 2 (16). Then pick an integer type (e.g. uint16) and make the value 16 equivalent to 65536 in the "integer space" by applying the factor 4096. This way, for 1 you get 4096, for .25 you get 1024, and for any other value with uncertain precision you get an odd-looking number. Be aware that .1, e.g., is very odd when represented as a float, thus it will give you an odd-looking number in integer space as well. – glglgl, May 30 '12 at 16:22
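The scaling described in this comment could look roughly like the sketch below. It assumes non-negative input values below 16 and a uint16_t target, as in the comment; the names `SCALE`, `to_fixed_u16` and `from_fixed_u16` are illustrative only. Values outside the range are clamped rather than allowed to wrap.

```c
#include <stdint.h>

#define SCALE 4096u   /* 65536 / 16: maps the range [0, 16) onto 0..65535 */

/* Convert a float in [0, 16) to a 16-bit fixed-point value. */
uint16_t to_fixed_u16(float v)
{
    float scaled;

    if (!(v > 0.0f))              /* negative values, zero and NaN map to 0 */
        return 0;

    scaled = v * SCALE + 0.5f;    /* + 0.5 rounds to nearest on truncation */
    if (scaled >= 65535.0f)       /* clamp instead of overflowing */
        return 65535;

    return (uint16_t)scaled;
}

/* Convert the 16-bit fixed-point value back to float. */
float from_fixed_u16(uint16_t q)
{
    return (float)q / SCALE;
}
```

Here 1.0f becomes 4096 and 0.25f becomes 1024, matching the comment, while 0.1f becomes 410, which converts back to about 0.1001 rather than exactly 0.1.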

score 2 · Answer 3 · answered May 30 '12 at 16:06

It's probable that your compiler uses IEEE 754 format for floats, which in bit terms, looks like this:

SEEEEEEEEFFFFFFFFFFFFFFFFFFFFFFF
^ bit 31                       ^ bit 0

S is the sign bit; S = 1 implies the number is negative.

E bits are the exponent. There are 8 exponent bits giving a range of 0 - 255 but the exponent is biased - you need to subtract 127 to get the true exponent.

F bits are the fraction part, however, you need to imagine an invisible 1 on the front so the fraction is always 1.something and all you see are the binary fraction digits.
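As a rough illustration of this layout, the three fields can be pulled apart with shifts and masks on a copy of the bits. This is a sketch assuming a 32-bit IEEE 754 float; `struct float_fields` and `decode_float` are names invented for this example, and the unbiased exponent is only meaningful for normal numbers (not for zero, denormals, infinities or NaN).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three fields of a 32-bit float. */
struct float_fields {
    unsigned sign;       /* S: 1 means negative */
    int      exponent;   /* E with the bias of 127 removed */
    uint32_t fraction;   /* F: 23 bits, without the implicit leading 1 */
};

struct float_fields decode_float(float f)
{
    struct float_fields out;
    uint32_t bits;

    assert(sizeof f == sizeof bits);   /* assumption: float is 32 bits wide */
    memcpy(&bits, &f, sizeof bits);    /* copy the bits without pointer casts */

    out.sign     = bits >> 31;                        /* bit 31 */
    out.exponent = (int)((bits >> 23) & 0xFFu) - 127; /* bits 30..23 */
    out.fraction = bits & 0x7FFFFFu;                  /* bits 22..0 */
    return out;
}
```

For 2.0f this gives sign 0, exponent 1 and fraction 0; for 3.0f the fraction is 0x400000, i.e. the binary digits of the .5 in 1.5.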

The number 2 is 1 x 2¹ = 1 x 2^{128 - 127} so is encoded as

01000000000000000000000000000000

So if you use a bit shift to shift it left by one you get

10000000000000000000000000000000

which by convention is -0 in IEEE 754, so rather than multiplying your number by 2, your shift has made it zero.

The number 3 is [1 + 0.5] x 2^{128 - 127}

which is represented as

01000000010000000000000000000000

Shifting that left gives you

10000000100000000000000000000000

which is -1 x 2^-126 or some very small number.

You can do the same for z, but you probably get the idea that shifting just screws up floating point numbers.

score 1 · Answer 4 · answered May 30 '12 at 15:44


Fixed point doesn't work that way. What you want to do is something like this:

#include <stdio.h>

int main(void){
    // fixed point numbers with 8 fractional bits
    unsigned int x = 2 << 8;
    unsigned int y = 3 << 8;
    unsigned int z = 1 << 8;

    // adding two numbers
    unsigned int a = x + y;

    // multiplying two numbers with fixed point adjustment
    unsigned int b = (x * y) >> 8;

    // use numbers: shifting right by 8 drops the fractional bits
    printf("%u %u\n", a >> 8, b >> 8);
    return 0;
}
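One thing to watch in the multiplication above: the product of two values with 8 fractional bits briefly has 16 fractional bits, so it can overflow 32 bits before the shift. A possible variation, sketched here under the same 8-bit format, widens the intermediate to 64 bits; the names `FRAC_BITS`, `fx_mul`, `fx_int_part` and `fx_frac_part` are made up for this example.

```c
#include <stdint.h>

#define FRAC_BITS 8   /* same format as above: 8 fractional bits */

/* Multiply two fixed-point values using a 64-bit intermediate,
   so the product cannot overflow before the adjusting shift. */
uint32_t fx_mul(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> FRAC_BITS);
}

/* Integer part of a fixed-point value. */
uint32_t fx_int_part(uint32_t v)
{
    return v >> FRAC_BITS;
}

/* Fractional part, in units of 1/256. */
uint32_t fx_frac_part(uint32_t v)
{
    return v & ((1u << FRAC_BITS) - 1u);
}
```

For example, 1.5 is stored as 384 and 2.25 as 576; `fx_mul(384, 576)` gives 864, which is integer part 3 and fractional part 96/256 = 0.375.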

answered May 30 '12 at 15:44

Tobias Schlegel


Correct. For Integers. But my problem is how to do that with floats. – user1410966 May 30 '12 at 15:50
You cannot do fixed point math with data in floating point representation. You can convert your floats to fixed-point ints and do what I wrote above, or you could emulate floating point arithmetic, but that is probably way too complicated. However, your original code does not treat floating point data in any way that would make sense. – Tobias Schlegel May 30 '12 at 15:56
@user1410966: You should read more to understand fixed point, floating point and binary rational numbers better. Every fixed-point type has a fixed position for the binary point; just shift the mantissa of the float to the correct position so that the integer and fractional parts have the correct values. – phuclv Sep 25 '13 at 08:42
