How to implement floating-point to fixed-point conversion, fixed-point to floating-point conversion, fixed-point addition/multiplication?
1 Answer
To use fixed-point arithmetic, we have to implement the addition and multiplication operations ourselves. For that, we first need to decide how many bits are allocated to the fractional part and how many to the integer part. The conversions and the multiplication then reduce to scaling or shifting by the number of fractional bits.
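In the following code snippet, I've implemented fixed-point by allocating 22 bits for the fractional part and 9 bits for the integer part (the remaining bit is the sign). The split is controlled by `fractional_bits`. Note that the listing as posted defines it as 30, which leaves only 1 integer bit, so set it to 22 to get the layout described here.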
In multiplication, I first widen each operand to 64 bits so that the product cannot overflow. After the multiplication, the product is shifted right by `fractional_bits` so that the result has the same number of fractional bits as the inputs.
In addition, I've added saturation to the output to avoid overflow: if the result does not fit, the output is clamped to the largest positive or the most negative representable value, depending on the sign of the true result.
#include <stdio.h>
#include <math.h>
#include <stdint.h>

#define fractional_bits 30
#define fixed_type_bits 32

typedef int32_t fixed_type;
typedef int64_t expand_type;

/* Clamp a widened intermediate result to the range of fixed_type. */
static fixed_type saturate(expand_type value)
{
    if (value > INT32_MAX)
    {
        return INT32_MAX;
    }
    if (value < INT32_MIN)
    {
        return INT32_MIN;
    }
    return (fixed_type)value;
}

/* The caller must keep inp inside the representable range. */
fixed_type float_to_fixed(float inp)
{
    return (fixed_type)(inp * (1 << fractional_bits));
}

float fixed_to_float(fixed_type inp)
{
    return ((float)inp) / (1 << fractional_bits);
}

fixed_type fixed_mult(fixed_type inp_1, fixed_type inp_2)
{
    /* Multiply in 64 bits, rescale, and saturate BEFORE narrowing to 32 bits.
       Right-shifting a negative value is implementation-defined in C;
       an arithmetic shift (as on common compilers) is assumed here. */
    expand_type mult = ((expand_type)inp_1 * (expand_type)inp_2) >> fractional_bits;
    return saturate(mult);
}

fixed_type fixed_add(fixed_type inp_1, fixed_type inp_2)
{
    /* Add in 64 bits so the sum itself can never overflow (no undefined behavior). */
    return saturate((expand_type)inp_1 + (expand_type)inp_2);
}
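The conversion step can also round and clamp instead of relying on the caller to stay in range. The following is only a minimal sketch under stated assumptions: it uses the 22-bit fractional layout described in the text, and the names `FRAC_BITS`, `float_to_fixed_sat` and `fixed_to_double` are hypothetical and not part of the listing above.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 22   /* assumed layout: 1 sign, 9 integer, 22 fractional bits */

typedef int32_t fixed_type;

/* Hypothetical helper: convert with round-to-nearest and saturation. */
fixed_type float_to_fixed_sat(double x)
{
    assert(x == x);   /* NaN has no fixed-point representation */

    /* Scale by 2^FRAC_BITS; the scale factor is exact in double. */
    double scaled = x * (double)(1 << FRAC_BITS);

    /* Clamp before converting, because an out-of-range
       double-to-integer cast is undefined behavior. */
    if (scaled >= 2147483647.0)
    {
        return INT32_MAX;
    }
    if (scaled <= -2147483648.0)
    {
        return INT32_MIN;
    }

    /* Round half away from zero, then truncate. */
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;
    return (fixed_type)scaled;
}

/* Hypothetical helper: convert back by dividing by 2^FRAC_BITS. */
double fixed_to_double(fixed_type v)
{
    return (double)v / (double)(1 << FRAC_BITS);
}
```

With this layout, 1.0 maps to the raw value `1 << 22`, and any input whose magnitude is 512 or more is clamped to the nearest end of the range.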

-
If this overflows: `fixed_type add = inp_1 + inp_2;` you are in undefined-behavior-land, i.e., an optimizing compiler might optimize away all your checks below that. Also, for multiplication you do not check for overflows and always truncate towards zero (which can lead to numerical drift over time). – chtz Jun 21 '20 at 13:37
-
Why is `fixed_add` so complicated? Why not just `return inp_1 + inp_2;`? – chqrlie Jun 21 '20 at 14:02
-
@chqrlie OP intends to implement saturated addition. Here would be a related question on that: https://stackoverflow.com/questions/17580118/signed-saturated-add-of-64-bit-ints – chtz Jun 21 '20 at 22:58
-
@chqrlie it is intended to keep the saturated value if overflow occurs. Consider a real scenario, if any overflow occurs in a mathematical operation, we don't want to return any "overflow" or "undefined" behaviour. Then we need to keep the maximum value that it can keep. (If it's a positive value, then we have to keep the maximum positive value and if it's a negative value, then we have to keep the maximum negative value) – Kavindu Vindika Jun 22 '20 at 06:01
-
OK, I missed this goal as you, the OP, did not mention it in the question nor in comments. If overflow detection and handling matters, you cannot compute `fixed_type add = inp_1 + inp_2;` before testing because arithmetic overflow has undefined behavior. – chqrlie Jun 22 '20 at 06:47
-
You should rephrase your question and include the code in your response in it. – chqrlie Jun 22 '20 at 06:47
-
Yes, indeed. Overflow has undefined behavior, but the user should not pass any value that can't be represented in **fixed_type** as an input to the addition or multiplication operation. Otherwise, it will definitely end up with undefined behavior. (Even with signed integers, this happens because if you try to give a value larger than the type can represent, it overflows.) But if an overflow happens because of the addition or multiplication operation itself, then the result will be saturated and the overflow avoided. – Kavindu Vindika Jun 22 '20 at 16:47
-
@K.vindi I don't think you understood our remarks regarding overflow. The result of the operation `inp1 + inp2` is undefined if it overflows. Your implementation might work with your setup, but it is not guaranteed to work. Read the answers to the linked question for more details on that. – chtz Jun 30 '20 at 18:49
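
The points raised in the comments above can be illustrated with a short sketch: test for overflow before performing the signed addition, and round the product instead of truncating it. This is an illustration under assumptions, not the answer's code. The names `FRAC_BITS`, `fixed_add_checked` and `fixed_mult_rounded` are hypothetical, and 30 fractional bits are assumed to match the `#define` in the listing.

```c
#include <stdint.h>

#define FRAC_BITS 30   /* assumed to match fractional_bits in the listing */

typedef int32_t fixed_type;

/* Saturating add that tests for overflow BEFORE adding, so the signed
   addition only executes when its result is representable. */
fixed_type fixed_add_checked(fixed_type a, fixed_type b)
{
    if (b > 0 && a > INT32_MAX - b)
    {
        return INT32_MAX;   /* true sum would exceed the maximum */
    }
    if (b < 0 && a < INT32_MIN - b)
    {
        return INT32_MIN;   /* true sum would fall below the minimum */
    }
    return a + b;
}

/* Saturating multiply that rounds to nearest instead of truncating. */
fixed_type fixed_mult_rounded(fixed_type a, fixed_type b)
{
    /* The 64-bit product of two 32-bit values cannot overflow. */
    int64_t prod = (int64_t)a * (int64_t)b;

    /* Add half of the output LSB before discarding the extra fraction bits. */
    prod += (int64_t)1 << (FRAC_BITS - 1);

    /* Right shift of a negative value is implementation-defined in C;
       an arithmetic shift is assumed here. */
    prod >>= FRAC_BITS;

    if (prod > INT32_MAX)
    {
        return INT32_MAX;
    }
    if (prod < INT32_MIN)
    {
        return INT32_MIN;
    }
    return (fixed_type)prod;
}
```

The pre-check in `fixed_add_checked` needs no wider type, which matters when the fixed-point type is already the widest integer available. Where a 64-bit intermediate exists, widening and clamping is an equally valid choice.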