How to perform a bitwise operation on floating point numbers

Question

I tried this:

float a = 1.4123;
a = a & (1 << 3);

I get a compiler error saying that the operand of & cannot be of type float.

When I do:

float a = 1.4123;
a = (int)a & (1 << 3);

The program compiles and runs. The only problem is that the bitwise operation is done on the integer obtained by truncating the number, not on the bits of the float itself.

The following is also not allowed.

float a = 1.4123;
a = (void*)a & (1 << 3);

I don't understand why int can be cast to void* but not float.

I am doing this to solve the problem described in Stack Overflow question How to solve linear equations using a genetic algorithm?.

What kind of bitwise operation are you attempting? Do you want to work with the IEEE 754 representation of a particular value? — Adam Goode, Nov 12 '09 at 16:41
yes, i want to use whatever binary representation is used by the implementation — Rohit Banga, Nov 12 '09 at 16:45
Incidentally, `a = a & (1<<3)` will clear all of the bits in `a` except for bit 3, which is usually not what you want in a genetic algorithm. To clear a single bit, you would use the bitwise complement operator and write something like `a = a & ~(1<<3)`. — mob, Nov 12 '09 at 17:30
that was just an example, i have a much more complex equation. — Rohit Banga, Nov 12 '09 at 17:31
@iamrohitbanga: Equation??? There's no meaningful "equation" in C++ that would require a bitwise operation on a floating-point type. — AnT stands with Russia, Nov 12 '09 at 19:34
Doesn't change a thing; there's also no meaningful expression in C++ requiring bitwise ops on floats. — MSalters, Nov 13 '09 at 09:33
a genetic algorithm requires that a floating point number expressed in bits move towards a better bitwise representation. right? — Rohit Banga, Nov 14 '09 at 04:46
@MSalters you could use XOR on two floating point numbers to swap their values quickly. — Patrick Roberts, Mar 05 '15 at 18:27
@PatrickRoberts: That's in fact not fast at all on CPU's made in the last decade. It introduces a virtual dependency which interferes with register allocations. Use `std::swap`. — MSalters, Mar 05 '15 at 22:55
@MSalters That's a fair argument, but it's still a meaningful expression requiring bitwise operations on floating point numbers nonetheless. — Patrick Roberts, Mar 06 '15 at 21:25

score 92 · Accepted Answer · edited Oct 25 '20 at 11:56

At the language level, there's no such thing as "bitwise operation on floating-point numbers". Bitwise operations in C/C++ work on value-representation of a number. And the value-representation of floating point numbers is not defined in C/C++ (unsigned integers are an exception in this regard, as their shift is defined as-if they are stored in 2's complement). Floating point numbers don't have bits at the level of value-representation, which is why you can't apply bitwise operations to them.

All you can do is analyze the bit content of the raw memory occupied by the floating-point number. For that you need to either use a union as suggested below or (equivalently, and only in C++) reinterpret the floating-point object as an array of unsigned char objects, as in

float f = 5;
unsigned char *c = reinterpret_cast<unsigned char *>(&f);
// inspect memory from c[0] to c[sizeof f - 1]

And please, don't try to reinterpret a float object as an int object, as other answers suggest. That doesn't make much sense, and is not guaranteed to work in compilers that follow strict-aliasing rules in optimization. The correct way to inspect memory content in C++ is by reinterpreting it as an array of [signed/unsigned] char.

Also note that you technically aren't guaranteed that floating-point representation on your system is IEEE754 (although in practice it is unless you explicitly allow it not to be, and then only with respect to -0.0, ±infinity and NaN).
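
As a rough illustration of the approach described above, here is a minimal sketch that reads the object representation of a float through unsigned char and reports whether the implementation claims IEEE 754 conformance. The helper names `object_bytes` and `dump_bytes` are made up for this example and are not part of any library. The byte order printed depends on the endianness of the machine.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <limits>

// Hypothetical helper: copies the object representation of a float
// into an array of unsigned char, one byte at a time.
std::array<unsigned char, sizeof(float)> object_bytes(const float& f)
{
    std::array<unsigned char, sizeof(float)> out{};
    // Reading any object through unsigned char* is permitted by the
    // aliasing rules, unlike reading it through int*.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&f);
    for (std::size_t i = 0; i < sizeof(float); ++i)
        out[i] = p[i];
    return out;
}

// Hypothetical helper: prints the bytes in memory order.
void dump_bytes(float f)
{
    if (!std::numeric_limits<float>::is_iec559)
        std::puts("note: float is not IEEE 754 on this implementation");
    for (unsigned char b : object_bytes(f))
        std::printf("%02x ", static_cast<unsigned>(b));
    std::puts("");
}
```

Calling `dump_bytes(5.0f)` prints `sizeof(float)` hexadecimal bytes; which byte holds the sign and exponent bits is system dependent.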

edited Oct 25 '20 at 11:56

Erkin Alp Güney

answered Nov 12 '09 at 17:25

AnT stands with Russia

17

Votes :) C and C++ languages are like math. The correctness of a formal statement is defined by hard facts and hard proofs, not by consensus of the majority. Majority (votes) doesn't matter. – AnT stands with Russia Nov 12 '09 at 17:39
6

@Chap: You are confused. The difference with `int` is *huge*. The size of `char` in machine bytes is system dependent, but the size of `char` at the language level is not. Size of `char` is always 1 at the language level, meaning that every other type's size is divisible by size of `char`. Additionally, `unsigned char` has no padding bits in it and all combinations of bits are valid. You can't say that about `int`. This all is why every object in C++ can be reinterpreted as an array of `char`s, but can't be reinterpreted as an [array of] `int`. – AnT stands with Russia Nov 12 '09 at 18:41
3

@Chap: What you are saying about system-dependent representation of `float` is true, but that's exactly the point of my answer. As I said, you can only inspect *raw memory* representation of a `float` object, which is synonymous with it being "system-dependent". The point is that if the OP *wants/needs* to inspect the raw memory representation of `float` for some reason, then that is the way to do it. – AnT stands with Russia Nov 12 '09 at 18:44
2

The IEEE has some floating point standards that are making floating-point numbers much more uniform. They still don't lend themselves to casual bitwise operations. – David Thornley Nov 12 '09 at 18:47
4

@Chap: There's a difference between doing something *implementation-defined* in C and something *undefined* in C. When I need to do something implementation-defined I would still prefer to: 1) keep system-dependency to a minimum, 2) if possible, avoid relying on undefined behavior. This is what makes `unsigned char` array solution better than an `int` solution. – AnT stands with Russia Nov 12 '09 at 19:52
1

@Chap: In short, even if you have to do something system-dependent, it is still not an excuse to invoke undefined behavior. Unless you have a *very very very* good reason, relying on UB is best avoided. Especially if there's an obvious alternative solution without any UB. – AnT stands with Russia Nov 12 '09 at 19:55
1

What is a value-representation? – Asad Saeeduddin Nov 11 '13 at 03:27
3

@Asad: The concept is defined in the C++ standard (3.9/4 of C++03, for example). Each object type has *object representation* and *value representation*. *Object representation* is the raw memory layout of the object, which includes both value-forming bits and padding bits. *Value representation* applies only to the value-forming bits and describes how these bits encode the target value. – AnT stands with Russia Nov 11 '13 at 04:38
1

Thanks! So just to see if I'm understanding this correctly, the reason you say this: "Floating point numbers don't have bits at the level of value-representation", is because floating point numbers consist of two distinct parts, i.e. the mantissa and the exponent (each of which have a separate value representation)? – Asad Saeeduddin Nov 11 '13 at 04:42
Actually there is a case where bitwise XOR for floats can make perfect sense, and this is the old trick for exchanging values between variables: say i want to do tmp=x1; x1=x2; x2=tmp This can be done faster as x1^=x2;x2^=x1;x1^=x2 – ntg Aug 04 '15 at 15:43
2

@ntg: Firstly, the XOR version is not faster. It is actually slower. (But this strange myth refuses to die for some reason.) Secondly, if one really wants to do it that way, it can (and should) be done by reinterpreting memory occupied by these values as integers. – AnT stands with Russia Aug 04 '15 at 16:49
@AnT: You, Sir, are right: "On modern CPU architectures, the XOR technique can be slower than using a temporary variable to do swapping." I live, therefore I learn ;) https://en.wikipedia.org/wiki/XOR_swap_algorithm#Reasons_for_avoidance_in_practice – ntg Aug 04 '15 at 17:07
1

What if I want a function to create pseudo-half-precision floating numbers by setting half of the significant digits to 0? Then I would ideally need a bitwise operation, no? – Aaron Franke Mar 20 '18 at 07:27
Wouldn't it make sense here to use a char8_t, so that we can rely on the representation of the char we cast to? – Valentin Metz May 18 '22 at 03:49

score 19 · Answer 2 · edited Feb 22 '12 at 20:47

If you are trying to change the bits in the floating-point representation, you could do something like this:

union fp_bit_twiddler {
    float f;
    int i;
} q;
q.f = a;
q.i &= (1 << 3);
a = q.f;

As AndreyT notes, accessing a union like this invokes undefined behavior, and the compiler could grow arms and strangle you. Do what he suggests instead.

edited Feb 22 '12 at 20:47

Phil Miller

answered Nov 12 '09 at 16:42

mob

10

Technically, this is undefined behavior. You can only access the member of a union that you last wrote to. – KeithB Nov 12 '09 at 16:57
1

Is it good practice to include a compile-time assert that the `float` and `int` have the same size? – Josh Lee Nov 12 '09 at 16:59
@KeithB: really, is it compiler dependent. does the standard not say, the bitwise representation would be the same. – Rohit Banga Nov 12 '09 at 17:00
Good points about making sure `int` and `float` are the same size. Tim Schaeffer's solution is more portable. – mob Nov 12 '09 at 17:11
@mobrule should i change my answer – Rohit Banga Nov 12 '09 at 17:32
any how i found that swapping random bits in the actual floating point representation does not lead to good chromosomes as already mentioned in the post quoted in my question. nevertheless, this union thing didn't strike me. so thanks. – Rohit Banga Nov 12 '09 at 17:34
@KeithB, I thought using a union like this was fine in C99? Maybe I misunderstand your point. Do you have a link describing what you mean? – Z boson Nov 25 '15 at 13:53
@Z boson I've been reading over and over again that reading from a union member which isn't the last written to is UB. I'm a bit confused about this as I keep on hearing there's no guarantees in theory, but in reality they occupy the same address, so I'm really confused about this. – Zebrafish Dec 03 '17 at 06:07

score 10 · Answer 3 · answered Sep 20 '19 at 16:31

You can work around the strict-aliasing rule and perform bitwise operations on a float type-punned as a uint32_t (if your implementation defines it, which most do) without undefined behavior by using memcpy():

#include <cstdint>
#include <cstring>

float a = 1.4123f;
std::uint32_t b;

std::memcpy(&b, &a, 4);
// perform bitwise operation
b &= 1u << 3;
std::memcpy(&a, &b, 4);
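
The same idea can be wrapped in small helper functions. The following is only a sketch: the names `float_to_bits`, `bits_to_float` and `clear_sign_bit` are invented for illustration, and the sign-bit example assumes the IEEE 754 binary32 layout, which the language does not guarantee. The `static_assert` makes the size assumption explicit at compile time.

```cpp
#include <cstdint>
#include <cstring>

// Assumption of this sketch: float occupies exactly 32 bits.
static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Hypothetical helper: copies the bytes of a float into an integer.
std::uint32_t float_to_bits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Hypothetical helper: copies the bytes of an integer into a float.
float bits_to_float(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Example bitwise operation: clear the most significant bit, which is
// the sign bit if float uses the IEEE 754 binary32 layout (assumed).
float clear_sign_bit(float f)
{
    return bits_to_float(float_to_bits(f) & 0x7FFFFFFFu);
}
```

Under those assumptions, `clear_sign_bit(-2.5f)` yields `2.5f`, and converting a value to bits and back returns the original value unchanged.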

answered Sep 20 '19 at 16:31

Patrick Roberts

This should be the recommended method when using the C language. – Nicholas Kinar Oct 13 '19 at 20:12
2

@NicholasKinar until the C++20 proposal for type punning with [`bit_cast`](https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8#c20-and-bit_cast) is standardized, I don't know of a better (or equally correct) solution in C++ either, other than side-stepping the issue and using `unsigned char*` like the accepted answer. – Patrick Roberts Oct 13 '19 at 20:37
After testing this, I can attest that the use of memcpy() works well for both C and C++. The method is clean, elegant and self-explanatory. The C++20 proposal for type punning will also be a useful addition to the language, but I can imagine using the memcpy() method for some time with C in the context of embedded systems programming. – Nicholas Kinar Oct 14 '19 at 18:34
1

Great answer, however for the sake of clarity I would use `sizeof(uint32_t)` instead of just 4. – Raleigh L. Dec 10 '21 at 19:29

score 8 · Answer 4 · edited Nov 12 '09 at 18:41

float a = 1.4123;
unsigned int* inta = reinterpret_cast<unsigned int*>(&a);
*inta = *inta & (1 << 3);

edited Nov 12 '09 at 18:41

answered Nov 12 '09 at 16:39

Chap

3

Or a little less verbose: `reinterpret_cast<unsigned int&>(a) &= (1 << 3)` – Aaron Nov 12 '09 at 16:41
2

Why not just `(int*)(void*)&a` ? – Cecil Has a Name Nov 12 '09 at 16:52
1

@Cecil Has a Name: using c++ casts – Chap Nov 12 '09 at 16:55
5

C++ casts (XXX_cast<>) are preferred because 1) they are easier to search for, and 2) reinterpret_cast makes it clear that you are doing something system dependent, and potentially dangerous. – KeithB Nov 12 '09 at 17:00
You should isolate this operation in a class for system specific operations since it is heavily dependent on the target system – Chap Nov 12 '09 at 18:45
2

Dereferencing pointer `inta` is Undefined Behavior (see strict aliasing). So this approach does not work. – user7860670 Sep 20 '19 at 07:48

score 5 · Answer 5 · answered Nov 13 '09 at 11:02

Have a look at the following. Inspired by fast inverse square root:

#include <iostream>
using namespace std;

int main()
{
    float x, td = 2.0;
    int ti = *(int*) &td;
    cout << "Cast int: " << ti << endl;
    ti = ti>>4;
    x = *(float*) &ti;
    cout << "Recast float: " << x << endl;
    return 0; 
}

answered Nov 13 '09 at 11:02

Justin

1

Dereferencing pointer `(int*) &td` as well as `(float*) &ti` is Undefined Behavior (see strict aliasing). So this approach does not work. – user7860670 Sep 20 '19 at 07:47

score 3 · Answer 6 · edited Sep 15 '19 at 07:25

FWIW, there is a real use case for bit-wise operations on floating point (I just ran into it recently) - shaders written for OpenGL implementations that only support older versions of GLSL (1.2 and earlier did not have support for bit-wise operators), and where there would be loss of precision if the floats were converted to ints.

The bit-wise operations can be implemented on floating point numbers using remainders (modulo) and inequality checks. For example:

float A = 0.625; //value to check; ie, 160/256
float mask = 0.25; //bit to check; ie, 1/4
bool result = (mod(A, 2.0 * mask) >= mask); //non-zero if bit 0.25 is on in A

The above assumes that A is in the range [0, 1) and that mask contains only one "bit" to check, but it could be generalized to more complex cases.

This idea is based on some of the info found in the question "Is it possible to implement bitwise operators using integer arithmetic?"

If there is not even a built-in mod function, then that can also be implemented fairly easily. For example:

float mod(float num, float den)
{
    return num - den * floor(num / den);
}
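
For readers who want to try the modulo trick outside a shader, here is a minimal C++ sketch of the same test. The function names `floored_mod` and `fractional_bit_set` are invented for this example, and it assumes the value lies in [0, 1) and that the mask is a single power of two, as in the GLSL snippet.

```cpp
#include <cmath>

// Floored modulo, mirroring the GLSL-style mod() shown above.
float floored_mod(float num, float den)
{
    return num - den * std::floor(num / den);
}

// Hypothetical helper: true if the binary digit with weight `mask`
// (a single power of two such as 0.5, 0.25 or 0.125) is set in `a`.
// Assumes 0 <= a < 1.
bool fractional_bit_set(float a, float mask)
{
    return floored_mod(a, 2.0f * mask) >= mask;
}
```

For 0.625, which is 0.101 in binary, the test reports the 0.5 and 0.125 digits as set and the 0.25 digit as clear.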

score 2 · Answer 7 · answered Nov 12 '09 at 16:56

@mobrule:

Better:

#include <stdint.h>
...
union fp_bit_twiddler {
    float f;
    uint32_t u;
} q;

/* mutatis mutandis ... */

For these values int will likely be ok, but generally, you should use unsigned ints for bit shifting to avoid the effects of arithmetic shifts. And the uint32_t will work even on systems whose ints are not 32 bits.

answered Nov 12 '09 at 16:56

Tim Schaeffer

3

Of course, this still won't work for systems whose floats are not 32 bits. – AnT stands with Russia Nov 12 '09 at 17:27
2

Floating point numbers nowadays usually follow the IEEE standards, so floats are usually 32 bits and doubles usually 64. There have got to be exceptions out there, but I haven't encountered them. However, assert(sizeof(float)==sizeof(uint32_t)); is easy to write. – David Thornley Nov 12 '09 at 18:48
1

Accessing integer member of the union after assigning float member leads to Undefined Behavior. So this approach does not work. – user7860670 Sep 20 '19 at 07:46

score 1 · Answer 8 · edited Feb 06 '12 at 19:26

The Python recipe "Floating point bitwise operations" implements bitwise operations on floats by treating each number as a binary expansion that extends infinitely to the left and to the right of the binary point. Because floating-point numbers have a signed zero on most architectures, it uses ones' complement to represent negative numbers (well, actually it just pretends to do so and uses a few tricks to achieve the appearance).

I'm sure it can be adapted to work in C++, but care must be taken so as to not let the right shifts overflow when equalizing the exponents.

score 1 · Answer 9 · answered Feb 08 '12 at 11:53

Bitwise operators should NOT be used on floats, because the floating-point representation is hardware specific, no matter how similar it looks across whatever hardware you happen to have. Which project or job do you want to risk on "well, it worked on my machine"? Instead, in C++, you can get a similar "feel" for the bit shift operators by overloading the shift operators on an "object" wrapper for a float:

// Simple object wrapper for float type as templates want classes.
class Float
{
    float m_f;
public:
    Float( const float & f )
    : m_f( f )
    {
    }

    operator float() const
    {
        return m_f;
    }
};

float operator>>( const Float & left, int right )
{
    float temp = left;
    for( ; right > 0; --right )
    {
        temp /= 2.0f;
    }
    return temp;
}

float operator<<( const Float & left, int right )
{
    float temp = left;
    for( ; right > 0; --right )
    {
        temp *= 2.0f;
    }
    return temp;
}

int main( int argc, char ** argv )
{
    int a1 = 40 >> 2; 
    int a2 = 40 << 2;
    int a3 = 13 >> 2;
    int a4 = 256 >> 2;
    int a5 = 255 >> 2;

    float f1 = Float( 40.0f ) >> 2; 
    float f2 = Float( 40.0f ) << 2;
    float f3 = Float( 13.0f ) >> 2;
    float f4 = Float( 256.0f ) >> 2;
    float f5 = Float( 255.0f ) >> 2;
}

You will have a remainder, which you can throw away based on your desired implementation.
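
As an alternative to the loops, scaling by a power of two can be done in one call with the standard library function `std::ldexp`, which multiplies a floating-point value by 2 raised to an integer power. The sketch below is not the wrapper from this answer; the names `shift_left` and `shift_right` are made up for illustration.

```cpp
#include <cmath>

// Hypothetical helper: x * 2^n, the floating-point analogue of x << n.
float shift_left(float x, int n)
{
    return std::ldexp(x, n);
}

// Hypothetical helper: x / 2^n, the floating-point analogue of x >> n.
// Unlike an integer shift, the fractional part is kept.
float shift_right(float x, int n)
{
    return std::ldexp(x, -n);
}
```

For example, `shift_right(13.0f, 2)` gives 3.25 where the integer expression `13 >> 2` gives 3; apply `std::floor` to the result if the fractional part should be discarded.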

IDK if all compilers will change the divide into a multiply by `0.5f`. It would be better (for performance reasons) to write it that way, to make sure you never get an FP div when you don't need one. — Peter Cordes, Jun 26 '15 at 02:37
