2

I have a Pandas dataframe, with 4 rows, and one of the columns (named limit) contains floating point values, where any zeros must be replaced with 9999999999 (9.999999999 billion). The column is set to the float32 data type, and I use the pandas.DataFrame.where method to do the replacement. But it's not working as expected because Numpy is rounding up 9999999999 to 10000000000 (10 billion).

I've tried this in iPython 3 (Python 3.6.8), Pandas version 0.24.2, Numpy version 1.14.0.

This is the replacement statement

df['limit'] = df['limit'].where(df['limit'] != 0, 9999999999)

I'm seeing the following column values for limit:

0    1.000000e+10
1    1.000000e+10
2    1.000000e+10
3    1.000000e+10

but I'm expecting

0    9999999999.0
1    9999999999.0
2    9999999999.0
3    9999999999.0

Why does the rounding up happen? This doesn't happen with plain Python

In [1]: (9.999999999) * 10**9                                                                                                    
Out[1]: 9999999999.0
Mazdak
  • 105,000
  • 18
  • 159
  • 188
srm
  • 548
  • 1
  • 4
  • 19
  • It smells like a floating point issue, see e.g. [this question](https://stackoverflow.com/questions/588004/is-floating-point-math-broken/588014), but someone with proper Pandas/NumPy knowledge should confirm. – Adriaan Apr 24 '19 at 10:12
  • A 32-bit floating point number is not precise enough to represent that number exactly. It is not an issue with Pandas, NumPy or Python, it's a limitation in [single-precision floating-point format](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) ([see example in C](https://rextester.com/LFPS36800)). – jdehesa Apr 24 '19 at 10:22

1 Answers1

2

This is simply because int32 is not capable of preserving that number. You can check this by calculating the number of bits needed for demonstrating that number:

In [24]: np.floor(np.log2(9999999999)) + 1
Out[24]: 34.0

As you can see you need at least 34 bits for demonstrating that number. Therefore you should use int64 as a larger data type for representing it.

Even if you test this by putting the number in a series with same data type you'll see the unexpected result (overflow) again:

In [25]: s = pd.Series([9999999999], dtype=pd.np.int32)

In [26]: s
Out[26]: 
0    1410065407
dtype: int32
Mazdak
  • 105,000
  • 18
  • 159
  • 188
  • I think they actually wanted to use float32, which cannot hold that number precisely, hence the "rounding". – jdehesa Apr 24 '19 at 10:23