
I'd like to use dtype='float32' (probably the numpy dtype, i.e. np.float32) instead of dtype='float64' to reduce the memory usage of my pandas DataFrame, because I have to handle huge DataFrames.

At one point, I'd like to extract a Python list with .to_dict(orient='records') in order to get a dictionary for each row.

In this case, I get additional decimal places, which are probably caused by something like this:

Is floating point math broken?

How can I cast the data / change the type etc. in order to get the same result as I get with float64 (see example snippets)?

import pandas as pd

_data = {'col1': [1.45123, 1.64123], 'col2': [0.1, 0.2]}

_test = pd.DataFrame(_data).astype(dtype='float64')

print(f"{_test=}")
print(f"{_test.round(1)=}")
print(f"{_test.to_dict(orient='records')=}")
print(f"{_test.round(1).to_dict(orient='records')=}")

float64 output:


_test=      col1  col2
0  1.45123   0.1
1  1.64123   0.2
_test.round(1)=   col1  col2
0   1.5   0.1
1   1.6   0.2
_test.to_dict(orient='records')=[{'col1': 1.45123, 'col2': 0.1}, {'col1': 1.64123, 'col2': 0.2}]
_test.round(1).to_dict(orient='records')=[{'col1': 1.5, 'col2': 0.1}, {'col1': 1.6, 'col2': 0.2}]
import pandas as pd

_data = {'col1': [1.45123, 1.64123], 'col2': [0.1, 0.2]}

_test = pd.DataFrame(_data).astype(dtype='float32')

print(f"{_test=}")
print(f"{_test.round(1)=}")
print(f"{_test.to_dict(orient='records')=}")
print(f"{_test.round(1).to_dict(orient='records')=}")

float32 output:

_test=      col1  col2
0  1.45123   0.1
1  1.64123   0.2
_test.round(1)=   col1  col2
0   1.5   0.1
1   1.6   0.2
_test.to_dict(orient='records')=[{'col1': 1.4512300491333008, 'col2': 0.10000000149011612}, {'col1': 1.6412299871444702, 'col2': 0.20000000298023224}]
_test.round(1).to_dict(orient='records')=[{'col1': 1.5, 'col2': 0.10000000149011612}, {'col1': 1.600000023841858, 'col2': 0.20000000298023224}]
  • `float32` and `float64` are `numpy` constructs, and by extension `pandas`. Double check this, but I bet your `dict` contains Python floats, which are essentially `float64`. Without examining the dataframe in detail, my guess is that conversion to `dict` will increase the memory use, regardless of the dataframe `dtype`. – hpaulj May 20 '22 at 20:30
  • @hpaulj I've also tried to create a numpy array with type `float32` with one decimal place but nothing change. Can you explain your idea? – Rene May 23 '22 at 13:42
  • A float that displays as `12.3` does not take up any less memory than a `12.3000234`. - 4 or 8 bytes depending on dtype. When converted to a string the extra decimals do make a difference. But we don't use strings to save memory. – hpaulj May 23 '22 at 15:15
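
hpaulj's point in the comments can be checked directly. This is a small sketch (assuming a recent pandas, where to_dict() boxes numpy scalars into native Python types): the dict values come back as plain Python floats, i.e. 64-bit doubles, so the float32 memory savings do not survive the conversion.

```python
import numpy as np
import pandas as pd

_test = pd.DataFrame({'col2': [0.1, 0.2]}).astype('float32')
records = _test.to_dict(orient='records')

# to_dict() returns plain Python floats (64-bit doubles), not
# np.float32 scalars; the extra digits are the exact float32 value
# widened to a double.
print(type(records[0]['col2']))          # <class 'float'>
print(records[0]['col2'])                # 0.10000000149011612
print(float(np.float32(0.1)))            # same value
```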

1 Answer


Managing float representations has some limitations, as for example the question linked above shows.

Calling to_dict() switches from the numpy representation to Python's native float representation, which means a sort of translation. Whatever precision you are using, some small pieces of the representation will be distorted along the way.

For a lossless conversion you must cast your numbers to string before calling to_dict(), using the astype() function:

import pandas as pd

_data = {'col1': [1.45123, 1.64123], 'col2': [0.1, 0.2]}
_test = pd.DataFrame(_data).astype(dtype='float32')

print(f"{_test.round(1).astype('str').to_dict(orient='records')=}")

Output:

_test.round(1).astype('str').to_dict(orient='records')=[{'col1': '1.5', 'col2': '0.1'}, {'col1': '1.6', 'col2': '0.2'}]

An alternative can be the Decimal type from Python's decimal module.
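
A possible sketch of the Decimal route (my own elaboration, not from the answer): go through the string representation, so each dict carries an exact decimal value instead of a binary float. A plain comprehension is used here to stay independent of pandas version differences around map/applymap.

```python
from decimal import Decimal

import pandas as pd

_data = {'col1': [1.45123, 1.64123], 'col2': [0.1, 0.2]}
_test = pd.DataFrame(_data).astype('float32')

# Round, stringify, then wrap each value in Decimal; Decimal('0.1')
# is exact, unlike the binary float 0.1.
records = [
    {k: Decimal(v) for k, v in row.items()}
    for row in _test.round(1).astype('str').to_dict(orient='records')
]
print(records)
# [{'col1': Decimal('1.5'), 'col2': Decimal('0.1')},
#  {'col1': Decimal('1.6'), 'col2': Decimal('0.2')}]
```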
