pandas.DataFrame.round doesn't seem to work on my DataFrames - Rounding issue causes extra data stored in csv files

Question

I've stumbled upon a small issue when using pandas DataFrame:

I have a big csv file (around 2Gb of data) containing the price of an asset and created using the DataFrame.to_csv() function of Pandas, and when I take a closer inspection of the code, my first lines look like this:

DateTime,open,high,low,close
2016-01-04 00:36:18,1.08505,1.08505,1.08504,1.08504
2016-01-04 00:36:19,1.08505,1.08505,1.08504,1.08504
2016-01-04 00:36:20,1.08503,1.08503,1.08495,1.08495
2016-01-04 00:36:21,1.0849600000000001,1.0849600000000001,1.0849600000000001,1.0849600000000001
2016-01-04 00:36:22,1.0849600000000001,1.0849600000000001,1.08492,1.08492

The data was created using the df.resample('1s').ohlc() and I thought that sometimes there was a few rounding issue, so I tried to round the DataFrame using df.round(5)to keep the last 5 decimals, but it doesn't change anything at all.

SEC = pd.read_csv("D:\Finance python\Data\EUR_USD\Sec\S1_2015.csv",index_col='DateTime',parse_dates=True,error_bad_lines=False,infer_datetime_format=True)
SEC = SEC.round(5)

The DataFrame stays the same, and I truly wonder why.

When I try it with a csv file containing the 5 rows I gave above:

In[13]: SEC["open"][3]
Out[13]: 1.0849599999999999

It's not an issue when doing calculation over the df (even though it might be faster the less decimals there is), but it seems like a lot of 0 or 9 are being stored in my csv files for nothing, and are taking extra space.

It also seems that even value that look fine in the csv file, are actually not well rounded when called with pandas.

Would anyone have an idea of why the DataFrame are not being rounded properly, or of a solution to have shorter csv files when I save them with pandas?

Thanks in advance

Edit: I tried to use the Decimal method, but it still doesn't work. I believe that it is because pandas is not able to store Decimal type numbers in dataframes, thus converting it to a float.

@Wen Did you try it with the rows I provided? I tried with only those 5 rows, and I ecounter the same problem, it doesn't round the values (I added it in the question) — Erlinska, Nov 02 '17 at 03:35

Ken Wei · Answer 1 · 2017-11-02T04:42:39.683

3

This has to do with the precision of floating point arithmetic; not all numbers can be represented exactly. If you want to set pandas to display numbers to 5 decimal places, you can do

pd.set_options('display.float_format','{:.5f}')

but the internal representation will stay the same (which after reading your post closely, will not solve your issues with the csv).

If you want to change the internal representation, you need to use a Decimal data type:

from decimal import Decimal
df.round(5).astype(Decimal)

FYI, you can reproduce your problem like this:

import numpy as np
np.float64(1.08496)

edited Nov 02 '17 at 04:42

answered Nov 02 '17 at 04:34

Ken Wei

3,020
1
10
30

It doesn't work actually, the data stays the same. When I check the type of the data in the df, it stays a float number, not a decimal one, I guess pandas doesn't allow the data in dataframe to have a decimal type. – Erlinska Nov 02 '17 at 22:37
You have to reassign the dataframe e.g. `df = df.round(5).astype(Decimal)`. But if you are only interested in the data that gets written to the csv, then you'll definitely prefer using the `float_format` argument in `.to_csv` (I had completely forgotten about that). – Ken Wei Nov 03 '17 at 01:30
2

I get `TypeError: dtype '' not understood` from `df = df.round(0).astype(Decimal)` – Gathide May 01 '20 at 06:05

score 3 · Answer 2 · answered Nov 02 '17 at 23:34

3

I found what the issue was on another post: float64 with pandas to_csv

I need to use the argument float_format='%.5fto have my csv file as I want them, the issue is linked to the way float number work.

answered Nov 02 '17 at 23:34

Erlinska

433
5
16

score 0 · Answer 3 · answered Nov 02 '17 at 04:33

I try to run your code:

df = pd.read_clipboard(sep=',',engine='python')
df

DateTime    open    high    low close
0   2016-01-04 00:36:18 1.08505 1.08505 1.08504 1.08504
1   2016-01-04 00:36:19 1.08505 1.08505 1.08504 1.08504
2   2016-01-04 00:36:20 1.08503 1.08503 1.08495 1.08495

and then use df.round(3)

    DateTime    open    high    low close
0   2016-01-04 00:36:18 1.085   1.085   1.085   1.085
1   2016-01-04 00:36:19 1.085   1.085   1.085   1.085
2   2016-01-04 00:36:20 1.085   1.085   1.085   1.085

It works for me ,but as I df.astype(str).round(3) and then it will not work .So I recommend you to check the type of your data.

pandas.DataFrame.round doesn't seem to work on my DataFrames - Rounding issue causes extra data stored in csv files

3 Answers3

Linked