
The issue

I have a CSV file containing large integer values on which I want to perform some arithmetic operations. These fields may contain NaN values. When I load the file into a df using the pandas read_csv method and no NaN values are present, the columns are loaded as 'int' and the precision seems correct; but when NaN values are present, the columns get converted to 'float' and I see a precision loss.

The sample csv file ->

,epoch_1,epoch_2
0,1665045912937687151,1665045912937689151
1,,

After loading ->

[1] df = pd.read_csv('sample.csv', index_col=0)

[2] df
        epoch_1       epoch_2
0  1.665046e+18  1.665046e+18
1           NaN           NaN

[3] df['diff'] = df['epoch_2'] - df['epoch_1']

[4] df
        epoch_1       epoch_2    diff
0  1.665046e+18  1.665046e+18  2048.0
1           NaN           NaN     NaN

As you can see, the diff column has an incorrect value; the correct value should be 2000.

If there are no NaN values, the computed result is correct.
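
A quick way to see where the 2048 comes from is to round-trip the raw values through float64 outside of pandas (a small sketch, not part of the original data):

import numpy as np

x = 1665045912937687151
y = 1665045912937689151

# float64 has a 53-bit mantissa, so integers of this size (~1.7e18) are rounded
# to the nearest representable double, which at this magnitude is a multiple of 256
print(int(np.float64(x)))                  # 1665045912937687040
print(int(np.float64(y)))                  # 1665045912937689088
print(int(np.float64(y) - np.float64(x)))  # 2048, the same wrong diff as above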

What I've tried

I've tried specifying the dtype as Int64 while loading the data

[1] df = pd.read_csv('sample.csv', index_col=0, dtype={'epoch_1': pd.Int64Dtype(), 'epoch_2': pd.Int64Dtype()})

[2] df
               epoch_1              epoch_2
0  1665045912937687296  1665045912937689088
1                 <NA>                 <NA>

[3] df['diff'] = df['epoch_2'] - df['epoch_1']

[4] df
               epoch_1              epoch_2  diff
0  1665045912937687296  1665045912937689088  1792
1                 <NA>                 <NA>  <NA>

As you can see, this also results in precision loss and, in turn, an incorrect result.

Workaround I don't want to use

What I can do is load the data as str, remove the rows with NaN values, then convert these fields to 'int64' and calculate the result; this gives the correct result:

[1] df = pd.read_csv('sample.csv', index_col=0, dtype={'epoch_1': str, 'epoch_2': str})

[2] df
               epoch_1              epoch_2
0  1665045912937687151  1665045912937689151
1                  NaN                  NaN

[3] df = df[~df['epoch_1'].isna()]

[4] df['diff'] = df['epoch_2'].astype(int) - df['epoch_1'].astype(int)

[5] df
               epoch_1              epoch_2  diff
0  1665045912937687151  1665045912937689151  2000

But I need to keep the entries with NaN values in the final df, so I will have to add those entries back. This method spends a lot of computation just on conversions and will become a bottleneck as the size of the df and the number of fields to compute grow. It also isn't very elegant, so I'm looking for a better way to achieve this.
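
For completeness, the "add those entries back" step can be done by computing on the clean subset and reindexing the result onto the full index (a sketch along the lines of the workaround above, assuming the columns were loaded as str and the df itself is not filtered):

# Compute the diff only where both epochs are present, then reindex so the
# NaN rows reappear in the final df (their diff simply stays missing).
mask = df['epoch_1'].notna() & df['epoch_2'].notna()
clean = df.loc[mask]
df['diff'] = (clean['epoch_2'].astype('int64') - clean['epoch_1'].astype('int64')).reindex(df.index)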

UPDATE

Another thing that seems to work:

[1] df = pd.read_csv('sample.csv', index_col=0, dtype=str)

[2] df
               epoch_1              epoch_2
0  1665045912937687151  1665045912937689151
1                  NaN                  NaN

[3] df['diff'] = df['epoch_2'].astype('Int64') - df['epoch_1'].astype('Int64')

[4] df
               epoch_1              epoch_2  diff
0  1665045912937687151  1665045912937689151  2000
1                  NaN                  NaN  <NA>

It seems better than dropping NaN values and adding them back again, though this too requires a type conversion before the operations, which I would like to avoid if possible.
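
If the conversion has to happen anyway, doing it once for all epoch columns right after loading keeps the arithmetic itself clean (same astype('Int64') idea as above, just hoisted out):

# One-off conversion of every epoch column to the nullable Int64 dtype;
# all later arithmetic then runs on exact integers and propagates <NA>.
epoch_cols = ['epoch_1', 'epoch_2']
df[epoch_cols] = df[epoch_cols].astype('Int64')
df['diff'] = df['epoch_2'] - df['epoch_1']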

This also raises another question: why does it lose precision when the dtype for the columns is specified as Int64 in read_csv, but work correctly when the data is loaded as str and then converted to Int64? Does read_csv internally load the data as float64 and then convert it to the specified dtype?

  • Does this answer your question? [read_csv using dtypes but there is na value in columns](https://stackoverflow.com/questions/52002271/read-csv-using-dtypes-but-there-is-na-value-in-columns) – Firefighting Physicist Oct 17 '22 at 08:53
  • https://stackoverflow.com/questions/11548005/numpy-or-pandas-keeping-array-type-as-integer-while-having-a-nan-value – Firefighting Physicist Oct 17 '22 at 08:55
  • @FirefightingPhysicist not exactly what I'm looking for; these all load the data into one type and use some workaround to convert it to an integer type, which I'm trying to avoid. I want a solution where I don't need to do type conversions after loading the data, if possible. Thanks for the help, though. – K S Oct 17 '22 at 09:02
  • Have you tried `df = pd.read_csv('./file.csv', dtype='Int64')` as the links suggest? – Firefighting Physicist Oct 17 '22 at 09:08
  • @FirefightingPhysicist Yes, tried that, it gives incorrect result (1792, instead of 2000) – K S Oct 17 '22 at 09:13
  • indeed strange... it is independent of read_csv as `pd.Series([1665045912937689151, np.nan], dtype='Int64')` is also wrong – Firefighting Physicist Oct 17 '22 at 09:29
  • @KS pandas `read_csv` will auto-detect the data type and may not pick the accuracy you need. Start by specifying data types for your columns, e.g. dtype='Int64'. – dank8 Mar 03 '23 at 03:47

4 Answers


Yes, it is unfortunate that pandas does not natively support its new extension dtypes (like the nullable integer arrays) yet. The work to be done is tracked in https://github.com/pandas-dev/pandas/issues/29752.

The relevant update for pd.read_csv has just landed in main (see https://github.com/pandas-dev/pandas/pull/48776) and is scheduled for the next pandas release, 1.6.0. (EDIT: the version due in December has recently been renamed to 2.0.0.)

You can already test it with the pandas nightly wheels from the scipy-wheels-nightly index:

mamba create -n test_pandas -c conda-forge python pandas pip
mamba activate test_pandas
pip install --pre --upgrade --extra-index https://pypi.anaconda.org/scipy-wheels-nightly/simple pandas
In [5]: pd.__version__
Out[5]: '1.6.0.dev0+350.g2f7dce4e6e'

In [6]: pd.read_csv("sample.csv", use_nullable_dtypes=True, index_col=0).assign(diff=lambda df: df.epoch_2 - df.epoch_1)
Out[6]:
               epoch_1              epoch_2  diff
0  1665045912937687151  1665045912937689151  2000
1                 <NA>                 <NA>  <NA>
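
For anyone reading this after the 2.0 release: as far as I know the option ended up being exposed via the dtype_backend keyword there rather than use_nullable_dtypes, so the equivalent call would be something like:

import pandas as pd

# pandas >= 2.0: opt in to nullable dtypes via dtype_backend
df = pd.read_csv("sample.csv", index_col=0, dtype_backend="numpy_nullable")
df["diff"] = df["epoch_2"] - df["epoch_1"]
print(df)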
Jonas Hörsch

Very interesting, and also very strange. What I came up with is a workaround that preserves NaN values:

import math
import numpy as np

def diff(x, y):
    # Keep missing rows as NaN; otherwise subtract as 64-bit integers
    # so no float rounding is involved.
    if math.isnan(float(x)) or math.isnan(float(y)):
        return np.nan
    return np.int64(y) - np.int64(x)

df['diff'] = df.apply(lambda row: diff(row['epoch_1'], row['epoch_2']), axis=1)
imburningbabe
  • thanks a lot for your answer, but if you read the update that I've just added, this is achieved by converting to Int64 before applying the arithmetic operations (which I guess internally does what you are doing, since it is a wrapper for the int64 data type), and I think it might have a smaller overhead in terms of performance. Also, do you have any idea why simply loading the data as Int64, instead of loading it as string and then converting to Int64, doesn't work? – K S Oct 17 '22 at 09:12
  • I tried the lambda function because Python wouldn't let me perform `.astype('Int64')` when I have `NaN`s – imburningbabe Oct 17 '22 at 09:16

Interesting that df = pd.read_csv('./file.csv', dtype='Int64') does not work in this case. It is an experimental feature and seems to break here. There seems to be a lot of work ongoing regarding pd.NA and np.nan (e.g. here), so it is likely a bug.

Notice that t = pd.array([1665045912937689151, np.nan], dtype='Int64') fails as well, as it ends up with [1665045912937689088, <NA>]. The problem seems to be the difference between np.nan and pd.NA as s = pd.array([1665045912937689151, pd.NA], dtype='Int64') yields the correct [1665045912937689151, <NA>]. Probably you will have to wait till np.nan is switched to pd.NA in pd.read_csv.
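
As a runnable illustration of that difference (the expected outputs are the values reported above):

import numpy as np
import pandas as pd

# With np.nan as the missing value the large integer loses precision:
t = pd.array([1665045912937689151, np.nan], dtype='Int64')
print(t)   # [1665045912937689088, <NA>]

# With pd.NA the exact value is preserved:
s = pd.array([1665045912937689151, pd.NA], dtype='Int64')
print(s)   # [1665045912937689151, <NA>]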


By default pandas casts integers to float when there are empty or NaN values, which leads to precision loss if you have large integers. To overcome this, use na_filter=False in read_csv().

Solution:

import pandas as pd
import numpy as np

df = pd.read_csv('sample.csv', index_col=0, na_filter=False)

print(df)

Output:

               epoch_1              epoch_2
0  1665045912937687151  1665045912937689151
1                                          
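
Note that with na_filter=False the empty cells come back as empty strings and the columns stay as object/str, so computing the diff still needs a conversion step. Continuing from the snippet above, one way (borrowing the astype('Int64') idea from the question's update) could be:

# Empty strings are not NaN, so map them to a missing value first,
# then convert to the nullable Int64 dtype before subtracting.
df['diff'] = (df['epoch_2'].replace('', np.nan).astype('Int64')
              - df['epoch_1'].replace('', np.nan).astype('Int64'))
print(df)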
rafathasan