The issue
I have a CSV file containing large integer values that I want to perform some arithmetic operations on, and these fields might contain NaN values. When I load the file into a df using the pandas read_csv method, columns with no NaN values are loaded as 'int' and the precision seems to be correct, but when NaN values are present, the columns get converted to 'float' and I see a precision loss.
The sample csv file ->
,epoch_1,epoch_2
0,1665045912937687151,1665045912937689151
1,,
After loading ->
[1] df = pd.read_csv('sample.csv', index_col=0)
[2] df
epoch_1 epoch_2
0 1.665046e+18 1.665046e+18
1 NaN NaN
[3] df['diff'] = df['epoch_2'] - df['epoch_1']
[4] df
epoch_1 epoch_2 diff
0 1.665046e+18 1.665046e+18 2048.0
1 NaN NaN NaN
As you can see, the third column ('diff') has an incorrect value; the correct value should be 2000.
If there are no NaN values, the computed result is correct.
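To illustrate the contrast, here is a minimal check on the same two values without the empty row (the inline CSV string is just for illustration, it is not my real file):

import io
import pandas as pd

# Same sample data, but without the empty second row
csv_no_nan = ",epoch_1,epoch_2\n0,1665045912937687151,1665045912937689151\n"
clean = pd.read_csv(io.StringIO(csv_no_nan), index_col=0)

print(clean.dtypes)                          # both columns come back as int64
print(clean['epoch_2'] - clean['epoch_1'])   # 2000, exact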
What I've tried
I've tried specifying the dtype as Int64 while loading the data:
[1] df = pd.read_csv('sample.csv', index_col=0, dtype={'epoch_1': pd.Int64Dtype(), 'epoch_2': pd.Int64Dtype()})
[2] df
epoch_1 epoch_2
0 1665045912937687296 1665045912937689088
1 <NA> <NA>
[3] df['diff'] = df['epoch_2'] - df['epoch_1']
[4] df
epoch_1 epoch_2 diff
0 1665045912937687296 1665045912937689088 1792
1 <NA> <NA> <NA>
As you can see, this also results in precision loss and in turn an incorrect result.
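A quick sanity check (using the df from the snippet above) confirms that the parsed value no longer matches the literal in the file:

# The file contains 1665045912937687151
print(int(df.loc[0, 'epoch_1']))                      # 1665045912937687296
print(df.loc[0, 'epoch_1'] == 1665045912937687151)    # False -- the value drifted during parsing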
Workaround I don't want to use
What I can do is load the data as str, remove the rows with NaN values, convert these fields to 'int64', and then calculate the result, which gives the correct answer:
[1] df = pd.read_csv('sample.csv', index_col=0, dtype={'epoch_1': str, 'epoch_2': str})
[2] df
epoch_1 epoch_2
0 1665045912937687151 1665045912937689151
1 NaN NaN
[3] df = df[~df['epoch_1'].isna()]
[4] df['diff'] = df['epoch_2'].astype(int) - df['epoch_1'].astype(int)
[5] df
epoch_1 epoch_2 diff
0 1665045912937687151 1665045912937689151 2000
But I need to keep the entries with NaN values in the final df, so I will have to add those entries back. This method spends a lot of computation just on doing conversions and will become a bottleneck as the size of the df and the number of fields to compute grow; it also isn't very elegant, so I'm looking for a better way to achieve this.
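Roughly what the full round trip would look like (just a sketch; the valid/diff names are mine, and the index-aligned assignment is what brings the NaN rows back):

df = pd.read_csv('sample.csv', index_col=0, dtype=str)
valid = df.dropna(subset=['epoch_1', 'epoch_2'])
diff = valid['epoch_2'].astype('int64') - valid['epoch_1'].astype('int64')
df['diff'] = diff   # aligns on the index; the dropped rows get NaN and the column becomes float64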
UPDATE
Another thing that seems to work:
[1] df = pd.read_csv('sample.csv', index_col=0, dtype=str)
[2] df
epoch_1 epoch_2
0 1665045912937687151 1665045912937689151
1 NaN NaN
[3] df['diff'] = df['epoch_2'].astype('Int64') - df['epoch_1'].astype('Int64')
[4] df
epoch_1 epoch_2 diff
0 1665045912937687151 1665045912937689151 2000
1 NaN NaN <NA>
It seems better than dropping the NaN values and adding them back again, though this too requires a type conversion before the operations, which I would like to avoid if possible.
This also raises another doubt: why does it lose precision when the dtype for the columns is specified as Int64 in read_csv, but work correctly when the data is loaded as str and then converted to Int64? Does read_csv internally load the data as float64 and then convert it to the specified dtype?
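If read_csv does go through a float64 intermediate, that alone would explain the loss, since float64 has a 53-bit significand and cannot represent integers of this size exactly. A quick standalone check (not pandas-specific):

import numpy as np

x = 1665045912937687151
print(2**53)                     # 9007199254740992 -- the exact-integer limit for float64
print(int(np.float64(x)))        # a nearby representable value, not x
print(int(np.float64(x)) == x)   # False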