
I have written a Python script which loads one or several CSV files, concatenates them, and writes the result into a single new CSV file. I have noticed that certain values are modified during this operation, being shifted by very small amounts. As an example:

Original CSV:

Index SomeValue
0.000000    0.000
1.000000    0.000
2.000000    0.000
3.000000    0.000
4.000000    2.527
5.000000    0.000

Saved CSV:

Index SomeValue
0.0 0.0
1.0 0.0
2.0 0.0
3.0 0.0
4.0 2.5269999999999997
5.0 0.0

This looks like a floating-point precision error to me, but I don't know what causes it. The pandas core of my script, which is called in a loop, is:

l_tmpCsv_st = pd.read_csv(l_listElement_tc, sep='\t', index_col=0)
l_listOfCsvFiles_tst.append(l_tmpCsv_st)
# Fill NaN cells with the string "missing" to distinguish between a true NaN
# and a value that is only missing because of lacking padding
l_listOfCsvFiles_tst[-1] = l_listOfCsvFiles_tst[-1].fillna(value='missing')

# Concatenate the current csv file with the previous ones
csvFusion = pd.concat([csvFusion, l_listOfCsvFiles_tst[-1]], axis=1)
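For context: `pd.concat` with `axis=1` aligns the frames on their index and fills non-overlapping rows with NaN, which is why the padding step after the loop is needed. A tiny sketch with made-up frames of different lengths:

```python
import pandas as pd

# Two hypothetical frames mimicking CSV files sampled at different rates.
high = pd.DataFrame({"fast": [1.0, 2.0, 3.0, 4.0]}, index=[0, 1, 2, 3])
low = pd.DataFrame({"slow": [10.0, 30.0]}, index=[0, 2])

# axis=1 places the frames side by side, aligned on the index; rows 1 and 3
# have no 'slow' value, so pandas inserts NaN there.
merged = pd.concat([high, low], axis=1)
print(merged)
```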

And after the loop:

# Pad missing values of lower-frequency files (forward fill)
csvFusion = csvFusion.fillna(method='pad')
# Determine which columns need to be deleted (all "Unnamed" columns are
# read_csv artifacts and need to be removed)
l_listColumnsToDelete_tst = [col for col in csvFusion.columns if 'Unnamed' in col]
# Drop these columns
csvFusion.drop(l_listColumnsToDelete_tst, axis=1, inplace=True)
# Write everything to file
csvFusion.to_csv(l_endFile_tc, sep='\t', decimal=',', na_rep='-')
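A self-contained sketch of those post-loop steps on hypothetical data, using `ffill()` (the modern spelling of `fillna(method='pad')`, which newer pandas versions deprecate):

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame: a gap in 'slow' plus a stray 'Unnamed: 2'
# column of the kind read_csv produces for trailing separators.
merged = pd.DataFrame({
    "fast": [1.0, 2.0, 3.0],
    "slow": [10.0, np.nan, 30.0],
    "Unnamed: 2": [np.nan, np.nan, np.nan],
})

# Forward-fill: each NaN takes the last valid value above it.
merged = merged.ffill()

# Drop every column whose name contains 'Unnamed'.
to_drop = [col for col in merged.columns if "Unnamed" in col]
merged = merged.drop(columns=to_drop)
print(merged)
```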

The rest of my script is unrelated to pandas and would only harm readability, thus I have removed it from my copy/paste.

How could I avoid this issue?

Thanks in advance,

Edit:

It was indeed a floating-point error. Rounding every value to a sufficiently high number of decimal places solved it:

for col in csvFusion.columns:
    csvFusion[col] = csvFusion[col].round(15)
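As a hedged aside, `DataFrame.round` should achieve the same in one call: it rounds only the numeric columns and leaves non-numeric ones (such as cells filled with the string 'missing') untouched. A small sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "value": [2.5269999999999997, 0.0],
    "flag": ["missing", "ok"],   # non-numeric column, skipped by round()
})

# One call instead of a per-column loop.
df = df.round(15)
print(df)
```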
1 Answer


I think you need the parameter float_format in to_csv, because of floating-point precision:

print(df.to_csv(float_format='%.3f'))
Index,SomeValue
0.000,0.000
1.000,0.000
2.000,0.000
3.000,0.000
4.000,2.527
5.000,0.000

I think you can use round:

 df['SomeValue'] = df['SomeValue'].round(3)
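A runnable version of both suggestions, assuming a frame holding the example values. Note the difference: `float_format` only changes the written text, while `round` changes the stored values themselves.

```python
import pandas as pd

df = pd.DataFrame({"SomeValue": [0.0, 0.0, 0.0, 0.0, 2.527, 0.0]})

# Option 1: format at write time; the in-memory doubles are untouched.
csv_text = df.to_csv(float_format='%.3f')
print(csv_text)

# Option 2: round the stored values before writing.
df["SomeValue"] = df["SomeValue"].round(3)
```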
  • But printing it won't change the fact that the stored number is probably the one shown in his answer, not this one – Ander Biguri Mar 29 '16 at 09:06
  • I am storing several values of different sizes. I have provided an example with only three digits, but some values may have a higher precision, and I don't really like the idea of storing every number in %.15f, as it would bloat my end file. – Domack Mar 29 '16 at 09:13
  • Yes, understand. Is possible use [`round`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.round.html), e.g. `df['SomeValue'] = df['SomeValue'].round(3)`? – jezrael Mar 29 '16 at 09:15
  • Wouldn't rounding these values cause the same issue, as the way these numbers are stored does not change? I'll give it a try, thanks. – Domack Mar 29 '16 at 09:17
  • Hmmm, hard question. Maybe it can help, maybe not. But I have no other idea for this problem. – jezrael Mar 29 '16 at 09:19
  • It worked. I just have to iterate over every column. I'll edit my question to integrate this answer. Thanks a lot. :) – Domack Mar 29 '16 at 09:21