
I know using == for float is generally not safe. But does it work for the below scenario?

  1. Read from csv file A.csv, save first half of the data to csv file B.csv without doing anything.
  2. Read from both A.csv and B.csv. Use == to check if data match everywhere in the first half.

These are all done with Pandas. The columns in A.csv have types datetime, string, and float. Obviously == works for datetime and string, so if == works for float as well in this case, it saves a lot of work.

It seems to be working for all my tests, but can I assume it will work all the time?

jf328

2 Answers


The same string representation will be parsed to the same float representation when put through the same parse routine. Floating-point inaccuracy becomes an issue when mathematical operations are performed on the values, or when high-precision representations are involved; plain equality between values parsed from the same low-precision strings is nothing to worry about.
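
As a quick sketch of this claim (my illustration, not part of the original answer): parsing the same decimal string twice always yields bit-identical floats, even though the stored value itself is not exactly 0.007:

```python
# Parsing the same string with the same routine gives identical floats.
x = float("0.007")
y = float("0.007")
print(x == y)         # True: same parse routine, same bits
print(x == 7 / 1000)  # also True: both are the nearest double to 0.007
```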

Ignacio Vazquez-Abrams

No, you cannot assume that this will work all the time.

For this to work, you need to know that the text value written out by Pandas when it's writing to a CSV file recovers the exact same value when read back in (again using Pandas). But by default, the Pandas read_csv function sacrifices accuracy for speed, and so the parsing operation does not automatically recover the same float.

To demonstrate this, try the following: we'll create some random values, write them out to a CSV file, and read them back in, all using Pandas. First the necessary imports:

>>> import pandas as pd
>>> import numpy as np

Now create some random values, and put them into a Pandas Series object:

>>> test_values = np.random.rand(10000)
>>> s = pd.Series(test_values, name='test_values')

Now we use the to_csv method to write these values out to a file, and then read the contents of that file back into a DataFrame:

>>> s.to_csv('test.csv', header=True)
>>> df = pd.read_csv('test.csv')

Finally, let's extract the values from the relevant column of df and compare. We'll sum the result of the == operation to find out how many of the 10000 input values were recovered exactly.

>>> sum(test_values == df['test_values'])
7808

So approximately 78% of the values were recovered correctly; the others were not.

This behaviour is considered a feature of Pandas, rather than a bug. However, there's a workaround: Pandas 0.15 added a new float_precision argument to the CSV reader. By supplying float_precision='round_trip' to the read_csv operation, Pandas uses a slower but more accurate parser. Trying that on the example above, we get the values recovered perfectly:

>>> df = pd.read_csv('test.csv', float_precision='round_trip')
>>> sum(test_values == df['test_values'])
10000

Here's a second example, going in the other direction. The previous example showed that writing and then reading doesn't give back the same data. This example shows that reading and then writing doesn't preserve the data, either. The setup closely matches the one you describe in the question. First we'll create A.csv, this time using regularly-spaced values instead of random ones:

>>> import pandas as pd, numpy as np
>>> s = pd.Series(np.arange(10**4) / 1e3, name='test_values')
>>> s.to_csv('A.csv', header=True)

Now we read A.csv, and write the first half of the data back out again to B.csv, as in your Step 1.

>>> recovered_s = pd.read_csv('A.csv').test_values
>>> recovered_s[:5000].to_csv('B.csv', header=True)

Then we read in both A.csv and B.csv, and compare the first half of A with B, as in your Step 2.

>>> a = pd.read_csv('A.csv').test_values
>>> b = pd.read_csv('B.csv').test_values
>>> (a[:5000] == b).all()
False
>>> (a[:5000] == b).sum()
4251

So again, several of the values don't compare correctly. Opening up the files, A.csv looks pretty much as I expect. Here are the first 16 entries in A.csv:

,test_values
0,0.0
1,0.001
2,0.002
3,0.003
4,0.004
5,0.005
6,0.006
7,0.007
8,0.008
9,0.009
10,0.01
11,0.011
12,0.012
13,0.013
14,0.014
15,0.015

And here are the corresponding entries in B.csv:

,test_values
0,0.0
1,0.001
2,0.002
3,0.003
4,0.004
5,0.005
6,0.006
7,0.006999999999999999
8,0.008
9,0.009000000000000001
10,0.01
11,0.011000000000000001
12,0.012
13,0.013000000000000001
14,0.013999999999999999
15,0.015

See this bug report for more information on the introduction of the float_precision keyword to read_csv.
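
As a sketch of the fix applied to the question's workflow (my addition, not part of the original answer): if every read in Steps 1 and 2 uses float_precision='round_trip', each read recovers exactly the floats that were written, so the comparison succeeds:

```python
import numpy as np
import pandas as pd

# Recreate A.csv as in the example above.
s = pd.Series(np.arange(10**4) / 1e3, name='test_values')
s.to_csv('A.csv', header=True)

# Step 1, this time reading with the accurate parser before writing B.csv.
recovered = pd.read_csv('A.csv', float_precision='round_trip').test_values
recovered[:5000].to_csv('B.csv', header=True)

# Step 2: read both files back with the accurate parser and compare.
a = pd.read_csv('A.csv', float_precision='round_trip').test_values
b = pd.read_csv('B.csv', float_precision='round_trip').test_values
print((a[:5000] == b).all())  # True
```

The key point is that the accurate parser has to be used in Step 1 as well: if B.csv is written from values that were already perturbed by the default parser, no amount of care in Step 2 can make them match A.csv again.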

Mark Dickinson