No, you cannot assume that this will work all the time.
For this to work, you need to know that the text value written out by Pandas when writing to a CSV file recovers the exact same value when read back in (again using Pandas). But by default, the Pandas `read_csv` function sacrifices accuracy for speed, and so the parsing operation does not automatically recover the original float.
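As background (not part of the question itself): the loss doesn't happen on the writing side. Python's `repr` of a float is the shortest decimal string that parses back to exactly the same float, so the text in the CSV uniquely identifies the original value; it's the fast parser that fails to get back to it. A quick sketch:

```python
# Python's repr of a float round-trips exactly, so converting a float
# to its default text form and back loses nothing.
x = 0.007
assert float(repr(x)) == x

# The guarantee holds for arbitrary floats, not just "nice" ones.
import random
random.seed(0)
for _ in range(10_000):
    y = random.random()
    assert float(repr(y)) == y
print("all values round-tripped exactly")
```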
To demonstrate this, try the following: we'll create some random values, write them out to a CSV file, and read them back in, all using Pandas. First the necessary imports:
>>> import pandas as pd
>>> import numpy as np
Now create some random values and put them into a Pandas `Series` object:
>>> test_values = np.random.rand(10000)
>>> s = pd.Series(test_values, name='test_values')
Now we use the `to_csv` method to write these values out to a file, and then read the contents of that file back into a `DataFrame`:
>>> s.to_csv('test.csv', header=True)
>>> df = pd.read_csv('test.csv')
Finally, let's extract the values from the relevant column of `df` and compare. We'll sum the result of the `==` operation to find out how many of the 10000 input values were recovered exactly.
>>> sum(test_values == df['test_values'])
7808
So approximately 78% of the values survived the round trip; the rest did not. (The exact count will vary from run to run, since the input values are random.)
This behaviour is considered a feature of Pandas, rather than a bug. However, there's a workaround: Pandas 0.15 added a new `float_precision` argument to the CSV reader. By supplying `float_precision='round_trip'` to the `read_csv` operation, Pandas uses a slower but more accurate parser. Trying that on the example above, we get the values recovered perfectly:
>>> df = pd.read_csv('test.csv', float_precision='round_trip')
>>> sum(test_values == df['test_values'])
10000
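If you want to check this behaviour without touching the filesystem, the same round trip can be reproduced with an in-memory buffer (a sketch; the seed, size, and column name are arbitrary choices, not from the question):

```python
import io

import numpy as np
import pandas as pd

# Arbitrary test data.
rng = np.random.default_rng(0)
values = rng.random(1000)
s = pd.Series(values, name='test_values')

# Write the series to an in-memory CSV buffer instead of a file on disk.
buf = io.StringIO()
s.to_csv(buf, header=True)

# Read it back twice: once with the default (fast) parser, and once
# with the slower round-trip parser.
buf.seek(0)
fast = pd.read_csv(buf)['test_values'].to_numpy()
buf.seek(0)
exact = pd.read_csv(buf, float_precision='round_trip')['test_values'].to_numpy()

print((values == fast).sum())    # usually fewer than 1000 exact matches
print((values == exact).sum())   # 1000: every value recovered exactly
```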
Here's a second example, going in the other direction. The previous example showed that writing and then reading doesn't give back the same data. This example shows that reading and then writing doesn't preserve the data, either. The setup closely matches the one you describe in the question. First we'll create `A.csv`, this time using regularly-spaced values instead of random ones:
>>> import pandas as pd, numpy as np
>>> s = pd.Series(np.arange(10**4) / 1e3, name='test_values')
>>> s.to_csv('A.csv', header=True)
Now we read `A.csv`, and write the first half of the data back out again to `B.csv`, as in your Step 1.
>>> recovered_s = pd.read_csv('A.csv').test_values
>>> recovered_s[:5000].to_csv('B.csv', header=True)
Then we read in both `A.csv` and `B.csv`, and compare the first half of `A` with `B`, as in your Step 2.
>>> a = pd.read_csv('A.csv').test_values
>>> b = pd.read_csv('B.csv').test_values
>>> (a[:5000] == b).all()
False
>>> (a[:5000] == b).sum()
4251
So again, many of the values don't compare equal. Opening up the files, `A.csv` looks much as I'd expect. Here are its first 16 entries:
,test_values
0,0.0
1,0.001
2,0.002
3,0.003
4,0.004
5,0.005
6,0.006
7,0.007
8,0.008
9,0.009
10,0.01
11,0.011
12,0.012
13,0.013
14,0.014
15,0.015
And here are the corresponding entries in `B.csv`:
,test_values
0,0.0
1,0.001
2,0.002
3,0.003
4,0.004
5,0.005
6,0.006
7,0.006999999999999999
8,0.008
9,0.009000000000000001
10,0.01
11,0.011000000000000001
12,0.012
13,0.013000000000000001
14,0.013999999999999999
15,0.015
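Notice how small the discrepancies are: each wrong value differs from the original only in the last bit or so of the double. If bit-exact equality isn't a requirement, a tolerance-based comparison such as NumPy's `isclose` (my suggestion, not something from the question) treats the round-tripped values as equal:

```python
import numpy as np

a = 0.007                  # the value written to A.csv
b = 0.006999999999999999   # the value that came back from B.csv

print(a == b)            # False: these are two distinct doubles
print(abs(a - b))        # tiny: on the order of 1e-18
print(np.isclose(a, b))  # True: equal within the default tolerances
```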
See this bug report for more information on the introduction of the `float_precision` keyword to `read_csv`.