I tried to import some data from an Excel file into a pandas DataFrame, export it to a CSV file, and read it back in. The round trip is a necessary step because I need to do further file-based handling on the exported CSV file later on.
For the sake of data integrity, the exported and re-imported data should be the same. So I compared the different DataFrames and found that they are not the same, at least according to pandas' .equals() function.
I thought this might be related to string encoding when exporting and re-importing the data, since I had to convert character encodings while handling the files. However, I was able to reproduce similar behavior without any encoding-related issues as follows:
import pandas as pd
import numpy as np
# https://stackoverflow.com/a/32752318
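df1 = pd.DataFrame(np.random.randint(0, 10, size=(10, 4)), columns=list('ABCD'))

# Round trip without writing the index
df1.to_csv('foo.csv', index=False)
df2 = pd.read_csv('foo.csv')

# Round trip with the index written as an extra column
df1.to_csv('bar.csv', index=True)
df3 = pd.read_csv('bar.csv')

# Compare the original and the re-imported frames
print(df1.equals(df2), df1.equals(df3), df2.equals(df3))
print(all(df1 == df2))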
Why does .equals() tell me that the DataFrames differ, while all(df1 == df2) tells me they are equal? According to the docs, .equals() even considers NaNs at the same locations to be equal, whereas df1 == df2 should not. Because of this, comparing DataFrames with .equals() should be less strict than df1 == df2, yet the two do not return the same result in the example I provided.
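To make my reading of the docs concrete, here is a minimal sketch of the two behaviors I am comparing. The small frames and variable names are made up for illustration and are not part of my real data. The second half reflects another statement in the .equals() docs, namely that corresponding columns must have the same dtype.

```python
import numpy as np
import pandas as pd

# Two frames with a NaN in the same position (illustrative data only)
a = pd.DataFrame({"x": [1.0, np.nan]})
b = pd.DataFrame({"x": [1.0, np.nan]})

nan_equals = a.equals(b)                    # NaNs at the same location count as equal
nan_elementwise = bool((a == b).all().all())  # element-wise, NaN == NaN is False

# Same values, different integer dtypes
c = pd.DataFrame({"x": np.array([1, 2], dtype="int32")})
d = pd.DataFrame({"x": np.array([1, 2], dtype="int64")})

dtype_equals = c.equals(d)                    # dtypes differ
dtype_elementwise = bool((c == d).all().all())  # values are compared, dtypes are not

print(nan_equals, nan_elementwise)
print(dtype_equals, dtype_elementwise)
```

So .equals() is more lenient about NaNs but, if I read the docs correctly, stricter about dtypes, which means neither comparison is strictly "weaker" than the other.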
Which criteria do df1 == df2 and df1.equals(df2) consider that I am not aware of? I assume that the implementation inside pandas is correct (I did not look into the implementation itself, but export and re-import should be a standard interface test case). What am I doing wrong, then?
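For anyone who wants to dig in, below is a rough diagnostic sketch of the checks that seem relevant. It makes two assumptions of mine: it uses an in-memory buffer instead of foo.csv, and it forces df1 to int32 so that a dtype change during the round trip shows up on any platform (I have not confirmed that this is what happens with my real data). It also checks what the built-in all() actually iterates over when given a DataFrame.

```python
import io

import numpy as np
import pandas as pd

# Assumption: force int32 so a dtype mismatch after the round trip is reproducible
df1 = pd.DataFrame(
    np.random.randint(0, 10, size=(10, 4)).astype("int32"),
    columns=list("ABCD"),
)

# Round trip through an in-memory CSV buffer instead of a file on disk
buf = io.StringIO()
df1.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)

# Structural checks: shape, column labels, dtypes
shapes_match = df1.shape == df2.shape
columns_match = list(df1.columns) == list(df2.columns)
dtypes_match = list(df1.dtypes) == list(df2.dtypes)
print(shapes_match, columns_match, dtypes_match)

# Cell-by-cell check: reduce over rows, then over columns
values_equal = bool((df1 == df2).all().all())
print(values_equal)

# The built-in all() iterates over the column labels of a DataFrame,
# so it is truthy here even though every cell of the comparison is False
builtin_all_misleads = all(df1 == df2 + 1)
print(builtin_all_misleads)

# assert_frame_equal raises with a message naming the first difference it finds
try:
    pd.testing.assert_frame_equal(df1, df2)
    strict_equal = True
except AssertionError as err:
    strict_equal = False
    print(err)

# With dtype checking disabled, the same comparison passes
pd.testing.assert_frame_equal(df1, df2, check_dtype=False)
```

If the dtypes differ after the round trip, passing an explicit dtype to pd.read_csv (or calling .astype on the result) before comparing might be one way to get .equals() to agree, but I have not verified this against my actual Excel data.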