0

I tried to import some data from an Excel file to a pandas DataFrame, convert it into a csv file and read it back in (need to do some further file based handling on that exported csv file later on, so that is a necessary step).

For the sake of data integrity, exported and re-imported data should be the same. So, I compared the different DataFrames and encountered, that these are not the same, at least according to pandas' .equals() function.

I thought this might be an issue related to string encoding when exporting and re-importing the data since I had to transfer char encoding etc. while file handling. However, I was able to reproduce similar behavior without any encoding-related issues as follows:

import pandas as pd
import numpy as np

# https://stackoverflow.com/a/32752318
df1 = pd.DataFrame(np.random.randint(0, 10, size=(10, 4)), columns=list('ABCD'))

df1.to_csv('foo.csv', index=False)
df2 = pd.read_csv('foo.csv')

df1.to_csv('bar.csv', index=True)
df3 = pd.read_csv('bar.csv')

print(df1.equals(df2), df1.equals(df3), df2.equals(df3))

print(all(df1 == df2))

Why does .equals() tell that the DataFrames differ, but all(df1 == df2) tells they are equal? According to the docs, .equals() even considers NaNs at same locations to be equal, whereas df1 == df2 should not. Due to this, comparing different DataFrames with .equals() is less strict than df1 == df2, but does not return the same result in the example I provided.

Which criteria do df1 == df2 and df1.equals(df2) consider I am not aware of? I assume, that the implementation inside pandas is correct (did not look into the implementation inside the code itself, but export and re-import should be a standard interface test case). What am I doing wrong then?

albert
  • 8,027
  • 10
  • 48
  • 84

2 Answers2

2

I think that df1.equals(df2) return False because it takes into account the DataFrame dtype. df1 should have int32 columns, while df2 should have int64 columns (you can use the info() method to verify it). You can specify the df2 dtype as follow in order to have the same dtype of df1:

df2 = pd.read_csv('foo.csv', dtype=np.int32)

if dtype is the same, .equals() should return True

FiloCara
  • 666
  • 7
  • 7
0

When you write dataframe to .csv format with index=True ; it adds up extra column with name Unnamed: 0. That's why both .equals() and all(df1 == df2) tells dataframes are different. But, if you write .csv with index=False it will not add up an extra column and you will get output .csv equal to input dataframe. If you don't care about dataframe index you can set index=False while writing dataframe to .csv or use pd.read_csv('bar.csv').drop(['Unnamed: 0'],axis=1) while reading csv.

Sopan Kurkute
  • 128
  • 1
  • 9