
I have an original pandas dataframe, let's call it df. I convert the dataframe to a CSV file, then convert it back to a pandas dataframe. When I call df.equals(new dataframe), it returns False. I thought one possible error could be that the indexing was off, so I set the new dataframe's index to the first column of the CSV file (which is the index of the original dataframe), but I'm still getting the same result.

Example code:

import pandas as pd

df = <stuff here that aggregates other dataframes into one>
file_name = 'test/aggregated_reports.csv'
df.to_csv(file_name)
df2 = pd.read_csv(file_name, index_col=0)
assert df.equals(df2)

I did some manual testing by converting df2 to a CSV again and comparing the two CSVs (file_name and the one created from df2.to_csv()), and they appeared to be identical, so I'm assuming the "difference" occurs when converting the original dataframe to a CSV file. But I still can't quite figure it out...

Any insights on what may be causing the "difference" here would be greatly appreciated!

Louis
  • Possible duplicate of [Precision lost while using read_csv in pandas](https://stackoverflow.com/questions/36909368/precision-lost-while-using-read-csv-in-pandas) – EliadL Jun 23 '19 at 12:35

1 Answer


This may just be a rounding error (I'm assuming your data is numeric). If you're storing floating-point numbers as text, reading them back in tends to introduce a slight error. See below - try comparing the numeric data using a difference rather than .equals().

import pandas as pd
import numpy as np

df = pd.DataFrame(
    columns=['a', 'b', 'c'],
    index=[0, 1, 2, 3] * 3,
    data=np.random.random((12, 3)))

file_name = 'mydata.csv'
df.to_csv(file_name)
df2 = pd.read_csv(file_name, index_col=0)

print(df.equals(df2))                            # Returns False
print(np.all(np.abs(df - df2) < 10 ** -10))      # Returns True
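If the frame is entirely numeric, NumPy's allclose does the same tolerance check in one call. A minimal sketch (the file name roundtrip.csv is just a placeholder):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((4, 3)), columns=['a', 'b', 'c'])

# round-trip through CSV (placeholder file name)
df.to_csv('roundtrip.csv')
df2 = pd.read_csv('roundtrip.csv', index_col=0)

# element-wise equality within a small tolerance
# (defaults: rtol=1e-05, atol=1e-08)
print(np.allclose(df, df2))  # True
```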

Some other options to look at:

compare = (df == df2)      # Dataframe of True/False
compare.all()              # By column, True if all values are equal
compare.sum()              # By column, how many values are equal

# Return any rows where there was a difference
df.where(~compare).dropna(how='all')
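For a frame that mixes strings, numbers, and NaN (as described in the comments), one workaround sketch is to split off the numeric columns with select_dtypes and compare those with a tolerance, while comparing everything else exactly. The sample data and file name here are made up for illustration:

```python
import numpy as np
import pandas as pd

# hypothetical mixed frame: strings, floats, and missing values
df = pd.DataFrame({
    'name': ['x', 'y', np.nan],
    'value': [0.1, np.nan, 2.5],
})

df.to_csv('mixed.csv')
df2 = pd.read_csv('mixed.csv', index_col=0)

# numeric columns: compare with a tolerance, treating NaN == NaN
numeric = df.select_dtypes(include=[np.number]).columns
close = np.isclose(df[numeric], df2[numeric], equal_nan=True).all()

# everything else: compare exactly (.equals also treats NaN == NaN)
exact = df.drop(columns=numeric).equals(df2.drop(columns=numeric))

print(close and exact)  # True
```

Recent pandas versions also provide pd.testing.assert_frame_equal(df, df2, check_exact=False), which raises a descriptive error pointing at the first mismatching column instead of returning a bare False.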
Simon Bowly
  • Ah didn't even think of floats, that's a good point. My dataframe is actually a report that has a mix of strings and numbers, hence (df - df2) wouldn't work. Is there a workaround here that would still take into account potential float rounding errors? – Louis Nov 28 '17 at 01:10
  • If this makes my question more defined, my dataframe has a mixture of strings, numbers, and emptiness (nan). – Louis Nov 28 '17 at 01:15
  • Oh ok, I'm not sure if there is a way to ignore strings. You could try (df == df2), which will return a dataframe of True/False values showing which ones differ. Aggregating that with .all() or .count(), which work on columns, would narrow down where the issues are. – Simon Bowly Nov 28 '17 at 03:01
  • I've added a few other options to the answer, hopefully they help narrow down which entries are different. If you're still stuck, can you show some sample data in your question? – Simon Bowly Nov 28 '17 at 03:10
  • I have the same issue. Index, columns, datatypes are all the same.... – Soerendip Sep 07 '20 at 03:39