I am dealing with file-to-file and file-to-SQL comparisons. Because my data is large, I am forced to read it in chunks using the chunksize option in pandas. For testing purposes I used the same data in both the file-to-file and the file-to-SQL comparison. The file-to-file comparison works fine. However, when I use chunksize with read_sql_query, the first chunk is processed correctly, but I get the following message when the second chunk is processed:

ValueError: Can only compare identically-labeled DataFrame objects

The error occurs specifically at this line:

ne_stacked = (src_df != tgt_df).stack()

I checked for any difference in the columns, but they look identical:

print(src_df.columns)

Index(['firstname', 'lastname', 'account_num', 'salary', 'rental_income', 'int_yield', 'dividend', 'royalty', 'mortgage', 'car_loan', 'rent', 'other_expense', 'created_at', 'updated_at'], dtype='object')

print(tgt_df.columns)

Index(['firstname', 'lastname', 'account_num', 'salary', 'rental_income', 'int_yield', 'dividend', 'royalty', 'mortgage', 'car_loan', 'rent', 'other_expense', 'created_at', 'updated_at'], dtype='object')

Can you please help me figure out what is going on?

snakecharmerb
Vaibhav
  • That's perhaps a problem with the index. Does this help? https://stackoverflow.com/a/18548888/14394522 – Rivers Oct 24 '20 at 10:41
  • No, it did not help. I added `tgt_df.sort_index(inplace=True)` but I am still getting the same issue. – Vaibhav Oct 24 '20 at 11:10
  • What about this: `(src_df.sort_index().sort_index(axis=1)) != (tgt_df.sort_index().sort_index(axis=1))`? You could also try `.reset_index(drop=True, inplace=True)` on each dataframe. – Rivers Oct 25 '20 at 15:01
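
The comments point at the row index rather than the columns, and that can be checked in isolation. The sketch below is only an illustration: the two small frames are made-up stand-ins for a second chunk from each source, and it assumes (this is not confirmed in the question) that the file chunk keeps a continuing index while the SQL chunk starts again at 0. Under that assumption the column labels match but the row labels do not, which is enough to trigger the error, and resetting the index on both sides lets the comparison run.

```python
import pandas as pd

# Hypothetical second chunk from each source: same columns, same row count,
# but different row labels (assumed for illustration, not taken from the question).
src_df = pd.DataFrame(
    {"account_num": [103, 104], "salary": [50000, 62000]}, index=[2, 3]
)
tgt_df = pd.DataFrame(
    {"account_num": [103, 104], "salary": [50000, 61000]}, index=[0, 1]
)

# Printing .columns alone cannot reveal the problem; the index must match too.
print(src_df.columns.equals(tgt_df.columns))  # True
print(src_df.index.equals(tgt_df.index))      # False

# The element-wise comparison refuses frames whose labels differ.
try:
    (src_df != tgt_df).stack()
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# Compare by position instead: drop the old row labels on both sides.
src_aligned = src_df.reset_index(drop=True)
tgt_aligned = tgt_df.reset_index(drop=True)

ne_stacked = (src_aligned != tgt_aligned).stack()
changed = ne_stacked[ne_stacked]  # keep only the cells that differ
print(changed)
```

If the real chunks behave the same way, printing `src_df.index` and `tgt_df.index` for the second chunk should show the mismatch directly. Note that comparing by position only makes sense when both sources return rows in the same order.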
