I have a use case where I need to compare a file from an s3 bucket with the output of a sql query.
Below is how I am reading the s3 file.
chunk = pd.read_feather(BytesIO(obj.get()['Body'].read()))
if s3_data is None:
    s3_data = chunk
else:
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    s3_data = pd.concat([s3_data, chunk], ignore_index=True)
Below is how I am reading from the database.
db_data = pd.read_sql(query, connection)  # query and connection are placeholders
Now I need to compare s3_data with db_data. Both of these dataframes are huge, with up to 2 million rows each.
The format of the data is somewhat like below.
name | Age | Gender
--------------------
Peter| 30 | Male
Tom | 24 | Male
Riya | 28 | Female
Now I need to validate whether exactly the same rows, with the same column data, exist in both the s3 file and the db.
I tried using dataframe.compare(), but the kind of results it gives is not what I am looking for. The position of a row in the dataframe is not relevant for me: if 'Tom' appears at row 1 in one of the dataframes and at a different position in the other, it should still pass the equality validation, as long as the record itself, with the same column values, is present in both. dataframe.compare() is not helping me in this case.
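To make the requirement concrete, here is a minimal sketch of the order-insensitive check I have in mind (the two small frames are hypothetical stand-ins for s3_data and db_data; it assumes both frames have the same columns):

```python
import pandas as pd

# Stand-ins for s3_data and db_data: same rows, different order.
s3_data = pd.DataFrame(
    {"name": ["Peter", "Tom", "Riya"], "Age": [30, 24, 28], "Gender": ["Male", "Male", "Female"]}
)
db_data = pd.DataFrame(
    {"name": ["Tom", "Riya", "Peter"], "Age": [24, 28, 30], "Gender": ["Male", "Female", "Male"]}
)

# Sort both frames by every column, reset the index, then compare.
cols = list(s3_data.columns)
s3_sorted = s3_data.sort_values(cols).reset_index(drop=True)
db_sorted = db_data.sort_values(cols).reset_index(drop=True)

print(s3_sorted.equals(db_sorted))  # True: same records regardless of row position
```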
The alternate approach I took was to use csv_diff. I merged all the columns into one, as shown below, and saved the result locally as a csv, creating two csv files: one for the s3 data and one for the db data.
data
------------
Peter+30+Male
Tom+24+Male
Riya+28+Female
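The merged column itself is built in pandas before writing anything out; a minimal sketch (the frame here is a toy stand-in for either dataframe):

```python
import pandas as pd

# Toy frame standing in for s3_data / db_data.
df = pd.DataFrame(
    {"name": ["Peter", "Tom", "Riya"], "Age": [30, 24, 28], "Gender": ["Male", "Male", "Female"]}
)

# Cast every column to string, then join each row into one '+'-separated value.
merged = df.astype(str).agg("+".join, axis=1)
print(merged.tolist())  # ['Peter+30+Male', 'Tom+24+Male', 'Riya+28+Female']
```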
Then, using the code below, I am comparing the files.
from csv_diff import load_csv, compare

s3_file = load_csv(open("s3.csv"), key="data")
db_file = load_csv(open("db.csv"), key="data")
diff = compare(s3_file, db_file)
This works, but it is not very performant: I first have to write huge csv files (as large as 500 MB) locally, then read them back and compare them this way.
Is there a better way to handle the comparison without writing files to local disk, while still ensuring that entire records (every column value for a given row) are compared for equality irrespective of the row's position in the dataframe?
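For reference, one direction I have been experimenting with is an outer merge kept entirely in memory, which also reports which side each unmatched row came from (a sketch with toy frames; it assumes both frames share the same columns and that duplicate rows do not matter):

```python
import pandas as pd

# Toy frames; db_data has one row that s3_data lacks.
s3_data = pd.DataFrame(
    {"name": ["Peter", "Tom"], "Age": [30, 24], "Gender": ["Male", "Male"]}
)
db_data = pd.DataFrame(
    {"name": ["Tom", "Peter", "Riya"], "Age": [24, 30, 28], "Gender": ["Male", "Male", "Female"]}
)

# Outer merge on all shared columns; indicator flags one-sided rows.
diff = s3_data.merge(db_data, how="outer", indicator=True)
mismatches = diff[diff["_merge"] != "both"]
print(mismatches)  # the 'Riya' row, flagged as right_only
```

This treats the rows as a set, so if duplicate rows and their multiplicity matter, comparing value_counts() over all columns instead would catch that.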