Why pandas can't process more than 550 dataframes in this case

Question

import pandas as pd

file1 = 'Total.xlsx'
df1 = pd.read_excel(file1)

file2 = 'Recent.xlsx'
df2 = pd.read_excel(file2)

non_matching_rows = []

for index1, row1 in df1.iterrows():
    row_matches = False
    
    for index2, row2 in df2.iterrows():
        if row1.equals(row2):
            row_matches = True
            break
    
    if not row_matches:
        non_matching_rows.append(row1)

non_matching_df = pd.DataFrame(non_matching_rows)

display(non_matching_df)
print(non_matching_df.count())

Total.xlsx contains almost 40k records and Recent.xlsx has almost 36k records. I needed to find the remaining 4k records that is unique in Total.xlsx. I tried the above code, but it is not working for the entire excel files. When I tried to reduce the records in both the files, it can process and produce accurate results [only to 550 records]. Any file more than 500 records it is not working (I tried chunk size also, didn't win). Any suggestions??

score 1 · Accepted Answer · answered Jul 22 '23 at 05:13

Your code takes too long for large files due to O(n^2) time complexity. Use merge() method instead of iterating over the rows of the dataframes. Here's an example:

merged_df = pd.merge(df1, df2, how='outer', indicator=True)
non_matching_df = merged_df[merged_df['_merge'] == 'left_only'].drop('_merge', axis=1)

display(non_matching_df)
print(non_matching_df.count())

The indicator=True parameter adds a column _merge to the merged dataframe that indicates whether each row is present in both dataframes (both), only in the left dataframe (left_only), or only in the right dataframe (right_only).

Why pandas can't process more than 550 dataframes in this case

1 Answers1