Is there an efficient way to compare two dataframes of different sizes?

Question

I found this post, but it's not quite my scenario. Is there an efficient way of comparing two data frames

The reason I want to compare two dataframes is that I am looking for changes that may have occured. (Think "audit"). The two frames have exactly the same column layout, just that one may have more or less rows than the other, and values may have changed in two of the columns.

       ID       Period          Price       OtherID
0      10100001 2019-10-01     8995.00         ABBD
1      38730001 2019-11-01    38227.71         EIRU
2      30100453 2019-12-01    22307.00         DDHF
3      92835543 2020-01-01     2310.00         DDGF
4      66453422 2020-02-01    12113.29         DGFH


       ID       Period          Price       OtherID
0      10100001 2019-10-01        5.00         ABBD
1      38730001 2019-11-01    38227.71         XXXX
2      30100453 2019-12-01    22307.00         DDHF
3      92835543 2020-01-01     2310.00         DDGF
4      66453422 2020-02-01    12113.29         DGFH
5      22223422 2020-02-01      123.29         HHYG

The two columns that I am suspicious about are "Price" and "OtherID". I want to find any changes in the Price or OtherID.

I see three scenarios:

1. row has been added
2. row has been deleted
3. row doesn't match

I can iterate through it all, but I'm wondering if there is some pandas magic out there that will do it in one fell swoop.

The output I seek is something like this:

       ID       Period         Analysis
0      10100001 2019-10-01     Changed
1      38730001 2019-11-01     Changed
3      22223422 2020-02-01     New

And just to be clear, ID by itself is not unique. And Period by itself is also not unique. The two together are.

Use `merge` on a common column (ID) with `indicator=True` and `how='outer'`? — Corralien, Nov 07 '21 at 20:15
But after merging I would still have to compare against either of the two dataframes to filter out the non-changed rows — Maxcot, Nov 07 '21 at 20:19

Corralien · Accepted Answer · 2021-11-08T07:15:03.850

5

Use compare after merge your 2 dataframes on ID and Period:

out = pd.merge(df1, df2, on=['ID', 'Period'], how='outer',
               suffixes=('_df1', '_df2')).set_index(['ID', 'Period'])

out.columns = pd.MultiIndex.from_tuples(out.columns.str.split('_').map(tuple)) \
                           .swaplevel()

out = out['df1'].compare(out['df2'])

Output:

>>> out
                      Price         OtherID      
                       self   other    self other
ID       Period                                  
10100001 2019-10-01  8995.0    5.00     NaN   NaN
38730001 2019-11-01     NaN     NaN    EIRU  XXXX
22223422 2020-02-01     NaN  123.29     NaN  HHYG

# Summarize
>>> out.swaplevel(axis=1)['self'].isna().all(axis=1) \
       .replace({True: 'New', False: 'Changed'})
ID        Period    
10100001  2019-10-01    Changed
38730001  2019-11-01    Changed
22223422  2020-02-01        New
dtype: object

If you append keep_shape=True parameter:

>>> out['df1'].compare(out['df2'], keep_shape=True)
                      Price         OtherID      
                       self   other    self other
ID       Period                                  
10100001 2019-10-01  8995.0    5.00     NaN   NaN
38730001 2019-11-01     NaN     NaN    EIRU  XXXX
30100453 2019-12-01     NaN     NaN     NaN   NaN
92835543 2020-01-01     NaN     NaN     NaN   NaN
66453422 2020-02-01     NaN     NaN     NaN   NaN
22223422 2020-02-01     NaN  123.29     NaN  HHYG

edited Nov 08 '21 at 07:15

answered Nov 07 '21 at 20:23

Corralien

109,409
8
28
52

@Maxcot. Based on your comment, you can use `compare` after `merge`. – Corralien Nov 07 '21 at 20:27
Sorry, one other fact I did not mention, is that a unique row is defined by both ID and Period. Neither of them are unique in their own right. – Maxcot Nov 07 '21 at 20:36
I updated my answer. Do you really need your indicator (changed, new)? – Corralien Nov 07 '21 at 20:59
@Corralien- what does out['df1'] in the 3rd line refer to? Isn't the index meant to refer to a column name in the df ? – Arindam Paul Roy Jun 15 '23 at 06:08
1

@ArindamPaulRoy. Yes that's it. After `merge`, the column names are `Price_df1`, `Other ID_df1`, `Price_df2` and `Other ID_df2`. I split these names on "_" and I create a `MultiIndex`. So the column name are now `('df1', 'Price')`, `('df1', 'OtherID')`, `('df2', 'Price')` and `('df2', 'OtherID')`. So I can subset the dataframe using the first level `'df1'` and `'df2'` (column names and not variable names). Does this help you understand better? – Corralien Jun 15 '23 at 06:15
Lovely, thanks for that @Corralien! – Arindam Paul Roy Jun 15 '23 at 06:40

score 0 · Answer 2 · answered Nov 07 '21 at 20:34

I think you should try apply

Make a separate dataframe with desired headers

df_update = pd.DataFrame(updated_data,columns =['ID','PERIOD'])

And then apply this function

result = df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]

It worked for me and I generate the updated entries dataframe through this. Let me know if its your desired solution

Is there an efficient way to compare two dataframes of different sizes?

2 Answers2