0

Is there a way to compare 2 dataframes with multiple columns, and varies in length (1386 vs 1383 in the below example)? And only output the rows which have the differences?

Example data:

> df_left  



      index        location       date
0         0        Adelaide       2019-01-01
1         1        Adelaide       2019-02-01
2         2        Adelaide       2019-03-01
3         3        Adelaide       2019-04-01
4         4        Adelaide       2019-05-01
...     ...            ...        ...
1381   1381  Western London       2019-03-01
1382   1382  Western London       2019-04-01
1383   1383  Western London       2019-05-01
1384   1384  Western London       2019-06-01
1385   1385  Western London       2019-07-01

[1386 rows x 2 columns]


> df_right
                  location       date
0                 Adelaid        2019-01-01
1                 Adelaide       2019-02-01
2                 Adelaide       2019-03-01
3                 Adelaide       2019-04-01
4                 Adelaide       2019-05-01
...                   ...        ...
1378        Western London       2019-03-01
1379        Western London       2019-04-01
1380        Western London       2019-05-01
1381        Western London       2019-06-01
1382        Western London       2019-07-01

[1383 rows x 2 columns]

I tried this, but it does not yield the differences

pd.concat([df_left_with_index_columns,df_right_with_index_columns], ignore_index=True, axis=1, join="outer")                                                                                                       
                        0          1                2          3
0                Adelaide  2019-01-01        Adelaide  2019-01-01
1                Adelaide  2019-02-01        Adelaide  2019-02-01
2                Adelaide  2019-03-01        Adelaide  2019-03-01
3                Adelaide  2019-04-01        Adelaide  2019-04-01
4                Adelaide  2019-05-01        Adelaide  2019-05-01
...                   ...         ...             ...         ...
1381       Western London  2019-03-01  Western London  2019-06-01
1382       Western London  2019-04-01  Western London  2019-07-01
1383       Western London  2019-05-01             NaN         NaT
1384       Western London  2019-06-01             NaN         NaT
1385       Western London  2019-07-01             NaN         NaT

[1386 rows x 4 columns]
samxiao
  • 2,587
  • 5
  • 38
  • 59
  • Does [this answer](https://stackoverflow.com/questions/17095101/outputting-difference-in-two-pandas-dataframes-side-by-side-highlighting-the-d) help you? The first part shows how to get a boolean array of which rows are equal. – WolfgangK Sep 10 '19 at 07:19
  • @WolfgangK I tried, but got the error `ValueError: Can only compare identically-labeled DataFrame objects` – samxiao Sep 10 '19 at 07:38
  • what about concatenating the 2 dataframe and dropping dupplicates ? – Mayeul sgc Sep 10 '19 at 08:01
  • @Mayeulsgc That might work if the duplicates are on 2 columns combined. – samxiao Sep 10 '19 at 17:40
  • @Mayeulsgc I tried it and it works, except 1 corner case where you have longer Left or Right with duplicates rows, then this will drop them too. – samxiao Sep 10 '19 at 18:15
  • Yes but isn't it the objective ? or do you want to drop duplicates only in one of the 2 dataframes ? – Mayeul sgc Sep 11 '19 at 07:35
  • @Mayeulsgc Sorry, that's not the objective. We still want to report the duplicates if those don't exist in the other dataframe. – samxiao Sep 25 '19 at 07:22

0 Answers0