0

I have the following Dataframes

df1=
   col1 col2 col3 value
   A     B    AB   5
   B     C    AB   4
   C     D    AB   3
   D     E    AB   12

df2=
   col1 col2 col3 value
   A     B    AB   3
   C     D    AB   1
   Z     E    BC   2

I need a new df3 which has rows which are present in df1, but not in df2. Similarily I also need a df4, which has rows which are in df2, but not in df1.

df3= //Rows which in Df1 but not in df2

   col1 col2 col3 value
    B     C    AB   5
    D     E    AB   12

df4= //Rows which are in df2 but not in df1

   col1 col2 col3 value
    Z     E    BC   2

Please note that we can ignore the column 'value' , but col1 col2 col3 should match to be considered a match.

If I do a merge with indicator=true

mdf1=pd.merge(df1, df2, how='outer', indicator=True)
mdf1=
    col1 col2 col3  value      _merge
0    A    B   AB      5   left_only
1    B    C   AB      4   left_only
2    C    D   AB      3   left_only
3    D    E   AB     12   left_only

4    A    B   AB      3  right_only
5    C    D   AB      1  right_only
6    Z    E   BC      2  right_only

You see that in mdf1 row number 0 and 4, 2 and 5 are same but got mismatched because the value isnt same. I want to ignore the value column when merging but I want the column in result Data frame.

Here is what i need the mdf to be.

   col1 col2 col3  value      _merge

0    B    C   AB      4   left_only

1    D    E   AB     12   left_only

2    Z    E   BC      2  right_only
Bounty Collector
  • 615
  • 7
  • 19

1 Answers1

1

In your solution is necessary remove value column:

mdf1=pd.merge(df1.drop('value', 1), df2, how='outer', indicator=True)
mdf1 = mdf1[mdf1['_merge'] == 'right_only']
print (mdf1)
  col1 col2 col3  value      _merge
4    Z    E   BC    2.0  right_only

mdf2=pd.merge(df1, df2.drop('value', 1), how='outer', indicator=True)
mdf2 = mdf2[mdf2['_merge'] == 'left_only']
print (mdf2)
  col1 col2 col3  value     _merge
1    B    C   AB    4.0  left_only
3    D    E   AB   12.0  left_only
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    This is what i need, i am curious is to why index or mdf1/mdf2 is 4and 1,3 respectivly? should it not be starting from 0? if I were to dump it in csv, I would have to reset index? – Bounty Collector Apr 21 '20 at 14:05
  • @BountyCollector - Because there are originaly more rows in `mdf1`, `mdf2`. You can reset_index for default index like `mdf1 = mdf1[mdf1['_merge'] == 'right_only'].reset_index(drop=True)` – jezrael Apr 21 '20 at 14:07