1

I have a Pandas df which looks like this:

|   | yyyy_mm_dd | id | product    | status | is_50 | cnt |
|---|------------|----|------------|--------|-------|-----|
|   | 2002-12-15 | 7  | prod_rs    | 2      | 0     | 8   |
|   | 2002-12-15 | 16 | prod_go    | 2      | 0     | 1   |
|   | 2002-12-15 | 16 | prod_mb    | 2      | 0     | 3   |
|   | 2002-12-15 | 29 | prod_er    | 2      | 0     | 2   |
|   | 2002-12-15 | 29 | prod_lm    | 2      | 0     | 2   |
|   | 2002-12-15 | 29 | prod_ops   | 2      | 0     | 2   |

I also have a second dataframe which is similar:

|   | id | product    | cnt |
|---|----|------------|-----|
|   | 7  | prod_rs    | 8   |
|   | 16 | prod_go    | 1   |
|   | 16 | prod_mb    | 3   |
|   | 29 | prod_er    | 2   |
|   | 29 | prod_lm    | 2   |
|   | 29 | prod_ops   | 6   |

How can I create a third dataframe which will only store the rows which do not have an equal count? Based on the above, only the last row would be returned as the cnt for the id / product combination differs. Example output:

|   | id | product | cnt_df1 | cnt_df2 |
|---|----|---------|---------|---------|
|   | 29 | prod_ops| 2       | 6       |

The second df is one row larger in size so not all id / product combinations may be present in both dataframes.

I've been looking at merge but I'm unsure how to use when the cnt columns are not equal.

stackq
  • 491
  • 2
  • 15
  • the usual way is to do a full outer join and use the `query` method `pd.merge(df1, df2, on=["id", "product"], how="outer", indicator=True, suffixes=["_a", "_b"] ).query("cnt_a != cnt_b")` – Umar.H Dec 16 '20 at 13:04

2 Answers2

3

You would still use merge and just check whether the count columns are different in a second step

In [40]: df = pd.merge(df1.drop(["yyyy_mm_dd", "", "status", "is_50"], axis=1), df2, on=['id', 'product'], suffixes=['_df1', '_df2'])                                                                              

In [41]: df                                                                                                                                                                                                        
Out[41]: 
     id       product cnt_df1      cnt_df2
0   7     prod_rs        8            8   
1   16    prod_go        1            1   
2   16    prod_mb        3            3   
3   29    prod_er        2            2   
4   29    prod_lm        2            2   
5   29    prod_ops       2            6   

Now you can simply filter out all rows with the same cnt e.g. with query()

In [42]: df.query("cnt_df1 != cnt_df2")                                                                                                                                                                            
Out[42]: 
     id       product cnt_df1      cnt_df2
5   29    prod_ops       2            6  
maow
  • 2,712
  • 1
  • 11
  • 25
0

You can acheive this in two steps like so:

# Merge the DataFrames
df3 = df1.merge(df2, on=["id", "product"])
# Filter for where `cnt` are not equal
df3 = df3[df3["cnt_x"].ne(df3["cnt_y"])]

#    yyyy_mm_dd  id   product  status  is_50  cnt_x  cnt_y
# 5  2002-12-15  29  prod_ops       2      0      2      6

You can use the suffixes parameter on merge if you don't want the to use the default _x and _y.

Alex
  • 6,610
  • 3
  • 20
  • 38