How to subtract two dataframes with duplicate first column?

Question

So, I have the following two dataframes and my ideal output is to get open_orders reduced by cancel_orders so I know how many open_orders I have.

Desired Output:

df_total_orders
           order_id business_symbol  open_orders
0      a1b2c3111111              AA          0.0
1      4kl3l2242244             AAA          0.0
2      21312a224144             BBB          0.0
3      124f32rv2323             CCC          0.0
4      2412dfe21edf             DDD          0.0
5555

What I have now:

df_add_orders (this dataframe contains duplicates in the order id column)
           order_id business_symbol  open_orders
0      a1b2c3111111              AA        100.0
1      4kl3l2242244             AAA       1000.0
2      21312a224144             BBB       1000.0
3      124f32rv2323             CCC       1000.0
4      2412dfe21edf             DDD       1000.0
10000
df_cancel_orders (this dataframe contains duplicates in the order id column)
           order_id   cancel_orders
0      a1b2c3111111           100.0
1      4kl3l2242244          1000.0
2      21312a224144          1000.0
3      124f32rv2323          1000.0
4      2412dfe21edf          1000.0
4000

I am using the following group by to get the total open orders, but it's removing my symbols with the output.

df_add_orders_group = df_add_orders.groupby(['order_id'], as_index=False)['open_orders'].sum()

df_add_orders_group
           order_id        open_orders
0      a1b2c3111111              110.0
1      4kl3l2242244             1200.0
2      21312a224144             1500.0
3      124f32rv2323             1500.0
4      2412dfe21edf             1500.0
5000

I am then subtracting closed orders

df_cancel_orders_group = df_cancel_orders.groupby(['order_id'], as_index=False)['cancel_orders'].sum()

but I want to keep the symbols so I can compare the total open orders and I need some way to merge the result of the subtraction with the main df_add_orders and also clean up order_id to sum duplicates.

How do you get that output when your `df_add_orders` and `df_cancel_orders` are exactly the same? Shouldn't your output have all zeroes in `open_orders`? — not_speshal, Aug 21 '21 at 19:47
It's an example, but yes in this case it would be zeroes. I updated it. — Hold The Door, Aug 21 '21 at 19:58
Does this answer your question? ["Anti-merge" in pandas (Python)](https://stackoverflow.com/questions/38242368/anti-merge-in-pandas-python) — ti7, Aug 21 '21 at 20:01

Acccumulation · Answer 1 · 2021-08-22T18:14:03.273

df_total_orders =df_add_orders.merge(df_cancel_orders, 
    how = 'left', 
    on = 'order_id)

will get you a dataframe with the data from the two original dataframes. You can then do

df_total_orders['open_orders'] = 
    df_total_orders['open_orders']-
    df_total_orders['cancel_orders']

and then drop the cancel_order column.

Another tactic would be to append the two original dataframes together. You can then do groupby over id and sum. This will create a dataframe that has, for each id, one column that is the sum of all open orders for that id, and another column that is the sum of all cancel orders for that id. You can then take the difference between the two columns.

When you append the two dataframes, you'll get a bunch of nulls; the portion from the open orders will have nulls in the cancel order column, and vice versa. I think summing over the groupby will just treat the nulls as zeros, but you might need to explicitly tell it to do so.

OP has specified that `df_add_orders` and `df_cancel_orders` have duplicate `order_id` so `merge` will clearly not work. — not_speshal, Aug 22 '21 at 15:05

not_speshal · Answer 2 · 2021-08-22T15:09:26.697

Try with pandas.merge and groupby:

merged = pd.merge(df_add_orders.groupby("order_id", as_index=False) \
                               .agg({"business_symbol": "first", 
                                     "open_orders": "sum"}), 
                 df_cancel_orders.groupby("order_id", as_index=False).sum(), 
                 on="order_id")

output = merged.assign(open_orders=merged["open_orders"]-merged["cancel_orders"]) \
               .drop("cancel_orders", axis=1)

>>> output
       order_id business_symbol  open_orders
0  124f32rv2323             CCC          0.0
1  21312a224144             BBB          0.0
2  2412dfe21edf             DDD          0.0
3  4kl3l2242244             AAA          0.0
4  a1b2c3111111              AA          0.0

How to subtract two dataframes with duplicate first column?

2 Answers2