0

With this simplified dataframe :

import pandas as pd
import numpy as np
data= {"order_id": [55,55,56,107,107,107],"id_product":[100000158,100000339,100000431,100000148,100000269,100000269],"qty":[2,1,5,1,1,1]}
df = pd.DataFrame(data)

+----+----------+------------+-----+
|    | order_id | id_product | qty |
+----+----------+------------+-----+
|  0 |       55 |  100000158 |   2 |
|  1 |       55 |  100000269 |   1 |
|  2 |       56 |  100000431 |   5 |
|  3 |      107 |  100000148 |   1 |
|  4 |      107 |  100000269 |   1 |
|  5 |      107 |  100000269 |   1 |
+----+----------+------------+-----+

I'd like to remove duplicates on (order_id,id_product) so to keep only one id_product by order_id AND SUM() the quantity (qty). In a SQL-like query it would be done with :

SELECT order_id,id_product,SUM(qty) FROM table_x  GROUP BY order_id,id_product;

To have as a final result :

+----+----------+------------+-----+
|    | order_id | id_product | qty |
+----+----------+------------+-----+
|  0 |       55 |  100000158 |   2 |
|  1 |       55 |  100000269 |   1 |
|  2 |       56 |  100000431 |   5 |
|  3 |      107 |  100000148 |   1 |
|  4 |      107 |  100000269 |   2 |
+----+----------+------------+-----+

Full dataframe :

import pandas as pd
import numpy as np
data = {"order_id":[55,55,56,107,107,107],"id_product":[100000158,100000339,100000431,100000148,100000269,100000269],"qty":[2,1,5,1,1,1],"name":["john","john","bob","smith","smith","smith"],"zipcode":["75001","75001","75002","75003","75003","75003"],"bat":["A","A","B","C","C","C"]}
df = pd.DataFrame(data)
IndiaSke
  • 348
  • 1
  • 2
  • 10

1 Answers1

1

I think this will get you what you need:

new_df = (df.groupby(['order_id','id_product'])['qty'].sum()).reset_index()

it prints:

   order_id  id_product  qty
0        55   100000158    2
1        55   100000339    1
2        56   100000431    5
3       107   100000148    1
4       107   100000269    2
sophocles
  • 13,593
  • 3
  • 14
  • 33
  • 1
    Perfect. How do I handle other columns though ? I mean to not losing them – IndiaSke Nov 24 '20 at 17:46
  • Not the best way, but merge them back in based on order_id and product_id would be 1 solution? If you provide me with a complete dataframe I might be able to give you the full answer. – sophocles Nov 24 '20 at 17:49