With this simplified dataframe :
import pandas as pd
import numpy as np
data= {"order_id": [55,55,56,107,107,107],"id_product":[100000158,100000339,100000431,100000148,100000269,100000269],"qty":[2,1,5,1,1,1]}
df = pd.DataFrame(data)
+----+----------+------------+-----+
| | order_id | id_product | qty |
+----+----------+------------+-----+
| 0 | 55 | 100000158 | 2 |
| 1 | 55 | 100000269 | 1 |
| 2 | 56 | 100000431 | 5 |
| 3 | 107 | 100000148 | 1 |
| 4 | 107 | 100000269 | 1 |
| 5 | 107 | 100000269 | 1 |
+----+----------+------------+-----+
I'd like to remove duplicates on (order_id,id_product) so to keep only one id_product by order_id AND SUM() the quantity (qty). In a SQL-like query it would be done with :
SELECT order_id,id_product,SUM(qty) FROM table_x GROUP BY order_id,id_product;
To have as a final result :
+----+----------+------------+-----+
| | order_id | id_product | qty |
+----+----------+------------+-----+
| 0 | 55 | 100000158 | 2 |
| 1 | 55 | 100000269 | 1 |
| 2 | 56 | 100000431 | 5 |
| 3 | 107 | 100000148 | 1 |
| 4 | 107 | 100000269 | 2 |
+----+----------+------------+-----+
Full dataframe :
import pandas as pd
import numpy as np
data = {"order_id":[55,55,56,107,107,107],"id_product":[100000158,100000339,100000431,100000148,100000269,100000269],"qty":[2,1,5,1,1,1],"name":["john","john","bob","smith","smith","smith"],"zipcode":["75001","75001","75002","75003","75003","75003"],"bat":["A","A","B","C","C","C"]}
df = pd.DataFrame(data)