I have seen examples of Pandas' aggregate
function, but those don't solve my problem. Because the examples of aggregate function either sum all the attributes or sum only few attributes and resulting df
only has these summed attributes or the attributes used in groupby
. In my case, I don't want to use certain attributes for either group by or sum and still keep them in the resulting df
.
I am trying to group and sum some attributes, while preserving other attributes which are not summed but facing challenges as described below.
In my transaction dataset, Customer_ID
are unique for each customer and entry time
is unique for each transaction. Any customer will have multiple transactions during a period of time. Most transactions are repeated twice or more often depending on how many tags are associated with a transaction (but usually 2 to 4 tags). I need to combine such multiple entries of each transaction to only 1 row, with 1 customer_ID
, one gender
, age
, entry time
, location
, country
and all the Tag attributes.
If I group by only customer_ID
, entry time
and sum the Tags, the resulting dataframe has the correct number of unique customers: 150K. But I lose the attributes gender
, age
, location
, country
, exit time
, value 1
, value 2
in the resulting df
.
result = df.groupby(["customer_ID","entry time"])["Tag1", "Tag2","Tag3","Tag4","Tag5","Tag6","Tag7","Tag8"].sum().reset_index()
If I group by all the needed attributes and sum the Tags, I only get 90K unique customers, which is not correct.
result = df.groupby(["customer_ID", "entry time", "gender", "age","location", "country", "exit time", "value 1", "value 2"
])["Tag1","Tag2","Tag3","Tag4","Tag5","Tag6","Tag7","Tag8"].sum().reset_index()
So how do I efficiently group by only customer_ID
and entry time
, sum all the Tag
columns and still retain other attributes in the resulting df
(df size is around 700 MB)?