1

I have seen examples of Pandas' aggregate function, but they don't solve my problem: those examples either sum all the attributes, or sum only a few attributes so that the resulting df contains just those summed attributes plus the attributes used in the groupby. In my case, I don't want to use certain attributes for either the groupby or the sum, yet I still want to keep them in the resulting df.

I am trying to group and sum some attributes while preserving other attributes that are not summed, but I am facing the challenges described below.

[Screenshot: snippet of the data]

In my transaction dataset, Customer_ID is unique for each customer and entry time is unique for each transaction. Any customer will have multiple transactions over a period of time. Most transactions are repeated twice or more, depending on how many tags are associated with a transaction (usually 2 to 4 tags). I need to combine such multiple entries of each transaction into just 1 row, with 1 customer_ID, one gender, age, entry time, location, country and all the Tag attributes.

If I group by only customer_ID and entry time and sum the Tags, the resulting dataframe has the correct number of unique customers: 150K. But I lose the attributes gender, age, location, country, exit time, value 1 and value 2 in the resulting df.

result = df.groupby(["customer_ID", "entry time"])[
    ["Tag1", "Tag2", "Tag3", "Tag4", "Tag5", "Tag6", "Tag7", "Tag8"]
].sum().reset_index()

If I group by all the needed attributes and sum the Tags, I only get 90K unique customers, which is not correct.

result = df.groupby(["customer_ID", "entry time", "gender", "age", "location",
                     "country", "exit time", "value 1", "value 2"])[
    ["Tag1", "Tag2", "Tag3", "Tag4", "Tag5", "Tag6", "Tag7", "Tag8"]
].sum().reset_index()

[Screenshot: example of rows for 1 transaction]
[Screenshot: example of how I want 1 transaction to be]
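To make the shape concrete, here is a minimal made-up illustration (invented values, only two of the Tag columns shown) of one transaction before combining:

import pandas as pd

# Made-up miniature of the data: one transaction split across two rows,
# each row carrying a different Tag.
df_example = pd.DataFrame({
    "customer_ID": [101, 101],
    "entry time":  ["2019-08-01 10:00", "2019-08-01 10:00"],
    "gender":      ["F", "F"],
    "age":         [34, 34],
    "location":    ["NY", "NY"],
    "country":     ["US", "US"],
    "exit time":   ["2019-08-01 10:05", "2019-08-01 10:05"],
    "value 1":     [9.5, 9.5],
    "value 2":     [1.2, 1.2],
    "Tag1":        [1, 0],
    "Tag2":        [0, 1],
})
# Desired result: one row per (customer_ID, entry time) with Tag1 = 1 and
# Tag2 = 1, and gender, age, location, country, exit time, value 1 and
# value 2 kept unchanged.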

So how do I efficiently group by only customer_ID and entry time, sum all the Tag columns and still retain other attributes in the resulting df (df size is around 700 MB)?

sethds
  • I have edited the question to add more details on how the existing examples of the pandas aggregate function don't help with my problem – sethds Aug 21 '19 at 12:27
  • Can you check to see the number of unique values in each of the `["customer_ID", "entry time", "gender", "age", "location", "country", "exit time", "value 1", "value 2"]` columns by using `nunique()`? – Ted Aug 21 '19 at 13:05
  • Here are the number of unique values in each of those attributes: Customer_ID = 142973, entry time = 1032769, gender = 2, age = 5, location = 23, country = 6, exit time = 1032615, value 1 = 1589, value 2 = 252 – sethds Aug 21 '19 at 13:31
  • That's actually interesting that adding grouping fields lowers row counts! Are you sure this occurs? [Reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) (instead of screenshots) would be great to have. – Parfait Aug 21 '19 at 13:53

2 Answers

1

Ok, if I understand the question correctly, then I think this may work:

import pandas as pd

tag_cols = ["Tag1", "Tag2", "Tag3", "Tag4", "Tag5", "Tag6", "Tag7", "Tag8"]
join_cols = ["customer_ID", "entry time"]

# Sum the tags per transaction, then join the summed tags back onto the original rows.
df1 = df.groupby(join_cols)[tag_cols].sum().reset_index()
df2 = pd.merge(df1, df, on=join_cols + tag_cols, how="left")

Then df2 should have what you need.
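A quick sanity check on the result (a sketch, assuming df, df1 and df2 from above):

# The merge is keyed on the join and tag columns, so df2 should keep one row
# per (customer_ID, entry time) pair and no tag sums should go missing.
print(len(df1), len(df2))                                # equal counts mean the merge stayed one-to-one
print(df2[tag_cols].isna().sum().sum())                  # 0 means no tag values were lost
print(df2[["gender", "age", "location"]].isna().sum())   # non-zero here signals unmatched rows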

Ted
  • I wonder about the OP's claim of lower record count with adding `groupby` columns (150K to 90K). Even if *age*, *location*, etc. changed just **once** within each *customer_ID* and *entry time*, rows should increase. So, if counts actually increase then a left join `merge` will duplicate rows to the level of the larger table. – Parfait Aug 21 '19 at 14:05
  • @Parfait Thanks, I notice now that the OP also had a repeat `Tag4`. I understood the original question as being that duplicating rows in the event of age, location, etc changing was what was actually desired. That assumption may be wrong. – Ted Aug 21 '19 at 14:11
  • Hi, to clarify, the Tag4 repetition in my code was a typo, and fixing that didn't change the problem at hand. Also, for each unique customer, age, location etc. are unique, so these values won't change between that customer's transactions, but they are needed on those rows. I'm trying the 1st solution given here and it's still merging df1 and df2 after 20 minutes. Will update results later. – sethds Aug 21 '19 at 16:10
  • This solution resulted in the correct sums in the tag_cols, and the customer_ID and entry time values were also fine, but all the other columns like age, gender, location, value 1, value 2 etc. were filled with NaNs for every row in the df. I haven't been able to figure out why that happened though. – sethds Aug 21 '19 at 17:58
  • @sethds Interesting. In that case, I acquiesce in Parfait's more elegant solution. – Ted Aug 22 '19 at 07:09
1

Technically, you are attempting to aggregate on unique customer_ID and entry time (not unique customers). In order to maintain the other attributes, some aggregation decision has to be made about which values to retain. Consider extending the groupby().aggregate call to retrieve the first, last, min, or max value.

agg_df = (df.groupby(['customer_ID', 'entry time'], as_index=False)
            .aggregate({'gender':'first', 'age':'first', 
                        'location':'first', 'country':'first', 
                        'exit time':'first', 'value 1':'first', 'value 2':'first',
                        'Tag1':'sum', 'Tag2':'sum', 'Tag3':'sum', 'Tag4':'sum', 
                        'Tag5':'sum', 'Tag6':'sum', 'Tag7':'sum', 'Tag8':'sum'})
         )
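If spelling out every column becomes unwieldy, the same mapping can be built programmatically (a sketch, assuming the tag columns follow the Tag<n> naming from the question):

# Sum every Tag column; keep the first observed value of everything else.
group_cols = ['customer_ID', 'entry time']
tag_cols = [c for c in df.columns if c.startswith('Tag')]
other_cols = [c for c in df.columns if c not in tag_cols + group_cols]

agg_map = {**{c: 'first' for c in other_cols}, **{c: 'sum' for c in tag_cols}}
agg_df = df.groupby(group_cols, as_index=False).aggregate(agg_map)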
Parfait