
Forgive me if this is somewhat simple, but I can't find any answers elsewhere.

I have a largish DataFrame of raw transactional data that I am breaking down to do aggregations by user.

My df looks like:

transid, userid, catname, amount, date, origin
1, id1, cat1, 300, 01/03/2014, cc
2, id1, cat1, 200, 02/03/2014, cc
3, id1, cat2, 250, 01/03/2014, ca
4, id2, cat1, 130, 01/03/2014, cc
5, id2, cat2, 150, 01/03/2014, cc

Circa 1 million transaction records across 115k users.

I am creating a wide table of aggregated metrics, with date and userid as the index.

date, userid, countoftrans, sumtrans
01/03/2014, id1, 3, 750

Now, I am happily doing this, but when I get to the category slicing it becomes very manual.

I am creating Series and then merging them back into a DataFrame, but I am doing named slicing, which is (a) manual and (b) slow:

# take a slice from the df
cat1 = df[df.catname == "cat1"]
# group the slice by user
cat1out = cat1.groupby(cat1.userid)
# then do the calc on the grouped series
cat1count = cat1out.transid.count()

Edit: The result set needs to be one line per userid with a wide set of columns, e.g.

Output type: DataFrame

So while the groupby aggregate does indeed organize my data logically, it isn't quite in the shape I want (I just don't understand GroupBy objects all that well).

I have 12 parent categories, each with between 3 and 5 sub-categories, and I will have to do this across all of them, so I would like a succinct way to do it.

I have tried a for loop over the distinct cat names, inserting each result into a DataFrame. That didn't really work, but is there a more pandas-idiomatic way?
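The loop was roughly along these lines (a sketch, not my exact code; it assumes pandas is imported as pd, and the pieces dict and column-naming scheme are just illustrative):

# build one count column and one sum column per category
pieces = {}
for cat in df.catname.unique():
    sliced = df[df.catname == cat]
    grouped = sliced.groupby('userid')
    pieces[cat + '_count'] = grouped['transid'].count()
    pieces[cat + '_sum'] = grouped['amount'].sum()
# the series align on userid when assembled into a frame
wide = pd.DataFrame(pieces)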

Thanks

Rob


1 Answer


If I understand your desired output correctly, this should do the trick. (Not sure why you are slicing by category first, but you can always add it to the groupby list.)

df.groupby(['date','userid'])['amount'].agg([sum,len])
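On the five sample rows in your question, that produces one row per (date, userid) pair, roughly:

                   sum  len
date       userid
01/03/2014 id1     550    2
           id2     280    2
02/03/2014 id1     200    1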

EDIT:

After seeing your comment and your edits to the question, I understand that what you want is one row per user id, and for each transaction category, several columns with summary statistics. So the row for user101 will have number of cat1 transactions, sum of cat1 transactions, number of cat2 transactions, sum of cat2 transactions, etc. Right?

In that case, here's what you want:

grouped = df.groupby(['userid','catname'])
aggregated = grouped['amount'].agg([len,sum])
result = aggregated.unstack('catname')
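As a self-contained sketch using the sample rows from your question (date and origin are omitted, since they don't affect this grouping):

import pandas as pd

# sample transactions from the question
df = pd.DataFrame({
    'transid': [1, 2, 3, 4, 5],
    'userid':  ['id1', 'id1', 'id1', 'id2', 'id2'],
    'catname': ['cat1', 'cat1', 'cat2', 'cat1', 'cat2'],
    'amount':  [300, 200, 250, 130, 150],
})

grouped = df.groupby(['userid', 'catname'])
aggregated = grouped['amount'].agg([len, sum])
result = aggregated.unstack('catname')

# result is one row per userid, with a (stat, catname) column hierarchy:
#          len       sum
# catname cat1 cat2 cat1 cat2
# userid
# id1        2    1  500  250
# id2        1    1  130  150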

This will give you the dataframe you want, albeit with a hierarchical column index. If you don't like it, this answer should set you on the way to collapsing that index.
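For example, one common way to collapse the two column levels into flat names (the naming scheme here is just illustrative):

# flatten ('len', 'cat1') -> 'cat1_len', etc.
result.columns = ['{}_{}'.format(cat, stat) for stat, cat in result.columns]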

  • Thanks, that helps me with the basics of simplifying the aggregation process perfectly but then gets me into a somewhat complex situation with my groupby object. I guess I am creating a sparse matrix at the end. – RHSMan Oct 10 '14 at 15:43
  • If you edit your question to explain why this is different from what you want, I can update my answer! – exp1orer Oct 10 '14 at 16:45
  • I am just struggling with the concept of 'is it this easy?!'. I think so!! Thanks – RHSMan Oct 10 '14 at 17:23
  • I have thought about it and now believe it doesn't solve my problem (although it's very useful); I've updated my original question. I need a single-line df with a wide set of columns (hence why I was slicing up by category). It may just be a case of taking that grouped-by object and mapping it to a DF, but it still feels like extra 'effort'. – RHSMan Oct 10 '14 at 18:20