1

I have this dataset.

user    Month   item
 A       Jan     X
 A       Jan     Y
 A       Feb     X
 B       Jan     Z
 B       Feb     X
 A       March   Z

I require a result like the following:

user   month Itemset  CumItemset   DistinctCount    CumDistinctCount
 A      Jan    X,Y       X,Y            2                 2
 A      Feb    X         X,Y            1                 2
 A      March  Z         X,Y,Z          1                 3
 B      Jan    Z         Z              1                 1
 B      Feb    X         Z,X            1                 2

I tried the code here but I want the cumulative count to restart with each new user.

Any ideas?

azuber
  • 389
  • 4
  • 12

1 Answers1

3

Absolutely No Claims For Speediness

df = df.sort_values('user')

g1 = df.groupby(['user', 'Month'], sort=False).item.apply(list)

g2 = g1.groupby('user').apply(lambda x: x.cumsum()).apply(pd.unique)

pd.concat(
    [
        g1.apply(','.join), g2.apply(','.join),
        g1.str.len(), g2.str.len()
    ], axis=1, keys='Itemset CumItemset DistinctCount CumDistinctCount'.split()
).reset_index()

  user  Month Itemset CumItemset  DistinctCount  CumDistinctCount
0    A    Jan     X,Y        X,Y              2                 2
1    A    Feb       X        X,Y              1                 2
2    A  March       Z      X,Y,Z              1                 3
3    B    Jan       Z          Z              1                 1
4    B    Feb       X        Z,X              1                 2
piRSquared
  • 285,575
  • 57
  • 475
  • 624