
I have a pandas dataframe df like this, say

ID activity date
1  A        4
1  B        8
1  A        12
1  C        12
2  B        9
2  A        10
3  A        3
3  D        4

and I would like to return a table that counts the occurrences of each activity in a given list, say l = ['A', 'B'] in this case, so that

ID activity(count)_A  activity(count)_B
1  2                  1
2  1                  1
3  1                  0

is what I need.
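
For reference, the example frame can be rebuilt with the plain DataFrame constructor:

import pandas as pd

# same data as the table above
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3, 3],
                   'activity': ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'D'],
                   'date': [4, 8, 12, 12, 9, 10, 3, 4]})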

What is the quickest way to do this? Ideally without a for loop.

Thanks!

Edit: I know the pivot function can do this kind of job. But in my case I have many more activity types than the ones I actually need to count in the list l. Is it still optimal to use pivot?

yiyang

3 Answers


You can use isin with boolean indexing as a first step and then pivot. The fastest should be groupby + size + unstack, then pivot_table, and last crosstab, but it is best to test each solution with your real data:

df2 = (df[df['activity'].isin(['A','B'])]
         .groupby(['ID','activity'])
         .size()
         .unstack(fill_value=0)
         .add_prefix('activity(count)_')
         .reset_index()
         .rename_axis(None, axis=1))

print(df2)
   ID  activity(count)_A  activity(count)_B
0   1                  2                  1
1   2                  1                  1
2   3                  1                  0

Or:

df1 = df[df['activity'].isin(['A','B'])]

df2 = (pd.crosstab(df1['ID'], df1['activity'])
        .add_prefix('activity(count)_')
        .reset_index()
        .rename_axis(None, axis=1))

Or:

df2 = (df[df['activity'].isin(['A','B'])]
          .pivot_table(index='ID', columns='activity', aggfunc='size', fill_value=0)
          .add_prefix('activity(count)_')
          .reset_index()
          .rename_axis(None, axis=1))
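
All three give the same result here, so if performance matters you can time them on your own data. A rough sketch (df and pd as above; the numbers on the toy frame are only illustrative):

import timeit

filt = lambda: df[df['activity'].isin(['A','B'])]

# the three candidate solutions as callables, timed on the same filtered data
f_groupby = lambda: filt().groupby(['ID','activity']).size().unstack(fill_value=0)
f_pivot = lambda: filt().pivot_table(index='ID', columns='activity', aggfunc='size', fill_value=0)
f_crosstab = lambda: pd.crosstab(filt()['ID'], filt()['activity'])

for name, f in [('groupby', f_groupby), ('pivot_table', f_pivot), ('crosstab', f_crosstab)]:
    print(name, timeit.timeit(f, number=200))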
jezrael

I believe df.groupby('activity').size().reset_index(name='count') should do what you expect.
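
For the list l from the question, that would look like this once the rows are filtered with isin (the filter is an addition here; df as in the question):

l = ['A', 'B']
# overall count per activity, restricted to the activities in l
out = df[df['activity'].isin(l)].groupby('activity').size().reset_index(name='count')
print(out)

  activity  count
0        A      4
1        B      2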

IWHKYB

Just aggregate with Counter and use the pd.DataFrame default constructor:

from collections import Counter

# one Counter of activities per ID, then a list of dict-like rows
agg_ = df.groupby('ID')['activity'].agg(Counter).tolist()
ndf = pd.DataFrame(agg_)

    A   B   C   D
0   2   1.0 1.0 NaN
1   1   1.0 NaN NaN
2   1   NaN NaN 1.0

If you have l = ['A', 'B'], just filter:

ndf[l]

    A   B   
0   2   1.0 
1   1   1.0 
2   1   NaN
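
And if you also want zeros instead of NaN, the column names from the question and the ID column back, a possible follow-up (just a sketch; it relies on groupby having sorted the ID keys):

# fill missing counts, rename to the requested headers and restore ID
res = ndf[l].fillna(0).astype(int).add_prefix('activity(count)_')
res.insert(0, 'ID', sorted(df['ID'].unique()))  # group keys are sorted, so this lines up
print(res)

   ID  activity(count)_A  activity(count)_B
0   1                  2                  1
1   2                  1                  1
2   3                  1                  0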
rafaelc