How can I use pandas to compute frequency counts for each user in each category? I would like to do this so I can pivot the result into a utility matrix.

|   | **author** | **category** |
|---|---|---|
| 0 | A | movies |
| 1 | B | games |
| 2 | C | pics |
| 4 | A | movies |
| 5 | C | movies |
| 6 | B | games |

Desired output:

| **author** | **category** | **category count** |
|---|---|---|
| A | movies | 2 |
| B | games | 2 |
| C | movies | 1 |
| C | pics | 1 |

Fabio Lamanna

1 Answer


You can use groupby with size to count the rows for each combination of the author and category columns. The output is a Series with a MultiIndex:

print (df.groupby(['author','category']).size())
author  category
A       movies      2
B       games       2
C       movies      1
        pics        1
dtype: int64
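
For readers who want to reproduce this, here is a minimal sketch. The DataFrame is rebuilt from the table in the question (the variable name `sizes` is only an illustrative choice), and it shows that the grouped result is a Series whose entries are addressed by an (author, category) pair:

```python
import pandas as pd

# Sample data reconstructed from the question; index label 3 is skipped there too.
df = pd.DataFrame(
    {
        "author": ["A", "B", "C", "A", "C", "B"],
        "category": ["movies", "games", "pics", "movies", "movies", "games"],
    },
    index=[0, 1, 2, 4, 5, 6],
)

# size() counts the rows in each (author, category) group.
sizes = df.groupby(["author", "category"]).size()

# The result is a Series with a two-level index.
print(type(sizes).__name__)      # Series
print(list(sizes.index.names))   # ['author', 'category']

# A single count is looked up with an (author, category) tuple.
print(sizes.loc[("A", "movies")])  # 2
print(sizes.loc[("C", "pics")])    # 1
```

Combinations that never occur in the data, such as ("A", "games"), are simply absent from this Series.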

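Then add reset_index to turn the MultiIndex levels into columns; the name parameter sets the name of the column that holds the counts. The output is a DataFrame (stored under a new name here so the original df stays available for the snippets further down):

counts = df.groupby(['author','category']).size().reset_index(name='category count')
print (counts)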
  author category  category count
0      A   movies               2
1      B    games               2
2      C   movies               1
3      C     pics               1
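
To make the role of reset_index more concrete, the sketch below runs the two steps separately on the sample data from the question. The intermediate names `sizes` and `counts` are assumptions made for illustration only:

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame(
    {
        "author": ["A", "B", "C", "A", "C", "B"],
        "category": ["movies", "games", "pics", "movies", "movies", "games"],
    },
    index=[0, 1, 2, 4, 5, 6],
)

# Step 1: a Series of group sizes, indexed by (author, category).
sizes = df.groupby(["author", "category"]).size()

# Step 2: reset_index moves both index levels into ordinary columns and
# replaces the index with a default 0..n-1 range; name= labels the column
# that holds the Series values (the counts).
counts = sizes.reset_index(name="category count")

print(list(counts.columns))  # ['author', 'category', 'category count']
print(counts.shape)          # (4, 3)
```

Without name=, the counts column of a Series produced by size() would be labeled 0, which is why the answer passes a name explicitly.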

If you need a crosstab (the utility matrix), there are several solutions:

# add unstack to reshape the counts into a wide table
df1 = df.groupby(['author','category']).size().unstack(fill_value=0)
print (df1)
category  games  movies  pics
author                       
A             0       2     0
B             2       0     0
C             0       1     1

df1 = pd.crosstab(df['author'],df['category'])
print (df1)
category  games  movies  pics
author                       
A             0       2     0
B             2       0     0
C             0       1     1

df1 = df.pivot_table(index='author',columns='category', aggfunc='size', fill_value=0)
print (df1)
category  games  movies  pics
author                       
A             0       2     0
B             2       0     0
C             0       1     1
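
As a quick sanity check, the following sketch builds the matrix all three ways on the sample data from the question and compares the results. The variable names are hypothetical, and the comparison is done on labels and values rather than with DataFrame.equals, since dtypes may differ between pandas versions:

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame(
    {
        "author": ["A", "B", "C", "A", "C", "B"],
        "category": ["movies", "games", "pics", "movies", "movies", "games"],
    },
    index=[0, 1, 2, 4, 5, 6],
)

via_unstack = df.groupby(["author", "category"]).size().unstack(fill_value=0)
via_crosstab = pd.crosstab(df["author"], df["category"])
via_pivot = df.pivot_table(
    index="author", columns="category", aggfunc="size", fill_value=0
)


def same_table(left, right):
    """Compare row labels, column labels, and cell values of two tables."""
    return (
        list(left.index) == list(right.index)
        and list(left.columns) == list(right.columns)
        and (left.to_numpy() == right.to_numpy()).all()
    )


print(same_table(via_unstack, via_crosstab))  # True
print(same_table(via_unstack, via_pivot))     # True

# A single cell of the utility matrix: how often author C posted in "movies".
print(via_crosstab.loc["C", "movies"])        # 1
```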

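EDIT:

For more on why size is used here, see the question "What is the difference between size and count in pandas?". In short, size counts all rows in each group, including rows with NaN values, while count counts only the non-NaN values.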
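
The distinction between size and count only shows up when values are missing. Here is a small sketch using hypothetical data (the `rating` column and its values are not part of the question):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing rating; not taken from the question.
ratings = pd.DataFrame(
    {
        "author": ["A", "A", "B"],
        "rating": [5.0, np.nan, 3.0],
    }
)

grouped = ratings.groupby("author")["rating"]

# size() counts every row in the group, including the NaN rating.
print(grouped.size().to_dict())   # {'A': 2, 'B': 1}

# count() counts only the non-NaN ratings.
print(grouped.count().to_dict())  # {'A': 1, 'B': 1}
```

In the question's data there are no missing values, so both would give the same numbers; size is still the natural choice because it returns one count per group rather than one per remaining column.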

Graham
jezrael
  • Awesome, thanks for a working solution. You even went the extra mile to show me the code for the utility matrix. If you don't mind, could you explain why using size/reset_index does what it does? – Vince Kumar Mar 27 '17 at 07:29
  • Sure, give me a sec. – jezrael Mar 27 '17 at 07:30
  • I tried to add some explanation; [10min to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html) and the [cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html) may also help. If something is unclear, I will try to explain more. – jezrael Mar 27 '17 at 07:35
  • Thank you! [size](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.size.html) has no description in the documentation, so I was pretty confused, but it makes sense. Although I think that is an oddly named method. – Vince Kumar Mar 27 '17 at 07:39
  • Yes, there is also a count function, but it is a bit different. See the last edit, where I added a link to a better explanation. – jezrael Mar 27 '17 at 07:41