group by two columns and use third column as value without using pivot_table

Question

I have the following dataframe.


     user  movie rating                                                                 
0      1   1     3

1      1   2     4

2      2   1     2

3      2   2     5

4      3   1     3

My desired output is


movie  1   2 
user                                                                   
1      3   4 

2      2   5  

3      3   0

If a user has not rated a movie, I need to have '0' in the corresponding output column, otherwise, the rating value should be present.

Note: I was able to achieve this with pivot_table, but the catch is my dataset contains more than 100000 columns because of which I am getting "Unstacked DataFrame is too big, causing int32 overflow". I am trying groupby as an alternative to bypass this error.

I am trying the following, but it doesn't include the values from 'value' column of my dataframe.

df.groupby(['user', 'movie']).size().unstack('movie', fill_value=0)

there is no `3` in your movies but this is a simple `pivot/crosstab` with a `fillna` `pd.crosstab(df.user,df.movie,values=df['rating'],aggfunc='sum').fillna(0)` — Umar.H, Jul 28 '19 at 13:32

Ayoub ZAROU · Answer 1 · 2019-07-28T13:55:21.570

1

try using crosstab:

pd.crosstab(df.user, df.movie, values = df.rating, aggfunc = 'first').fillna(0)
# movie    1    2
# user           
# 1      3.0  4.0
# 2      2.0  5.0
# 3      3.0  0.0

to get interger values, just use .astype(int), as follows :

pd.crosstab(df.user, df.movie, values = df.rating, aggfunc = 'first').fillna(0).astype(int)
# movie  1  2 
# user          
# 1      3  4
# 2      2  5
# 3      3  0

edited Jul 28 '19 at 13:55

answered Jul 28 '19 at 13:36

Ayoub ZAROU

2,387
6
20

This works on the sample dataframe I had provided in the question. But, I am still getting "Unstacked DataFrame is too big, causing int32 overflow" error as my input dataframe has 283228 rows (exact number) – likes2code Jul 29 '19 at 12:10

score 0 · Answer 2 · answered Jul 28 '19 at 13:31

0

I'm not sure why would you expect movie 3 since it doesn't exist in the original data sample but other than that this will work for you: movie_ratings.set_index(['user', 'movie']).unstack('movie', fill_value=0)

answered Jul 28 '19 at 13:31

Alex Fish

768
6
18

This works on the sample dataframe I had provided in the question. But, I am still getting "Unstacked DataFrame is too big, causing int32 overflow" error as my input dataframe has 283228 rows (exact number) – likes2code Jul 29 '19 at 12:10
This worked for me with 300,000 columns. – Alex Fish Jul 30 '19 at 02:00

group by two columns and use third column as value without using pivot_table

2 Answers2