1

I have the following dataframe.


     user  movie rating                                                                 
0      1   1     3

1      1   2     4

2      2   1     2

3      2   2     5

4      3   1     3

My desired output is


movie  1   2 
user                                                                   
1      3   4 

2      2   5  

3      3   0  

If a user has not rated a movie, I need to have '0' in the corresponding output column, otherwise, the rating value should be present.

Note: I was able to achieve this with pivot_table, but the catch is my dataset contains more than 100000 columns because of which I am getting "Unstacked DataFrame is too big, causing int32 overflow". I am trying groupby as an alternative to bypass this error.

I am trying the following, but it doesn't include the values from 'value' column of my dataframe.

df.groupby(['user', 'movie']).size().unstack('movie', fill_value=0)
likes2code
  • 11
  • 3
  • there is no `3` in your movies but this is a simple `pivot/crosstab` with a `fillna` `pd.crosstab(df.user,df.movie,values=df['rating'],aggfunc='sum').fillna(0)` – Umar.H Jul 28 '19 at 13:32
  • agree. removed the column 3 in the dataframe – likes2code Jul 29 '19 at 12:15

2 Answers2

1

try using crosstab:

pd.crosstab(df.user, df.movie, values = df.rating, aggfunc = 'first').fillna(0)
# movie    1    2
# user           
# 1      3.0  4.0
# 2      2.0  5.0
# 3      3.0  0.0

to get interger values, just use .astype(int), as follows :

pd.crosstab(df.user, df.movie, values = df.rating, aggfunc = 'first').fillna(0).astype(int)
# movie  1  2 
# user          
# 1      3  4
# 2      2  5
# 3      3  0
Ayoub ZAROU
  • 2,387
  • 6
  • 20
  • This works on the sample dataframe I had provided in the question. But, I am still getting "Unstacked DataFrame is too big, causing int32 overflow" error as my input dataframe has 283228 rows (exact number) – likes2code Jul 29 '19 at 12:10
0

I'm not sure why would you expect movie 3 since it doesn't exist in the original data sample but other than that this will work for you: movie_ratings.set_index(['user', 'movie']).unstack('movie', fill_value=0)

Alex Fish
  • 768
  • 6
  • 18
  • This works on the sample dataframe I had provided in the question. But, I am still getting "Unstacked DataFrame is too big, causing int32 overflow" error as my input dataframe has 283228 rows (exact number) – likes2code Jul 29 '19 at 12:10
  • This worked for me with 300,000 columns. – Alex Fish Jul 30 '19 at 02:00