1

I'm using Movie Lens Dataset in Python Pandas. I need to print the matrix of u.data a tab separated file in foll. way

NULL    MovieID1 MovieID2 MovieID3
UserID1 Rating   Rating   Rating
UserID2 Rating   Rating   Rating

I've already been through following links

  1. One - Dataset is much huge put it in series
  2. Two - Transpose of Row not mentioned
  3. Three - Tried with reindex so as to get NaN values in one column
  4. Four - df.iloc and df.ix didn't work either

I need the output so as it shows me rating and NaN (when not rated) for movies w.r.t. users.

    NULL    MovieID1 MovieID2 MovieID3
    UserID1 Rating   Rating   NaN
    UserID2 Rating   NaN      Rating

P.S. I won't mind having solutions with numpy, crab, recsys, csv or any other python package

EDIT 1 - Sorted the data and exported but got an additional field

df2 = df.sort_values(['UserID','MovieID'])
print type(df2)
df2.to_csv("sorted.csv")
print df2

The file produces foll. sorted.csv file

,UserID,MovieID,Rating,TimeStamp
32236,1,1,5,874965758
23171,1,2,3,876893171
83307,1,3,4,878542960
62631,1,4,3,876893119
47638,1,5,3,889751712
5533,1,6,5,887431973
70539,1,7,4,875071561
31650,1,8,1,875072484
20175,1,9,5,878543541
13542,1,10,3,875693118

EDIT 2 - As asked in Comments

Here's the format of Data in u.data file which acts as input

196 242 3   881250949
186 302 3   891717742
22  377 1   878887116
244 51  2   880606923
166 346 1   886397596
298 474 4   884182806
115 265 2   881171488
253 465 5   891628467
305 451 3   886324817
T3J45
  • 717
  • 3
  • 12
  • 32

1 Answers1

2

One method:

Use pivot_table and if one value per user and movie id then aggfunc doesn't matter, however if there are multiple values, the choose your aggregation.

df.pivot_table(values='Rating',index='UserID',columns='MovieID', aggfunc='mean')

Second method (no duplicate userid, movieid records):

df.set_index(['UserID','MovieID'])['Rating'].unstack()

Third method (no duplicate userid, movieid records):

df.pivot(index='UserID',columns='MovieID',values='Rating')

Fourth method (like the first you can choose your aggregation method):

df.groupby(['UserID','MovieID'])['Rating'].mean().unstack()

Output:

MovieID  1   2   3   4   5   6   7   8   9   10
UserID                                         
1         5   3   4   3   3   5   4   1   5   3
Scott Boston
  • 147,308
  • 15
  • 139
  • 187