I have two medium-sized datasets which look like this:

books_df.head()

    ISBN    Book-Title  Book-Author
0   0195153448  Classical Mythology     Mark P. O. Morford
1   0002005018  Clara Callan    Richard Bruce Wright
2   0060973129  Decision in Normandy    Carlo D'Este
3   0374157065  Flu: The Story of the Great Influenza Pandemic...   Gina Bari Kolata
4   0393045218  The Mummies of Urumchi  E. J. W. Barber

and

ratings_df.head()

    User-ID     ISBN    Book-Rating
0   276725  034545104X  0
1   276726  0155061224  5
2   276727  0446520802  0
3   276729  052165615X  3
4   276729  0521795028  6

I want to get a pivot table like this:

ISBN    1   2   3   4   5   6   7   8   9   10  ... 3943    3944    3945    3946    3947    3948    3949    3950    3951    3952
User-ID                                                                                 
1   5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5   0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

I've tried:

R_df = ratings_df.pivot(index='User-ID', columns='ISBN', values='Book-Rating').fillna(0)  # Memory overflow

which failed with:

MemoryError

and this:

R_df = ratings_df.groupby(['User-ID', 'ISBN'])['Book-Rating'].mean().unstack()

which failed with the same error.

I want to use the resulting matrix for singular value decomposition and matrix factorization.

Any ideas?

The dataset I'm working with is: http://www2.informatik.uni-freiburg.de/~cziegler/BX/

Stanislav Jirák
  • Your "medium-sized dataset" becomes enormous when you expand it from 2 to 4000 columns. How many unique users and ISBNs do you have? – ALollz Jul 31 '19 at 17:51

1 Answer

One option is to use pandas Sparse functionality, since your data here is (very) sparse:

In [11]: df
Out[11]:
   User-ID        ISBN  Book-Rating
0   276725  034545104X            0
1   276726  0155061224            5
2   276727  0446520802            0
3   276729  052165615X            3
4   276729  0521795028            6

In [12]: res = df.groupby(['User-ID', 'ISBN'])['Book-Rating'].mean().astype('Sparse[int]')

In [13]: res.unstack(fill_value=0)
Out[13]:
ISBN     0155061224  034545104X  0446520802  052165615X  0521795028
User-ID
276725            0           0           0           0           0
276726            5           0           0           0           0
276727            0           0           0           0           0
276729            0           0           0           3           6

In [14]: _.dtypes
Out[14]:
ISBN
0155061224    Sparse[int64, 0]
034545104X    Sparse[int64, 0]
0446520802    Sparse[int64, 0]
052165615X    Sparse[int64, 0]
0521795028    Sparse[int64, 0]
dtype: object
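
As a quick sanity check (my own sketch, not part of the original answer; `pivot` is just a local name for the unstacked result), the sparse accessor can report how sparse the frame is and what it actually costs in memory:

pivot = res.unstack(fill_value=0)

# Fraction of entries that are not the fill value (0); for the full
# ratings data this should be a very small number.
print(pivot.sparse.density)

# Bytes actually allocated by the sparse columns -- far less than a
# dense equivalent when density is low.
print(pivot.memory_usage(deep=True).sum())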

My understanding is that you can then use this with scipy, e.g. for SVD:

In [15]: res.unstack(fill_value=0).sparse.to_coo()
Out[15]:
<4x5 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in COOrdinate format>
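
The answer stops at the COO conversion, so here is a minimal sketch of the SVD step itself (my addition: the choice of k=3 singular values is illustrative, and scipy's svds needs a float matrix, so the cast below is required):

import numpy as np
from scipy.sparse.linalg import svds

# Sparse user-item matrix, built as above
mat = res.unstack(fill_value=0).sparse.to_coo()

# svds requires floats; CSR format also gives efficient row access
mat = mat.tocsr().astype(np.float64)

# k must be strictly less than min(mat.shape); 3 works for this toy example
U, sigma, Vt = svds(mat, k=3)

# Rank-k reconstruction, if you want predicted ratings back
approx = U @ np.diag(sigma) @ Vt

On the full dataset you would choose k to suit your factorization, bounded above by min(n_users, n_books) - 1.
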
Andy Hayden
  • Could this work for a string column using .first() instead of .mean()? I am getting `ValueError: could not convert string to float`. Thanks – Dudelstein Apr 12 '23 at 13:02