
I am trying to write item-item collaborative filtering recommendation code. My full dataset can be found here. I want users to become rows, items to become columns, and ratings to be the values.

My code is as follows:

import pandas as pd     
import numpy as np   
file = pd.read_csv("data.csv", names=['user', 'item', 'rating', 'timestamp'])
table = pd.pivot_table(file, values='rating', index=['user'], columns=['item'])

My data is as follows:

             user        item  rating   timestamp
0  A2EFCYXHNK06IS  5555991584       5   978480000  
1  A1WR23ER5HMAA9  5555991584       5   953424000
2  A2IR4Q0GPAFJKW  5555991584       4  1393545600
3  A2V0KUVAB9HSYO  5555991584       4   966124800
4  A1J0GL9HCA7ELW  5555991584       5  1007683200

And the error is:

Traceback (most recent call last):
  File "D:\python\reco.py", line 9, in <module>
    table = pd.pivot_table(file, values='rating', index=['user'], columns=['item'])
  File "C:\python35\lib\site-packages\pandas\tools\pivot.py", line 133, in pivot_table
    table = agged.unstack(to_unstack)
  File "C:\python35\lib\site-packages\pandas\core\frame.py", line 4047, in unstack
    return unstack(self, level, fill_value)
  File "C:\python35\lib\site-packages\pandas\core\reshape.py", line 402, in unstack
    return _unstack_multiple(obj, level)
  File "C:\python35\lib\site-packages\pandas\core\reshape.py", line 297, in _unstack_multiple
    unstacked = dummy.unstack('__placeholder__')
  File "C:\python35\lib\site-packages\pandas\core\frame.py", line 4047, in unstack
    return unstack(self, level, fill_value)
  File "C:\python35\lib\site-packages\pandas\core\reshape.py", line 406, in unstack
    return _unstack_frame(obj, level, fill_value=fill_value)
  File "C:\python35\lib\site-packages\pandas\core\reshape.py", line 449, in _unstack_frame
    fill_value=fill_value)
  File "C:\python35\lib\site-packages\pandas\core\reshape.py", line 103, in __init__
    self._make_selectors()
  File "C:\python35\lib\site-packages\pandas\core\reshape.py", line 137, in _make_selectors
    mask = np.zeros(np.prod(self.full_shape), dtype=bool)
ValueError: negative dimensions are not allowed
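The `ValueError` at the bottom of the trace is a symptom of integer overflow rather than bad data: the failing line computes `np.prod(self.full_shape)`, and on Windows NumPy's default integer is 32-bit, so the product of the pivot's dimensions (unique users × unique items) wraps around to a negative number once it exceeds 2^31 - 1. A minimal sketch with hypothetical dimension counts (the real unique counts are not shown in the question):

```python
import numpy as np

# Hypothetical unique-user / unique-item counts for a large ratings file
n_users, n_items = 500_000, 300_000

# Forcing 32-bit arithmetic reproduces what happens on Windows, where the
# default NumPy integer is int32: the product wraps to a negative value
size = np.prod(np.array([n_users, n_items], dtype=np.int32), dtype=np.int32)
print(size)  # a negative number

# np.zeros(size, dtype=bool) would then raise the same
# "ValueError: negative dimensions are not allowed" seen in the traceback
```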
  • Possible duplicate of [ValueError: negative dimensions are not allowed](http://stackoverflow.com/questions/19938587/valueerror-negative-dimensions-are-not-allowed) – Hamms Dec 13 '16 at 23:33
  • @Hamms. Do not mark it as duplicate, I have already seen the link you provided. But none of the answers there is helpful to my situation. I am not doing any matrix multiplication. – Prashant Sharma Dec 13 '16 at 23:55
  • please include a sample of your data: [mcve](http://stackoverflow.com/help/mcve). It is absolutely critical here, since this `pivot_table` call works for this sample data: `df = pd.DataFrame(np.random.rand(10,4), columns=['user','item','rating','timestamp'])`. – Julien Marrec Dec 14 '16 at 00:04
  • @JulienMarrec I have added the data sample to question. – Prashant Sharma Dec 14 '16 at 00:16
  • And I have absolutely no problem using your own pivot_table call with the data you provided... Try it yourself: copy the data you provided, load it with `file = pd.read_clipboard()` and then `table=pd.pivot_table(file,values='rating',index=['user'],columns=['item'])`. You need to provide a MCVE: so post a sample of your data that is sufficient to replicate the error you're having. – Julien Marrec Dec 14 '16 at 00:20
  • i have added link to whole dataset. @JulienMarrec – Prashant Sharma Dec 14 '16 at 00:47
  • Hum, I'm not trying to be difficult here, but a pivot on 836,005 rows times 4 columns to track down your error is a bit much: try to narrow down the problem (MCVE, m=minimal) (FYI, I did try it... my computer just won't do it, the tasks get killed, running out of memory) – Julien Marrec Dec 14 '16 at 01:03
  • @JulienMarrec Same problem I faced. I did try to do it with 100K rows, still it showed the error. Can you suggest any other idea to pivot like this? I have mentioned in my problem what I am trying to get. – Prashant Sharma Dec 14 '16 at 01:12
  • Just to be clear, I don't get any error, my ipython kernel just crashes. I've tried it on like a 100 different sample of 0.01% of your data (`df = file.sample(frac=0.1)` then pivot_table df) and that works nicely though. I'm not sure how you managed to get the above Error Trace. As far as running out of memory, you'd need to this using something like dask (pandas dataframe out of memory basically) or an HDF store: read [this question](http://stackoverflow.com/questions/29439589/how-to-create-a-pivot-table-on-extremely-large-dataframes-in-pandas) for this. – Julien Marrec Dec 14 '16 at 01:19
  • Also, you do realize that what you intend to do is to create a dataframe with like a gazillion NaNs? Are you sure you need this? – Julien Marrec Dec 14 '16 at 01:21
  • @JulienMarrec I will fill those NaNs by mean values. – Prashant Sharma Dec 14 '16 at 02:01
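To put the memory concern from the comments in numbers: a dense users × items pivot of a dataset this size is far beyond RAM. A back-of-the-envelope estimate (the unique counts below are hypothetical round numbers, not taken from the actual dataset):

```python
# Back-of-the-envelope size of the dense pivot table the question attempts.
# The unique user/item counts here are hypothetical round numbers.
n_users, n_items = 700_000, 180_000

cells = n_users * n_items                # every (user, item) pair gets a cell
bytes_needed = cells * 8                 # one float64 rating (or NaN) per cell
print(f"{cells / 1e9:.0f}B cells, ~{bytes_needed / 1e12:.1f} TB dense")
```

Filling the NaNs with mean values does not help here: it makes every cell a real number, which is exactly the dense case estimated above.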

1 Answer


I cannot guarantee that this will complete (I got tired of waiting for it to compute), but here is a way to build a sparse representation that should keep memory usage manageable.

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix

file = pd.read_csv("data.csv", names=['user', 'item', 'rating', 'timestamp'])

# Sorted lists of unique users and items define the row/column order
user_u = sorted(file.user.unique())
item_u = sorted(file.item.unique())

# Map each user/item to its integer position via categoricals
# (pd.Categorical(...).codes replaces the old .astype('category',
# categories=...) call, which was removed in later pandas versions)
row = pd.Categorical(file.user, categories=user_u).codes
col = pd.Categorical(file.item, categories=item_u).codes

sparse_matrix = csr_matrix((file['rating'].values, (row, col)),
                           shape=(len(user_u), len(item_u)))

# Wrap the sparse matrix in a DataFrame without densifying it
# (pandas >= 1.0; older versions used the now-removed pd.SparseDataFrame)
df = pd.DataFrame.sparse.from_spmatrix(sparse_matrix, index=user_u, columns=item_u)

See this question for more options.
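Once the ratings are in a `csr_matrix`, the item-item similarity step of the recommender can also stay sparse. A sketch on a tiny hand-made matrix (the real code would reuse `sparse_matrix` from above): cosine similarity between item columns is `MᵀM` after normalizing each column to unit length.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny illustrative ratings matrix: 3 users x 3 items, 0 = unrated
ratings = csr_matrix(np.array([[5., 4., 0.],
                               [4., 0., 2.],
                               [0., 3., 5.]]))

# L2 norm of each item column, then scale columns to unit length
norms = np.sqrt(np.asarray(ratings.multiply(ratings).sum(axis=0)).ravel())
normalized = ratings.multiply(1.0 / norms).tocsr()

# Item-item cosine similarities: (M^T M) on the normalized columns
item_sim = (normalized.T @ normalized).toarray()
print(np.round(item_sim, 3))
```

On the full dataset, `normalized.T @ normalized` stays sparse as long as few item pairs share raters, which is the usual case for ratings data.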

Julien Marrec
    +1, this is the only way to deal with this data. The full dense ratings matrix will have >127B entries, far too big to fit into memory. You can also use `Series.cat.categories` to index your sparse data frame, to avoid the `list(sorted(...))` thing. – Igor Raush Dec 14 '16 at 02:37
  • @julien I shall try this. Thanks a lot for your help. I am stuck at this problem for last two days. – Prashant Sharma Dec 14 '16 at 03:17
  • Let it run and please let me know either way if it worked or not, I'm curious – Julien Marrec Dec 14 '16 at 12:27