
I'm trying to pivot this dataframe ('refined_dataset'):

[image: preview of the refined_dataset dataframe]

Like this:

movieUser_df = refined_dataset.pivot(
    index='userID',
    columns='primaryTitle',
    # Replacing all movies users haven't rated with a rating of 0
    values='rating').fillna(0)

And it returns the error: IndexError: index 875914235 is out of bounds for axis 0 with size 875909652

I've used this exact method on almost identical (but much smaller) datasets in the past. Looking into why I might be having this problem, I came across this post from 5 years ago explaining that it's a long-standing pandas issue. Apart from a couple of comments on that post (the most recent two years ago, and I don't know if there have been updates since), I can't find anyone else discussing this problem or possible solutions. Does anyone know if this pandas issue is indeed my problem, and either way, is there a different way I can try to do this?
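For reference, a minimal version of this pivot with made-up data (hypothetical user IDs and titles) works fine at small scale:

```python
import pandas as pd

# Tiny stand-in for refined_dataset (made-up data)
refined_dataset = pd.DataFrame({
    'userID': ['u1', 'u1', 'u2'],
    'primaryTitle': ['Movie A', 'Movie B', 'Movie A'],
    'rating': [5.0, 3.0, 4.0]})

# Same pivot as above: unrated (user, movie) pairs become 0
movieUser_df = refined_dataset.pivot(
    index='userID', columns='primaryTitle', values='rating').fillna(0)
```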

Marcus
  • Prior to getting the error you have reported, I get this warning: `pandas\core\reshape\reshape.py:130: RuntimeWarning: overflow encountered in long_scalars num_cells = num_rows * num_columns`. – constantstranger Mar 19 '23 at 19:02

1 Answer


The primary issue is that a conventional pandas pivot materializes one cell for every (row, column) pair. Even when most of those cells are just a fill default for missing pairs (in your example, a rating of 0 for every movie a user hasn't rated), the total cell count can overflow a fixed-width integer during the reshape (hence the `num_cells = num_rows * num_columns` overflow warning) and can also exceed available memory.
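To see the scale involved, here's a back-of-envelope sketch using the population sizes assumed later in this answer (67 million users, 10 million titles; these are assumptions, not figures from your data):

```python
# One cell per (user, title) pair in a dense pivot (assumed populations):
n_users, n_titles = 67_000_000, 10_000_000
n_cells = n_users * n_titles            # 670 trillion cells
dense_gb = n_cells * 8 / 1e9            # float64 cells: millions of GB
print(f"{n_cells:,} cells -> ~{dense_gb:,.0f} GB if stored densely")
```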

The solution is to use sparse data structures. This SO question has an answer that walks through how to do this using csr_matrix from scipy.sparse and CategoricalDtype from pandas.api.types, but it relies on pd.SparseDataFrame, which was removed in pandas 1.0.

Here is code that should be able to handle the example in your question.

import pandas as pd
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

# Row/column label columns and the value column
rcLabel, vLabel = ('userID', 'titleID'), 'rating'
# Ordered categoricals map each user/title to a stable integer code
rcCat = [CategoricalDtype(sorted(df[col].unique()), ordered=True) for col in rcLabel]
rc = [df[column].astype(aType).cat.codes for column, aType in zip(rcLabel, rcCat)]
# Build the sparse matrix directly from (values, (row_codes, col_codes))
mat = csr_matrix((df[vLabel], rc), shape=tuple(cat.categories.size for cat in rcCat))
dfOut = pd.DataFrame.sparse.from_spmatrix(
    mat, index=rcCat[0].categories, columns=rcCat[1].categories)
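Here's the same recipe run end to end on a toy ratings table (made-up users and titles), so you can see the shape of the output before committing to the full dataset:

```python
import pandas as pd
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

# Toy ratings table (made-up data)
toy = pd.DataFrame({
    'userID':  ['u1', 'u1', 'u2'],
    'titleID': ['t1', 't2', 't1'],
    'rating':  [5.0, 3.0, 4.0]})

cats = [CategoricalDtype(sorted(toy[c].unique()), ordered=True)
        for c in ('userID', 'titleID')]
rows = toy['userID'].astype(cats[0]).cat.codes
cols = toy['titleID'].astype(cats[1]).cat.codes
mat = csr_matrix((toy['rating'], (rows, cols)),
                 shape=(cats[0].categories.size, cats[1].categories.size))
pivoted = pd.DataFrame.sparse.from_spmatrix(
    mat, index=cats[0].categories, columns=cats[1].categories)
# Missing (user, title) pairs come back as the sparse fill value, 0
```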

Code to generate sample input:

import pandas as pd
from random import randrange

dfLen = 3_500_000
titles = [f'tt{randrange(0, 10_000_000):07}' for _ in range(dfLen)]
df = pd.DataFrame({
    'titleID': titles,
    'userID': [f'ur{randrange(0, 67_000_000):08}' for _ in range(dfLen)],
    'primaryTitle': titles,
    'rating': [float(randrange(1, 11)) for _ in range(dfLen)]})

Here are the assumptions I have made:

  • the input dataframe is named df, the pivoted output is named dfOut
  • there are 67 million unique users
  • there are 10 million unique titles
  • there are 3.5 million rows (i.e., userID, titleID pairs)

Observations:

  • In a dense (non-sparse) pivot, this could create roughly 3.4 million unique users × 3 million unique titles ≈ 10 trillion cells, compared with only about 3.5 million stored values in a sparse representation.
  • In my example (which selects users and titles at random from within the assumed populations), the sparse result dimensions are [3410055 rows x 2952380 columns] and the info() method reports memory usage: 195.1+ MB.
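A quick, self-contained sanity check (toy sizes, random data) that the sparse frame only pays for the cells that actually exist:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# 1,000 ratings scattered over a 10,000 x 10,000 user/title grid
rng = np.random.default_rng(0)
rows = rng.integers(0, 10_000, 1_000)
cols = rng.integers(0, 10_000, 1_000)
vals = rng.integers(1, 11, 1_000).astype(float)

mat = csr_matrix((vals, (rows, cols)), shape=(10_000, 10_000))
sparse_df = pd.DataFrame.sparse.from_spmatrix(mat)
# density is ~1e-5: ~1,000 stored values instead of 100 million cells
```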
constantstranger