The primary issue is that a conventional pandas pivot materializes a value for every row/column combination. Even when most of those values are just a fill default for missing (row, column) pairs (in your example, a rating of 0 for every movie a user hasn't rated), the total number of cells can overflow a 32-bit integer and easily exceed available memory.
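For reference, the conventional approach that runs into this (sketched here under the assumption that your dataframe df has the userID, titleID and rating columns listed further down) would be something like:
# Dense pivot: materializes a cell for every (user, title) combination, so at
# this scale the reshape can overflow a 32-bit index or simply exhaust memory.
dense = df.pivot_table(index='userID', columns='titleID', values='rating', fill_value=0)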
The solution is to use sparse data structures. This SO question has an answer that walks through how to do this using csr_matrix from scipy.sparse and CategoricalDtype from pandas.api.types, but it relies on pd.SparseDataFrame, which has been removed from recent versions of pandas.
Here is code that should be able to handle the example in your question.
import pandas as pd
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

rcLabel, vLabel = ('userID', 'titleID'), 'rating'
# one ordered categorical per axis, so every userID/titleID maps to a stable integer code
rcCat = [CategoricalDtype(sorted(df[col].unique()), ordered=True) for col in rcLabel]
# integer row/column coordinates for each rating in df
rc = [df[column].astype(aType).cat.codes for column, aType in zip(rcLabel, rcCat)]
# sparse matrix that stores only the ratings that actually exist
mat = csr_matrix((df[vLabel], rc), shape=tuple(cat.categories.size for cat in rcCat))
# wrap the matrix in a DataFrame backed by sparse columns, labelled by user and title
dfOut = pd.DataFrame.sparse.from_spmatrix(
    mat, index=rcCat[0].categories, columns=rcCat[1].categories)
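Once dfOut exists, individual ratings can be looked up by label; a (user, title) pair that never appeared in the input reads back as the sparse fill value of 0 rather than NaN. As a minimal sanity check (the specific pair chosen here is arbitrary):
# pick a (user, title) pair that is known to exist in the input
someUser, someTitle = df['userID'].iloc[0], df['titleID'].iloc[0]
print(dfOut.at[someUser, someTitle])   # the rating stored for that pair
# any other title this user never rated reads back as 0.0 (the sparse fill value)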
Code to generate sample input:
import pandas as pd
from random import randrange

dfLen = 3_500_000
# IMDb-style identifiers: 10 million possible titles, 67 million possible users
titles = [f'tt{randrange(0, 10_000_000):07}' for _ in range(dfLen)]
df = pd.DataFrame({
    'titleID': titles,
    'userID': [f'ur{randrange(0, 67_000_000):08}' for _ in range(dfLen)],
    'primaryTitle': titles,
    'rating': [float(randrange(1, 11)) for _ in range(dfLen)]})
Here are the assumptions I have made:
- the input dataframe is named df, and the pivoted output is named dfOut
- there are 67 million unique users
- there are 10 million unique titles
- there are 3.5 million rows (i.e., userID, titleID pairs)
Observations:
- In a dense (non-sparse) pivot, this could create as many as 3.5 million squared, i.e. over 10 trillion values, compared with about 3.5 million stored values in the sparse representation.
- In my example (which selects users and titles at random from within the assumed populations), the sparse result has dimensions [3410055 rows x 2952380 columns], and its info() method reports memory usage: 195.1+ MB.
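If you want to reproduce those figures, something along these lines should work (the exact numbers vary from run to run because the sample input is random):
print(dfOut.shape)           # (3410055, 2952380) in the run described above
dfOut.info()                 # reports the sparse memory footprint (195.1+ MB above)
print(dfOut.sparse.density)  # fraction of cells actually stored, roughly 3.5e-7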