Some suggestions were to downgrade to pandas==0.21, which is not really a feasible solution!
I faced the same issue and needed an urgent fix for the unexpected int32 overflow. One of our recommendation models was running in production, and at some point the user base grew to more than 7 million records with around 21k items.
So, to solve the issue, I chunked the dataset as mentioned by @igorkf, created the pivot table for each chunk using unstack, and appended the results gradually.
import pandas as pd
from tqdm import tqdm

chunk_size = 50000
chunks = [x for x in range(0, df.shape[0], chunk_size)]
chunks.append(df.shape[0])  # so the final partial chunk is not dropped

for i in range(0, len(chunks) - 1):
    print(chunks[i], chunks[i + 1] - 1)
0 49999
50000 99999
100000 149999
150000 199999
200000 249990
.........................
pieces = []
for i in tqdm(range(0, len(chunks) - 1)):
    # iloc's stop index is already exclusive, so no "- 1" here
    chunk_df = df.iloc[chunks[i]:chunks[i + 1]]
    interactions = (chunk_df.groupby([user_col, item_col])[rating_col]
                    .sum()
                    .unstack()
                    .reset_index()
                    .fillna(0)
                    .set_index(user_col))
    print(interactions.shape)
    pieces.append(interactions)

# DataFrame.append was removed in pandas 2.0; concatenate once instead
pivot_df = pd.concat(pieces, sort=False).fillna(0)
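Here is a minimal self-contained sketch of the chunked unstack on synthetic data (the column names user_id/item_id/rating are illustrative, not from the original code). One detail worth noting: a user whose rows straddle a chunk boundary ends up in more than one partial row, so the sketch merges those duplicates with a final group-by sum.

```python
import numpy as np
import pandas as pd

# Synthetic interactions; user_id/item_id/rating are illustrative names.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "user_id": rng.integers(0, 100, size=1000),
    "item_id": rng.integers(0, 20, size=1000),
    "rating": rng.integers(1, 6, size=1000),
})

chunk_size = 300
chunks = list(range(0, df.shape[0], chunk_size))
chunks.append(df.shape[0])  # include the final partial chunk

pieces = []
for i in range(len(chunks) - 1):
    chunk_df = df.iloc[chunks[i]:chunks[i + 1]]
    pieces.append(chunk_df.groupby(["user_id", "item_id"])["rating"]
                  .sum()
                  .unstack()
                  .fillna(0))

# A user whose rows straddle a chunk boundary shows up in more than one
# partial row, so merge duplicates with a final group-by sum.
pivot_df = pd.concat(pieces, sort=False).fillna(0).groupby(level=0).sum()
```

The result matches what a single groupby/unstack over the whole frame would produce, but each chunk only unstacks a slice at a time.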
Then I had to build a sparse matrix as input to the LightFM recommendation model (which runs a matrix-factorization algorithm). You can use this approach for any use case where unstacking is required. The following code converts the result to a sparse matrix:
from scipy import sparse
import numpy as np

sparse_matrix = sparse.csr_matrix(pivot_df.to_numpy())
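If memory is tight, an alternative is to skip the dense pivot entirely and build the sparse matrix straight from the raw interactions with pd.factorize and a COO matrix. This is a different technique from the answer above, not part of it, and the column names are again illustrative:

```python
import pandas as pd
from scipy import sparse

# Illustrative data; in practice this would be the raw interactions frame.
df = pd.DataFrame({
    "user_id": [10, 10, 20, 30, 30, 30],
    "item_id": ["a", "b", "a", "b", "c", "c"],
    "rating":  [1, 2, 3, 4, 5, 1],
})

# Map users and items to contiguous integer codes.
user_codes, users = pd.factorize(df["user_id"], sort=True)
item_codes, items = pd.factorize(df["item_id"], sort=True)

# COO entries with duplicate (row, col) pairs are summed when converting
# to CSR, mirroring the groupby(...).sum() in the chunked approach.
sparse_matrix = sparse.coo_matrix(
    (df["rating"].to_numpy(), (user_codes, item_codes)),
    shape=(len(users), len(items)),
).tocsr()
```

This never materializes the dense users-by-items array, so the int32 overflow in unstack cannot occur in the first place.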
NB: pandas has a pivot_table function that can be used for unstacking if your data is small. In my case, pivot_table was really slow.
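For reference, the pivot_table equivalent of the groupby/unstack chain looks like this on toy data (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "item_id": ["a", "b", "a"],
    "rating":  [3, 4, 5],
})

# aggfunc="sum" matches groupby(...).sum(); fill_value=0 matches fillna(0)
pivot = df.pivot_table(index="user_id", columns="item_id",
                       values="rating", aggfunc="sum", fill_value=0)
```

It produces the same users-by-items table in one call, which is convenient for small frames but was too slow at the scale described above.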