
I have a big dataset and when I try to run this code I get a memory error.

user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()

Here is the error:

ValueError: Unstacked DataFrame is too big, causing int32 overflow

I have run it on another machine and it worked fine! How can I fix this error?

Hamid

6 Answers


According to Google, you can downgrade your pandas version to 0.21, which does not have this problem with pivoting very large data.

Guillermo Mosse

As @Ehsan pointed out, we can pivot the table in chunks.

Suppose you have a DataFrame with 3,355,205 rows!
Let's build chunks of size 5000:

chunk_size = 5000
# append df.shape[0] so the final partial chunk is not dropped
chunks = list(range(0, df.shape[0], chunk_size)) + [df.shape[0]]

for i in range(0, len(chunks) - 1):
    print(chunks[i], chunks[i + 1] - 1)

0 4999
5000 9999
10000 14999
15000 19999
20000 24999
..continue..

All you have to do now is run a list comprehension inside pd.concat():

df_new = pd.concat([
    df.iloc[chunks[i]:chunks[i + 1]]  # iloc is end-exclusive, so no "- 1" here
      .pivot(index='user_id', columns='item', values='views')
    for i in range(0, len(chunks) - 1)
])

This approach is useful when you have to build a sparse matrix for a recommender system.
After this you could do:

from scipy import sparse
# fill the NaNs left by pivot; otherwise they are stored as explicit entries
spr = sparse.coo_matrix(df_new.fillna(0).to_numpy())
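
Note that if rows for the same user_id land in different chunks, df_new will contain duplicate index entries. A minimal sketch of one way to merge them afterwards, assuming max is the right way to combine duplicate rows:

df_new = df_new.groupby(level=0).max()  # collapse duplicate user_id rows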
igorkf
  • Using this solution I get a ValueError: Index contains duplicate entries, cannot reshape. This is most likely because, when chunking, rows for the same user_id apparently end up in different chunks. How do I go about that? I tried replacing pivot with pivot_table, because that's what I used earlier, and it seemed to work at first: following tqdm, the list comprehension succeeds, but something (pd.concat?) either runs infinitely or crashes the kernel. –  Mar 02 '22 at 09:01

As it turns out, this was not an issue in pandas 0.21. I am using a Jupyter notebook, and I need the latest version of pandas for the rest of the code. So I did this:

!pip install pandas==0.21   # downgrade to a version without the overflow
import pandas as pd
user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()
!pip install pandas         # restore the latest version afterwards

This code works in a Jupyter notebook. First, it downgrades pandas to 0.21 and runs the code. After producing the required dataset, it updates pandas back to the latest version. Check the issue raised on GitHub here. This post was also helpful for increasing the memory available to the Jupyter notebook.

Hamid

Some suggestions were to downgrade to pandas==0.21, which is not really a feasible solution!

I faced the same issue and needed an urgent fix for the unexpected int32 overflow. One of our recommendation models was running in production, and at some point the user base grew to more than 7 million records with around 21k items.

So, to solve the issue, I chunked the dataset as @igorkf mentioned, created the pivot table using unstack, and appended the chunks gradually.

import pandas as pd 
from tqdm import tqdm

chunk_size = 50000
# append df.shape[0] so the final partial chunk is not dropped
chunks = list(range(0, df.shape[0], chunk_size)) + [df.shape[0]]

for i in range(0, len(chunks) - 1):
    print(chunks[i], chunks[i + 1] - 1)
0 49999
50000 99999
100000 149999
150000 199999
200000 249999
.........................

# user_col, item_col, and rating_col hold your column names, e.g. 'user_id'
pieces = []

for i in tqdm(range(0, len(chunks) - 1)):
    chunk_df = df.iloc[chunks[i]:chunks[i + 1]]  # iloc is end-exclusive
    interactions = (chunk_df.groupby([user_col, item_col])[rating_col]
        .sum()
        .unstack()
        .reset_index()
        .fillna(0)
        .set_index(user_col)
    )
    print(interactions.shape)
    pieces.append(interactions)

# DataFrame.append was removed in pandas 2.0; concatenate once instead
pivot_df = pd.concat(pieces, sort=False)

I then had to build a sparse matrix as input to the LightFM recommendation model (which runs a matrix-factorization algorithm). You can use this for any use case where unstacking is required. The following code converts the result to a sparse matrix:

from scipy import sparse
# the chunked pivot above produced pivot_df (already filled with zeros)
sparse_matrix = sparse.csr_matrix(pivot_df.to_numpy())
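
For example, feeding it to LightFM (a minimal sketch; the loss function and epoch count are placeholder choices, not from the original setup):

from lightfm import LightFM

model = LightFM(loss='warp')           # placeholder hyperparameter
model.fit(sparse_matrix, epochs=10)    # train on the user-item matrix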

NB: pandas has a pivot_table function, which can be used for unstacking if your data is small. In my case, pivot_table was really slow.


Check out Dask. It allows you to scale tools like pandas, NumPy, and scikit-learn. It is a useful tool for manipulating large datasets, even when those datasets don't fit in memory.
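
A minimal sketch of the same pivot with Dask (untested; the file pattern and column names are hypothetical, mirroring the question). Dask's pivot_table only supports a handful of aggregations such as 'mean', 'sum', and 'count', so the max is taken in an out-of-core groupby first:

import dask.dataframe as dd

# lazily load the ratings data
ddf = dd.read_csv('ratings-*.csv')

# reduce to one row per (user_id, movie_id) pair without loading
# everything into memory at once
deduped = (ddf.groupby(['user_id', 'movie_id'])['rating']
              .max()
              .reset_index())

# dask's pivot_table requires the `columns` field to be categorical
# with known categories
deduped['movie_id'] = deduped['movie_id'].astype('category').cat.as_known()

# each (user, movie) pair is now unique, so aggfunc='sum' simply places each rating
user_by_movie = dd.pivot_table(
    deduped, index='user_id', columns='movie_id',
    values='rating', aggfunc='sum',
).compute()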

Asiak

This issue was fixed in later versions of pandas (1.4.x); see #45084. Upgrade your pandas version and you should be through.
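
For example (assuming pip manages your environment):

!pip install --upgrade pandas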
