I have a sales dataset with more than 400K rows, and I need to run the pivot below so that each order becomes a single row with all of its SKUs spread across columns. I need to do this for every order, because I will then build another table from this data.

However, I get the error below:

ValueError: Unstacked DataFrame is too big, causing int32 overflow
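
This error appears to be tied to the number of cells in the wide result (rows times columns) not fitting in a 32-bit integer. A rough way to check that before pivoting is sketched below. The helper name `estimate_pivot_cells` and the toy data are made up for illustration; only the `ORDER_ID` and `SKU_ID` column names come from my data.

```python
import numpy as np
import pandas as pd


def estimate_pivot_cells(df, order_col="ORDER_ID"):
    """Return (n_rows, n_cols, n_cells) that the wide table would need.

    Hypothetical helper: rows = distinct orders, columns = the largest
    number of lines found in any single order.
    """
    per_order = df.groupby(order_col).size()
    n_rows = int(per_order.shape[0])
    n_cols = int(per_order.max()) if n_rows else 0
    return n_rows, n_cols, n_rows * n_cols


# Toy data standing in for sales.csv (assumption, not the real file).
toy = pd.DataFrame(
    {"ORDER_ID": [1, 1, 1, 2, 3, 3], "SKU_ID": list("abcdef")}
)

n_rows, n_cols, n_cells = estimate_pivot_cells(toy)
INT32_MAX = np.iinfo(np.int32).max
print(n_rows, n_cols, n_cells, n_cells > INT32_MAX)
```

If the last value printed is `True` on the real data, a single order with a very large number of SKU lines may be what inflates the column count.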

This is the first time I have applied this method to a large dataset, and I will need to scale it to even larger datasets.

This is my code:

import pandas as pd

# Load the sales lines and drop exact duplicate rows.
df1 = pd.read_csv('sales.csv')
df1 = df1.drop_duplicates()

# Number the SKUs within each order (0, 1, 2, ...) and spread them
# into columns, giving one row per ORDER_ID.
df1.index = df1['ORDER_ID']
df3 = (
    df1.assign(col=df1.groupby(level=0).SKU_ID.cumcount())
       .pivot(columns='col', values='SKU_ID')
       .reset_index()
)

Is there a way to run this in ranges (chunks) and concatenate the results? I have not found a way to do that yet.
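
To make the idea concrete, here is a minimal sketch of what "in ranges and concat" could look like: pivot a batch of orders at a time, then stack the pieces. The function name `pivot_in_chunks`, the `orders_per_chunk` parameter, and the toy data are assumptions for illustration, and this has not been verified on the full dataset.

```python
import pandas as pd


def pivot_in_chunks(df, order_col="ORDER_ID", sku_col="SKU_ID",
                    orders_per_chunk=50_000):
    """Pivot SKUs into columns one batch of orders at a time.

    Hypothetical helper: each batch is pivoted on its own, so every
    individual pivot call handles far fewer rows than the full table.
    """
    order_ids = df[order_col].unique()
    pieces = []
    for start in range(0, len(order_ids), orders_per_chunk):
        batch = order_ids[start:start + orders_per_chunk]
        part = df[df[order_col].isin(batch)]
        # Position of each SKU within its order: 0, 1, 2, ...
        part = part.assign(col=part.groupby(order_col).cumcount())
        pieces.append(part.pivot(index=order_col, columns="col",
                                 values=sku_col))
    # concat takes the union of columns; shorter orders are padded with NaN.
    return pd.concat(pieces).reset_index()


# Toy data standing in for sales.csv (assumption, not the real file).
toy = pd.DataFrame(
    {"ORDER_ID": [1, 1, 1, 2, 3, 3], "SKU_ID": list("abcdef")}
)
result = pivot_in_chunks(toy, orders_per_chunk=2)
print(result)
```

Note that the concatenated result still has as many columns as the longest order, so this only sidesteps the size check inside each pivot call; it does not reduce the memory needed for the final table.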

Caio Euzébio
  • Maybe this will help: https://stackoverflow.com/questions/61757170/python-unstacked-dataframe-is-too-big-causing-int32-overflow – Andrej Kesely Apr 26 '21 at 20:07
  • Do you have sample data to recreate this? You can try Dask - https://docs.dask.org/en/latest/. I tried with some large data from http://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/ but could not recreate it. – simpleApp Apr 27 '21 at 02:17
  • Sure, I will share the dataset with you. – Caio Euzébio Apr 27 '21 at 20:04

0 Answers