Pandas pivot_table/groupby taking too long on very large dataframe

Question

I am working on a dataframe of 18 million rows with the following structure:

I need to get a count of the subsystem for each suite as per the name_heuristic (there are 4 values for that column). So I need an output with columns for each type of name_heuristic with the suite as index and values will be count of subsystems as per each column.

I have tried using pivot_table with the following code:

df_table = pd.pivot_table(df, index='suite', columns='name_heuristics', values='subsystem', aggfunc=np.sum

But even after an HOUR, it is not done computing. What is taking so long and how can I speed it up? I even tried a groupby alternative that is still running 15 minutes and counting:

df_table = df.groupby(['name_heuristics', 'suite']).agg({'subsystem': np.sum}).unstack(level='name_heuristics').fillna(0)

Any help is greatly appreciated! I have been stuck on this for hours.

Could you please add [MWE](https://stackoverflow.com/help/minimal-reproducible-example) (for pandas there are good hints in the answers in [this question](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples))? — sophros, Sep 29 '20 at 09:13
your data set is to large for Pandas, you need to use a relational dbms (SQL) or something like Dask or Pyspark. — Umar.H, Sep 29 '20 at 09:15
@Manakin, this [question](https://stackoverflow.com/questions/55404617/faster-alternatives-to-pandas-pivot-table) claims they are able to do it in mere seconds — Krithika Raghavendran, Sep 29 '20 at 09:21
1. all the data in that question is categorical, 2. your data is not like. 3. Pandas is memory dependent, they may have run that on a 128 GB ram machine — Umar.H, Sep 29 '20 at 09:47

score -1 · Answer 1 · answered Sep 01 '22 at 14:01

It seems pivoting more than one categorical column crashes pandas. My solution to a similar problem was converting categorical to object for the target columns, using

step 1

 df['col1'] = df['col1'].astype('object')

 df['col2'] = df['col2'].astype('object')

step 2

 df_pivot = pandas.pivot_table(df, columns=['col1', 'col2'], index=...

This was independent of dataframe size...

Pandas pivot_table/groupby taking too long on very large dataframe

1 Answers1

Linked