0

I am working on a dataframe of 18 million rows with the following structure:

enter image description here

I need to get a count of the subsystem for each suite as per the name_heuristic (there are 4 values for that column). So I need an output with columns for each type of name_heuristic with the suite as index and values will be count of subsystems as per each column.

I have tried using pivot_table with the following code:

df_table = pd.pivot_table(df, index='suite', columns='name_heuristics', values='subsystem', aggfunc=np.sum

But even after an HOUR, it is not done computing. What is taking so long and how can I speed it up? I even tried a groupby alternative that is still running 15 minutes and counting:

df_table = df.groupby(['name_heuristics', 'suite']).agg({'subsystem': np.sum}).unstack(level='name_heuristics').fillna(0)

Any help is greatly appreciated! I have been stuck on this for hours.

Krithika Raghavendran
  • 457
  • 3
  • 10
  • 25
  • Could you please add [MWE](https://stackoverflow.com/help/minimal-reproducible-example) (for pandas there are good hints in the answers in [this question](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples))? – sophros Sep 29 '20 at 09:13
  • your data set is to large for Pandas, you need to use a relational dbms (SQL) or something like Dask or Pyspark. – Umar.H Sep 29 '20 at 09:15
  • @Manakin, this [question](https://stackoverflow.com/questions/55404617/faster-alternatives-to-pandas-pivot-table) claims they are able to do it in mere seconds – Krithika Raghavendran Sep 29 '20 at 09:21
  • 1. all the data in that question is categorical, 2. your data is not like. 3. Pandas is memory dependent, they may have run that on a 128 GB ram machine – Umar.H Sep 29 '20 at 09:47

1 Answers1

-1

It seems pivoting more than one categorical column crashes pandas. My solution to a similar problem was converting categorical to object for the target columns, using

  1. step 1

     df['col1'] = df['col1'].astype('object')
    
     df['col2'] = df['col2'].astype('object')
    
  2. step 2

     df_pivot = pandas.pivot_table(df, columns=['col1', 'col2'], index=...
    

This was independent of dataframe size...

f stangler
  • 31
  • 5