
I want to generate a summary table from a tidy pandas DataFrame. I currently use groupby and two for loops, which does not seem efficient. It seems that stacking and unstacking would get me there, but I have failed.

Sample data

import pandas as pd
import numpy as np
import random

df_tidy = pd.DataFrame(columns=['Stage', 'Exc', 'Cat', 'Score'])
for _ in range(10):
    df_tidy = df_tidy.append(
        {
            'Stage': random.choice(['OP', 'FUEL', 'EOL']),
            'Exc': str(np.random.randint(low=0, high=1000)),
            'Cat': random.choice(['CC', 'HT', 'PM']),
            'Score': np.random.random(),
        },
        ignore_index=True,
    )
df_tidy

returns

    Stage   Exc Cat Score
0   OP      929 HT  0.946234
1   OP      813 CC  0.829522
2   FUEL    114 PM  0.868605
3   OP      896 CC  0.382077
4   FUEL    10  CC  0.832246
5   FUEL    515 HT  0.632220
6   EOL     970 PM  0.532310
7   FUEL    198 CC  0.209856
8   FUEL    848 CC  0.479470
9   OP      968 HT  0.348093

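Aside: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above no longer runs on current versions. A minimal sketch that builds equivalent sample data from a list of dicts (and, unlike the append loop, gives `Score` a float dtype rather than object):

import random

import numpy as np
import pandas as pd

# Build the rows first, then construct the DataFrame once.
rows = [
    {
        'Stage': random.choice(['OP', 'FUEL', 'EOL']),
        'Exc': str(np.random.randint(low=0, high=1000)),
        'Cat': random.choice(['CC', 'HT', 'PM']),
        'Score': np.random.random(),
    }
    for _ in range(10)
]
df_tidy = pd.DataFrame(rows, columns=['Stage', 'Exc', 'Cat', 'Score'])
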
I would like a new DataFrame with Stages as columns, Cats as rows and sum of Scores as values. I achieve it this way:

Working but probably inefficient approach

new_df = pd.DataFrame(columns=list(df_tidy['Stage'].unique()))
for cat, small_df in df_tidy.groupby('Cat'):
    for lcs, smaller_df in small_df.groupby('Stage'):
        new_df.loc[cat, lcs] = smaller_df['Score'].sum()
new_df['Total'] = new_df.sum(axis=1)
new_df

Which returns what I want:

    OP      FUEL        EOL     Total
CC  1.2116  1.52157     NaN     2.733170
HT  1.29433 0.63222     NaN     1.926548
PM  NaN     0.868605    0.53231 1.400915

But I cannot believe this is the simplest or most efficient path.

Question

What pandas magic am I missing out on?

Update - Timing the proposed solutions

To understand the differences between the pivot_table and crosstab solutions proposed below, I timed all three approaches on a 100,000-row DataFrame built exactly as above:
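(The loop-and-append construction above becomes very slow at 100,000 rows; for readers who want to reproduce the benchmark quickly, here is a vectorized sketch that produces an equivalent frame. This is an assumption for reproducibility, not the code actually used, and it yields a float `Score` column rather than the object dtype the append loop produces, which may itself affect the timings.)

# Hypothetical benchmark setup: 100,000 random rows with the same columns as above.
n = 100_000
df_tidy = pd.DataFrame({
    'Stage': np.random.choice(['OP', 'FUEL', 'EOL'], size=n),
    'Exc': np.random.randint(0, 1000, size=n).astype(str),
    'Cat': np.random.choice(['CC', 'HT', 'PM'], size=n),
    'Score': np.random.random(size=n),
})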

The groupby solution, which I thought was inefficient:

%%timeit
new_df = pd.DataFrame(columns=list(df_tidy['Stage'].unique()))
for cat, small_df in df_tidy.groupby('Cat'):
    for lcs, smaller_df in small_df.groupby('Stage'):
        new_df.loc[cat, lcs] = smaller_df['Score'].sum()
new_df['Total'] = new_df.sum(axis=1)

41.2 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The crosstab solution, which creates a DataFrame in the background even if the passed data is already a DataFrame:

%%timeit
pd.crosstab(index=df_tidy.Cat,columns=df_tidy.Stage, values=df_tidy.Score, aggfunc='sum', margins = True, margins_name = 'Total').iloc[:-1,:]

67.8 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The pivot_table solution:

%%timeit
pd.pivot_table(df_tidy, index=['Cat'], columns=["Stage"], margins=True, margins_name='Total', aggfunc=np.sum).iloc[:-1,:]

713 ms ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So it would appear that the clunky groupby solution is the quickest.
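For reference, the stack/unstack route mentioned at the top of the question can be written as a single groupby followed by unstack. A minimal sketch (not included in the timings above):

# Sum Score per (Cat, Stage) pair, then pivot the Stage level into columns.
new_df = df_tidy.groupby(['Cat', 'Stage'])['Score'].sum().unstack('Stage')
new_df['Total'] = new_df.sum(axis=1)

Note that the Stage columns come out sorted (EOL, FUEL, OP) rather than in order of appearance, as with the crosstab and pivot_table solutions below.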

MPa
  • It seems illogical that crosstab is more performant than pivot_table, when the former calls the latter under the hood. I ran your code and got groupby: "7.26 ms ± 351 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)", crosstab: "25.3 ms ± 303 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)" and pivot_table: "21.8 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)". – Redoute May 26 '21 at 05:26

2 Answers


A simple solution using crosstab:

pd.crosstab(index=df_tidy.Cat, columns=df_tidy.Stage, values=df_tidy.Score, aggfunc='sum', margins=True, margins_name='Total').iloc[:-1,:]
Out[342]: 
Stage      EOL      FUEL        OP     Total
Cat                                         
CC         NaN  1.521572  1.211599  2.733171
HT         NaN  0.632220  1.294327  1.926547
PM     0.53231  0.868605       NaN  1.400915
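
If the `margins=True` plus `.iloc[:-1,:]` trick (which keeps the Total column but drops the Total row) feels opaque, an equivalent sketch adds the total explicitly:

# Cross-tabulate the summed scores, then append a row-wise total column.
out = pd.crosstab(index=df_tidy.Cat, columns=df_tidy.Stage,
                  values=df_tidy.Score, aggfunc='sum')
out['Total'] = out.sum(axis=1)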
BENY
  • Accepting off the bat because, well, this is *exactly* what I was looking for. – MPa Jul 24 '18 at 17:02
  • Note however that this solution is marginally slower than the groupby solution I had (though less clunky) - see performance check in my question's edit. – MPa Jul 24 '18 at 18:56

I was wondering whether a simpler solution than pd.crosstab might be to use pd.pivot_table:

pd.pivot_table(df_tidy, index=['Cat'], columns=["Stage"], margins=True, margins_name='Total', aggfunc=np.sum).iloc[:-1,:]
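
A possible explanation (my reading, not confirmed by the answerer) for why this call timed slowest in the question's update: since no `values=` is given, pivot_table aggregates every remaining column, and with `aggfunc=np.sum` the string `Exc` column may be aggregated by concatenation (behaviour varies across pandas versions), which is expensive and also adds an extra level to the output columns. A sketch that restricts the aggregation to `Score`:

# Aggregate only the Score column; untimed here.
pd.pivot_table(df_tidy, values='Score', index='Cat', columns='Stage',
               aggfunc='sum', margins=True, margins_name='Total').iloc[:-1, :]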
sophros
  • Indeed, why not? Are there any differences? – MPa Jul 24 '18 at 17:52
  • The answer to that question is here: https://stackoverflow.com/questions/36267745/how-is-a-pandas-crosstab-different-from-a-pandas-pivot-table. So, in this particular case, `pivot_table` is actually better (the answer in the link states "In general, use pivot_table if you already have a DataFrame, so you don't have the additional overhead of creating the same DataFrame again."). – MPa Jul 24 '18 at 18:34
  • I thought this warranted a performance check: pivot_table is actually the slowest of the three approaches I tested. Hard to see why... – MPa Jul 24 '18 at 18:57
  • @MPa: Could you please paste the performance comparison results? – sophros Jul 25 '18 at 18:12
  • Indeed. Sorry didn't notice before. – sophros Jul 26 '18 at 04:28