6

(Please note that there's a question Pandas: group by and Pivot table difference, but this question is different.)

Suppose you start with a DataFrame

df = pd.DataFrame({'a': ['x'] * 2 + ['y'] * 2, 'b': [0, 1, 0, 1], 'val': range(4)})
>>> df
Out[18]: 
   a  b  val
0  x  0    0
1  x  1    1
2  y  0    2
3  y  1    3

Now suppose you want to make the index a, the columns b, the values in a cell val, and specify what to do if there are two or more values in a resulting cell:

b  0  1
a      
x  0  1
y  2  3

Then you can do this either through

df.val.groupby([df.a, df.b]).sum().unstack()

or through

pd.pivot_table(df, index='a', columns='b', values='val', aggfunc='sum')

So it seems to me that there's a simple correspondence between correspondence between the two (given one, you could almost write a script to transform it into the other). I also thought of more complex cases with hierarchical indices / columns, but I still see no difference.

Is there something I've missed?

  • Are there operations that can be performed using one and not the other?

  • Are there, perhaps, operations easier to perform using one over the other?

  • If not, why not deprecate pivot_tale? groupby seems much more general.

Ami Tavory
  • 74,578
  • 11
  • 141
  • 185
  • pivot_table is probably an option for people who used Excel before. Being one of them, it certainly sounds more idiomatic to me. It also has options to add subtotals and grand totals. – ayhan Sep 24 '16 at 11:10
  • 1
    @ayhan (?! thought I posted a comment, but can't find it). Thanks! The Excel crossover point, in particular, makes a lot of sense. – Ami Tavory Sep 25 '16 at 06:10

1 Answers1

6

If I understood the source code for pivot_table(index, columns, values, aggfunc) correctly it's tuned up equivalent for:

df.groupby([index + columns]).agg(aggfunc).unstack(columns)

plus:

  • margins (subtotals and grand totals as @ayhan has already said)
  • pivot_table() also removes extra multi-levels from columns axis (see example below)
  • convenient dropna parameter: Do not include columns whose entries are all NaN

Demo: (I took this DF from the docstring [source code for pivot_table()])

In [40]: df
Out[40]:
     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7

In [41]: df.pivot_table(index=['A','B'], columns='C', values='D', aggfunc=[np.sum,np.mean])
Out[41]:
          sum        mean
C       large small large small
A   B
bar one   4.0   5.0   4.0   5.0
    two   7.0   6.0   7.0   6.0
foo one   4.0   1.0   2.0   1.0
    two   NaN   6.0   NaN   3.0

pay attention at the top level column: D

In [42]: df.groupby(['A','B','C']).agg([np.sum, np.mean]).unstack('C')
Out[42]:
            D
          sum        mean
C       large small large small
A   B
bar one   4.0   5.0   4.0   5.0
    two   7.0   6.0   7.0   6.0
foo one   4.0   1.0   2.0   1.0
    two   NaN   6.0   NaN   3.0

why not deprecate pivot_tale? groupby seems much more general.

IMO, because it's very easy to use and very convenient! ;)

Community
  • 1
  • 1
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419