2

I'm using crosstab to sum the sales in given areas by the Publisher. The original dataframe looks like this:

Publisher   NA_Sales    EU_Sales    JP_Sales
1   Nintendo    29.08   3.58    6.81
2   Nintendo    15.68   12.76   3.79
3   Nintendo    15.61   10.93   3.28
4   Nintendo    11.27   8.89    10.22
5   Nintendo    23.20   2.26    4.22

I did it with pivot table now I want to do it using crosstab.

salespivot1=pd.pivot_table(df, index=df.Publisher,
    aggfunc=np.sum).sort_values('NA_Sales', ascending=False)

creates:

          EU_Sales  JP_Sales    NA_Sales
Publisher           
Nintendo    390.05  454.38  775.61
Electronic Arts 373.91  14.35   599.50
Activision  215.90  6.71    432.59
Sony Computer Entertainment 186.56  74.15   266.17
Ubisoft 161.99  7.52    252.74

But using crosstab I cant recreate this dataframe, because it stacks EU_Sales on top of the NA_Sales no matter what I do

salespivot3=pd.crosstab(index=df.Publisher, columns=['NA_Sales', 'EU_Sales'],
    values=df.NA_Sales, aggfunc=sum)

creates:

col_0   NA_Sales
col_1   EU_Sales
Nintendo    775.61
Electronic Arts  599.50
Activision    432.59
Sony Computer Entertainment  266.17
Ubisoft    252.74

How can I recreate the dataframe with crosstab to give same results as pivot?

smci
  • 32,567
  • 20
  • 113
  • 146
Alex T
  • 3,529
  • 12
  • 56
  • 105
  • can you post your desired data set / DF? – MaxU - stand with Ukraine Mar 18 '17 at 11:45
  • It comes from this site: https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings – Alex T Mar 18 '17 at 11:45
  • I don't understand what does it have to do with the `crosstab()` function. Are you after: `df.groupby('Publisher', as_index=False).sum()`? – MaxU - stand with Ukraine Mar 18 '17 at 11:54
  • Okay so Im learning how gropuby, pivot table and crosstab work. And from what Ive understood they should produce similar results when coded properly, but I cant recreate the dataframe with crosstab to be same as with groupby or pivot_table. Could you help me with that? – Alex T Mar 18 '17 at 11:56
  • Related: [How is a Pandas crosstab different from a Pandas pivot_table?](https://stackoverflow.com/questions/36267745/how-is-a-pandas-crosstab-different-from-a-pandas-pivot-table) – smci Jul 15 '18 at 10:13

2 Answers2

8

It's not possible to use pd.crosstab() directly on your current DF unless you reshape them from wide to a long format so that the resulting headers would later serve as subsequent parameters to be passed into it's function call.

Here's a slight hack:

idx = ["Publisher"]
d = pd.melt(df, id_vars=idx)
pd.crosstab(d.Publisher, d.variable, d.value, aggfunc="sum", rownames=idx, colnames=[None])

enter image description here

But honestly, you should be using either a groupby/pivot_table approach which is designed for this exact purpose.

Nickil Maveli
  • 29,155
  • 8
  • 82
  • 85
  • 1
    Thanks, exactly what I was looking for. I probably wont ever use this method for that kind of grouping but I wanted tangible proof that it somehow can be used in similar way as the other two, but with some differences which now I understand. – Alex T Mar 18 '17 at 12:21
3

I think your are "misusing" pivot_table and crosstab methods.

Pivot methods suppose to transform your data set from long format to wide format.

Here is a small demo:

Source DF:

In [42]: df
Out[42]:
     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7

pivot_table usage:

In [43]: df.pivot_table(values='D', index=['A', 'B'], columns=['C'], aggfunc='sum')
Out[43]:
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0

now the same result using pd.crosstab() method:

In [44]: pd.crosstab(index=[df.A,df.B], columns=df.C, values=df.D, aggfunc='sum')
Out[44]:
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0

for your sample data set you simply want to groupby + sum:

In [46]: df
Out[46]:
  Publisher  NA_Sales  EU_Sales  JP_Sales
1  Nintendo     29.08      3.58      6.81
2  Nintendo     15.68     12.76      3.79
3  Nintendo     15.61     10.93      3.28
4  Nintendo     11.27      8.89     10.22
5  Nintendo     23.20      2.26      4.22

In [47]: df.groupby('Publisher', as_index=False).sum()
Out[47]:
  Publisher  NA_Sales  EU_Sales  JP_Sales
0  Nintendo     94.84     38.42     28.32

UPDATE: if you by all means want to generate the same data set using crosstab method you can do it this way:

In [63]: x = df.set_index('Publisher').stack().reset_index(name='val')

In [64]: x
Out[64]:
   Publisher   level_1    val
0   Nintendo  NA_Sales  29.08
1   Nintendo  EU_Sales   3.58
2   Nintendo  JP_Sales   6.81
3   Nintendo  NA_Sales  15.68
4   Nintendo  EU_Sales  12.76
5   Nintendo  JP_Sales   3.79
6   Nintendo  NA_Sales  15.61
7   Nintendo  EU_Sales  10.93
8   Nintendo  JP_Sales   3.28
9   Nintendo  NA_Sales  11.27
10  Nintendo  EU_Sales   8.89
11  Nintendo  JP_Sales  10.22
12  Nintendo  NA_Sales  23.20
13  Nintendo  EU_Sales   2.26
14  Nintendo  JP_Sales   4.22

In [65]: pd.crosstab(index=x.Publisher, columns=x.level_1, values=x.val, aggfunc='sum')
Out[65]:
level_1    EU_Sales  JP_Sales  NA_Sales
Publisher
Nintendo      38.42     28.32     94.84

NOTE: we had first to convert your original DF from wide to long format and then back to wide format using pd.crosstab

MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • I understand but according to this Question: http://stackoverflow.com/questions/30679467/pivot-tables-or-group-by-for-pandas it is possible to use the 3 interchangably and I was actually able to use pivot table and groupby to produce proper results. However it didnt workout with crosstab, so Im not sure if it is wrong with the code or its not possible at all – Alex T Mar 18 '17 at 12:12