53

On Pandas documentation of the pivot method, we have:

Examples
--------
>>> df
    foo   bar  baz
0   one   A    1.
1   one   B    2.
2   one   C    3.
3   two   A    4.
4   two   B    5.
5   two   C    6.

>>> df.pivot('foo', 'bar', 'baz')
     A   B   C
one  1   2   3
two  4   5   6

My DataFrame is structured like this:

   name   id     x
----------------------
0  john   1      0
1  john   2      0
2  mike   1      1
3  mike   2      0

And I want something like this:

      1    2   # (this is the id as columns)
----------------------
mike  0    0   # (and this is the 'x' as values)
john  1    0

But when I run the pivot method, it is saying:

*** ReshapeError: Index contains duplicate entries, cannot reshape

Which doesn't makes sense, even in example there are repeated entries on the foo column. I'm using the name column as the index of the pivot, the first argument of the pivot method call.

piRSquared
  • 285,575
  • 57
  • 475
  • 624
Tarantula
  • 19,031
  • 12
  • 54
  • 71

6 Answers6

97

As far as I can tell with updates to pandas, you have to use pivot_table() instead of pivot().

pandas.pivot_table(df,values='count',index='site_id',columns='week')
eduffy
  • 39,140
  • 13
  • 95
  • 92
Alison S
  • 1,759
  • 1
  • 14
  • 28
  • I had the same problem, and your answer solved it for me (in contrast to the others). I am using Pandas 0.12.0. – hans_meine Nov 21 '13 at 11:42
  • DataFrame.duplicated() seems to return non-duplicated rows; it's ignoring my date column (i.e. when you have duplicated measurements on different dates, duplicated() claims that they are duplicates, even though they have different dates.) – John Prior Sep 11 '14 at 22:05
  • 7
    What's the difference? – crhodes Nov 17 '15 at 10:30
  • This answer is barking up the wrong tree. pivot_table makes a multiindex object, rather than a normal table. – foobarbecue Oct 30 '17 at 01:42
  • 2
    I provide several detailed examples and alternatives in this [**Q&A**](https://stackoverflow.com/q/47152691/2336654) – piRSquared Nov 11 '17 at 22:15
34

Try this,

#drop_duplicates removes entries which have same values for 'foo' and 'bar'
df = df.drop_duplicates(['foo','bar'])
df.pivot('foo','bar','baz')
2ank3th
  • 2,948
  • 2
  • 21
  • 35
Tom
  • 341
  • 3
  • 4
  • Great catch! I Actually just had to drop my duplicates. Thank you! – Nico Oct 01 '16 at 13:54
  • 8
    I think this is probably the correct answer for most people visiting here. People are confused because they think pandas is demanding a unique index. Actually pandas is demanding that index and columns *unique together* i.e. there are no rows in the original frame where *both* index and the columns column are repeated. – foobarbecue Oct 30 '17 at 01:44
  • this answer (and foobarbeque's elaboration) were super helpful to me in understanding that pandas error. Just want to add though that in my case (and in my experience, most cases), doing a `drop_duplicates` would remove the error and hide the problem, but not actually fix the problem. generally you'd do an operation like this when you're assuming your index/cols are unique identifiers of some type of record you're pivoting, so there's probably a bug upstream somewhere if you're reaching for drop-duplicates here. Again, at least in my experience anyway – Max Power Feb 23 '18 at 03:51
  • @foobarbecue, very nice comment! Now I wonder why is it that the error message is focused on `index` only and not no the pair `index,column` ... nevertheless, thanks! – Lucas Aimaretto Jan 22 '19 at 15:54
6

Works fine for me? Can you post the exact pivot method call you're using?

In [4]: df.pivot('name', 'id', 'x')
Out[4]: 
id    1  2
name      
john  0  0
mike  1  0
Wes McKinney
  • 101,437
  • 32
  • 142
  • 108
  • 3
    I have lots of data, this is a little functional example, I'll try to use the original data, but the error message is by itself nonsensical, because the index CAN have duplicate entries, what you think ? – Tarantula Jun 29 '12 at 02:38
  • 1
    If you have duplicates you may need to aggregate first. It would be nice to add an option to pivot to take either the first or last observed entry: http://github.com/pydata/pandas/issues/1865 – Wes McKinney Sep 08 '12 at 23:11
  • You shouldn't have this problem with python3 – Alex Nov 02 '16 at 13:07
  • 6
    I still have this problem with python 3. – foobarbecue Oct 30 '17 at 01:31
5

As pointed out by several answers, the problem isn't that you have duplicate values in your index (the error message certainly does not help here), but rather that you have duplicates of (index, column). Some answers are suggesting that you just drop these duplicates but I'd be careful about doing so - in my experience this is rarely the correct choice. More often than not, you probably want to aggregate your data somehow, and then pivot.

I'm pulling some examples and quotes from this blog post, which I suggest you read for more details, below.

Given data like this:

df = pd.DataFrame([
    ['a', 'x', 1],
    ['a', 'x', 2],
    ['b', 'x', 3],
    ['b', 'y', 4]
], columns=['g1', 'g2', 'value'])

which prints like this:

>>> print(df)
  g1 g2  value
0  a  x      1
1  a  x      2
2  b  x      3
3  b  y      4

we get a ValueError when attempting to pivot with g1 as the index and g2 as the columns:

>>> df.pivot(index='g1', columns='g2', values='value')
...
ValueError: Index contains duplicate entries, cannot reshape

Notice that rows 0 and 1 have the same values for g1 and g2: (a, x). So when pandas creates your pivoted dataframe, for the a index, g1 column, how do pick just one value: 1 or 2? The answer is... we can't! This is why dropping duplicates works, but it may not be you want, since you're losing potentially useful data. So what can we do instead?

Solution 1: Aggregate

There won't always be an aggregate function that makes sense for your use case, but if there is, there are several ways to accomplish this.

df.pivot_table(index='g1', columns='g2', values='value', aggfunc='sum')
df_agg = df.groupby(by=['g1', 'g2']).value.sum().reset_index()
df_agg.pivot(index='g1', columns='g2', values='value')
df.groupby(by=['g1', 'g2']).value.sum().unstack()

All of these yield the same result:

g2    x    y
g1          
a   3.0  NaN
b   3.0  4.0

But what if you don't need the sum? Maybe comma separated values are useful in your case?

df.pivot_table(
    index='g1',
    columns='g2',
    values='value',
    aggfunc=lambda x: ','.join(x.astype('str'))
)
# we need to convert to strings before we can join

to get:

g2    x    y
g1          
a   1,2  NaN
b     3    4

or you can use list as your aggfunc:

pv = df.pivot_table(index='g1', columns='g2', values='value', aggfunc=list)

and then we can explode!

>>> pv.explode('x').explode('y')
g2  x    y
g1        
a   1  NaN
a   2  NaN
b   3    4

Solution 2: Give yourself another key

This is based on this answer

>>> df['key'] = df.groupby(['g1', 'g2']).cumcount()
>>> df
  g1 g2  value  key
0  a  x      1    0
1  a  x      2    1
2  b  x      3    0
3  b  y      4    0

and now we can pivot with a composite index:

>>> df.pivot(index=['key', 'g1'], columns='g2', values='value').reset_index().drop(columns='key')
g2 g1    x    y
0   a  1.0  NaN
1   b  3.0  4.0
2   a  2.0  NaN

This is almost the same result as the exploded example above, just a set_index('g1') away.

Hope this helps! I hit this problem quite often and usually forget all of this..

Radu
  • 8,561
  • 8
  • 55
  • 91
3

friends, I've had such problem. In my case problem was in data - my column 'information' contained 1 unique value and it caused error.

UPD: to correct work 'pivot' pairs (id_user,information) mustn't have dublicates

It works:

df2 = pd.DataFrame({'id_user':[1,2,3,4,4,5,5], 
'information':['phon','phon','phone','phone1','phone','phone1','phone'], 
'value': [1, '01.01.00', '01.02.00', 2, '01.03.00', 3, '01.04.00']})
df2.pivot(index='id_user', columns='information', values='value')

it doesn't work:

df2 = pd.DataFrame({'id_user':[1,2,3,4,4,5,5], 
'information':['phone','phone','phone','phone','phone','phone','phone'], 
'value': [1, '01.01.00', '01.02.00', 2, '01.03.00', 3, '01.04.00']})
df2.pivot(index='id_user', columns='information', values='value')
Grag2015
  • 591
  • 9
  • 15
1

My data has no duplicated pivot pairs and still pivot_table throws a key error :( drop_duplicates() truncates my data to the first value of the pair.

Input:

Well    Reading     Filter 4
0   A2  1   116041
1   B2  1   105191
2   C2  1   93942
3   D2  1   96821
4   E2  1   85622
5   F2  1   90227
6   G2  1   95801
7   H2  1   107833
8   A2  2   115765
9   B2  2   104395
10  C2  2   93986
...
1630    G2  204     388682
1631    H2  204     444708

1632 rows × 3 columns

df_X2.pivot_table('Reading', 'Well', 'Filter 4')

throws: KeyError: 'Reading'

df_X2_uniq=df_X2.drop_duplicates(['Well', 'Reading']) truncates the data to the first 8 rows:

    Well    Reading     Filter 4
0   A2  1   116041
1   B2  1   105191
2   C2  1   93942
3   D2  1   96821
4   E2  1   85622
5   F2  1   90227
6   G2  1   95801
7   H2  1   107833

After 2 hours of combing through the posts I'm none the wiser... any hints of what I should try to get the pivot to work?

user1255933
  • 212
  • 2
  • 10