Pandas DataFrame won't pivot. Says duplicate indices

Question

So basically I have 3 columns in my dataframe as follows:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 158143 entries, 0 to 203270
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   users          158143 non-null  int64         
 1   dates          158143 non-null  datetime64[ns]
 2   medium_of_ans  158143 non-null  object

And I want it to be reshaped such that each entry in medium_of_ans value has a separate column and dates as row indices with users of a particular answer medium on a particular date resides in the junction of that row and column. In pandas similar functionality can be achieved by pivoting the dataframe although I am not able to achieve that as following attempt:

df.pivot(columns= 'medium_of_ans', index = 'dates', values = 'users')

throws this error:

ValueError: Index contains duplicate entries, cannot reshape

And I'm not sure why as a dataframe to be pivoted will obviously have duplicates in indices. That's why it is being pivoted. Resetting dataframe index as follows:

df.reset_index().pivot(columns= 'medium_of_ans', index = 'dates', values = 'users')

does not help either and error persists.

You need some sort of aggregation for duplicate records. – Scott Boston Apr 14 '20 at 02:44 — Scott Boston, Apr 14 '20 at 02:44
https://stackoverflow.com/q/47152691/6361531 – Scott Boston Apr 14 '20 at 02:45 — Scott Boston, Apr 14 '20 at 02:45

score 1 · Accepted Answer · answered Apr 14 '20 at 03:03

You have duplicates not just by the index, dates, but by the combination of index and column together, the combined dates and medium_of_ans.

You can find these duplicates with something like this:

counts = df.groupby(['dates', 'medium_of_ans']).size().reset_index(name='n')
duplicates = counts[counts['n'] > 1]

If you want to combine the duplicates, for example by taking the mean of users for the cell, then you can use pivot_table.

df.pivot_table(columns='medium_of_ans', index='dates', values='users', aggfunc='mean')

Taking the mean is the default, but I have added the explicit parameter for clarity.

Thanks! this is what I was looking for. Though sum is the method for me — Hamza, Apr 14 '20 at 03:25

Pandas DataFrame won't pivot. Says duplicate indices

1 Answers1