
My question may seem like a duplicate, as I found several questions with the same error:

Pandas: grouping a column on a value and creating new column headings

Python/Pandas - ValueError: Index contains duplicate entries, cannot reshape

Pandas pivot produces "ValueError: Index contains duplicate entries, cannot reshape"

I tried all the solutions presented in those posts, but none worked. I believe the error may be caused by my dataset's format, which has strings instead of numbers and possibly duplicate entries. Here is an example of my dataset:

protocol_no  activity  description
1586212      walk      twice a day
1586212      drive     5 km
1586212      drive     At least 30 min
1586212      sleep     NaN
1586212      eat       1500 calories
2547852      walk      NaN
2547852      drive     NaN
2547852      eat       3200 calories
2547852      eat       Avoid pasta
2547852      sleep     At least 10 hours

The output I'm trying to achieve is:

protocol_no  walk         drive  sleep              eat
1586212      twice a day  5 km   NaN                1500 calories
2547852      NaN          NaN    At least 10 hours  3200 calories

I tried using pivot and pivot_table with a code like this:

df.pivot(index="protocol_no", columns="activity", values="description")

But I'm still getting this error:

ValueError: Index contains duplicate entries, cannot reshape
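The error can be reproduced with the sample dataset above (a minimal sketch; the DataFrame below is built from the example data in the question). `pivot` fails because the pairs (1586212, 'drive') and (2547852, 'eat') each appear twice, so there is no unique cell for them in the pivoted grid:

```python
import numpy as np
import pandas as pd

# Sample data from the question; (1586212, 'drive') and (2547852, 'eat')
# each occur twice, which is what makes pivot() fail.
df = pd.DataFrame({
    "protocol_no": [1586212] * 5 + [2547852] * 5,
    "activity": ["walk", "drive", "drive", "sleep", "eat",
                 "walk", "drive", "eat", "eat", "sleep"],
    "description": ["twice a day", "5 km", "At least 30 min", np.nan,
                    "1500 calories", np.nan, np.nan, "3200 calories",
                    "Avoid pasta", "At least 10 hours"],
})

msg = ""
try:
    df.pivot(index="protocol_no", columns="activity", values="description")
except ValueError as e:
    msg = str(e)
print(msg)  # Index contains duplicate entries, cannot reshape

# The offending rows can be listed directly:
dupes = df[df.duplicated(["protocol_no", "activity"], keep=False)]
print(dupes)
```

Listing the duplicated (protocol_no, activity) pairs this way is a quick check for whether `pivot` can work at all on a given frame.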

I have no idea what is going wrong, so any help is appreciated!

EDIT:

I noticed my data contains duplicate entries, as stated by the error and by the users @DYZ and @SeaBean. So I've edited the dataset example and provided the correct answer for my dataset as well. Hope it helps someone.

  • I executed your pivot statement on the provided data and got no error messages. Have you used the same data for your experiment? – DYZ Oct 07 '21 at 16:00
  • Not really. My data is strictly confidential, so I can't share it. I provided a similar example, but as you have stated, it wasn't enough. – Guilherme Noronha Oct 07 '21 at 17:08

2 Answers


Try using .pivot_table() with aggfunc='first' (or something similar) if you get a duplicate index error when using .pivot():

df.pivot_table(index="protocol_no", columns="activity", values="description", aggfunc='first')

This is a common situation when the column you set as the index has duplicated values. Using aggfunc='first' (or sometimes aggfunc='sum', depending on the situation) can most probably solve the problem.
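For instance, run against a frame built from the question's sample data (a sketch; the values are as assumed from the question):

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the question
df = pd.DataFrame({
    "protocol_no": [1586212] * 5 + [2547852] * 5,
    "activity": ["walk", "drive", "drive", "sleep", "eat",
                 "walk", "drive", "eat", "eat", "sleep"],
    "description": ["twice a day", "5 km", "At least 30 min", np.nan,
                    "1500 calories", np.nan, np.nan, "3200 calories",
                    "Avoid pasta", "At least 10 hours"],
})

# 'first' keeps the first non-null description
# per (protocol_no, activity) group
pt = df.pivot_table(index="protocol_no", columns="activity",
                    values="description", aggfunc="first")
print(pt)
```

Note that for the duplicated (1586212, 'drive') pair, only the first description ('5 km') survives; the second one ('At least 30 min') is silently dropped, which is the trade-off of `aggfunc='first'`.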

Result:

activity    drive            eat              sleep         walk
protocol_no                                                     
1586212      5 km  1500 calories                NaN  twice a day
2547852       NaN  3200 calories  At least 10 hours          NaN

Edit

Based on your latest edit with duplicate entries, you can just modify the solution above by changing the aggfunc function, as follows:

df.pivot_table(index="protocol_no", columns="activity", values="description", aggfunc=lambda x: ' '.join(x.dropna()))

Here, we change the aggfunc from 'first' to lambda x: ' '.join(x.dropna()). It achieves the same result as your desired output without adding multiple lines of code.

Result:

activity                    drive                        eat              sleep         walk
protocol_no                                                                                 
1586212      5 km At least 30 min              1500 calories                     twice a day
2547852                            3200 calories Avoid pasta  At least 10 hours             
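The tweaked call can be verified like this (again a sketch over the same sample frame assumed from the question):

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the question
df = pd.DataFrame({
    "protocol_no": [1586212] * 5 + [2547852] * 5,
    "activity": ["walk", "drive", "drive", "sleep", "eat",
                 "walk", "drive", "eat", "eat", "sleep"],
    "description": ["twice a day", "5 km", "At least 30 min", np.nan,
                    "1500 calories", np.nan, np.nan, "3200 calories",
                    "Avoid pasta", "At least 10 hours"],
})

# Join every non-null description within each
# (protocol_no, activity) group into a single string
pt = df.pivot_table(index="protocol_no", columns="activity",
                    values="description",
                    aggfunc=lambda x: " ".join(x.dropna()))
print(pt)
```

Groups whose descriptions are all NaN come out as empty strings rather than NaN, which is why the blanks appear in the result above.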
SeaBean
  • Your solution worked on my data. Could you please explain more about what "first" does on aggregation? – Guilherme Noronha Oct 07 '21 at 17:37
  • @GuilhermeNoronha Right, this is expected. Most of the time `first` works well in this case. As an aggregation function, `first` is essentially the same as `GroupBy.first` (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.first.html), and `pivot_table` is sometimes used interchangeably with the `groupby()` function as well. Here, `first` gets the first value in each group in case of multiple (duplicated) values in a group. – SeaBean Oct 07 '21 at 18:16
  • @GuilhermeNoronha The group here is the `protocol_no`, `activity`, `description` combination. If you have one description for a particular `protocol_no` and `activity`, it will get that only `description` (this is the most common use case). In case you have multiple descriptions for one `protocol_no` and `activity` combination, it will take the first description. As that case is relatively rare, `first` can most probably get the desired results. – SeaBean Oct 07 '21 at 18:21
  • Hi @GuilhermeNoronha Seen your edit mentioning your data has duplicate entries. So, how would you like to deal with it? One common way to solve this is to aggregate the multiple entries into one. Let me know if you want to do so. I can give you some hints on slightly fine-tuning the code above to consolidate/aggregate the duplicate entries into one. – SeaBean Oct 08 '21 at 15:27
  • Hi SeaBean. I just posted an answer based on the insights you and DYZ gave me. I used groupby to aggregate the duplicated entries and pivoted afterwards. Anyway, thank you very much. The insights you gave me were essential to finding the best solution. :) – Guilherme Noronha Oct 08 '21 at 15:35
  • @GuilhermeNoronha I've seen your solution, but it seems a little clumsy (sorry to say so). You can actually do it much more simply. I am editing my solution with just a small change to achieve the same result you want. You will see it soon. – SeaBean Oct 08 '21 at 15:38
  • @GuilhermeNoronha See my edit above. Here's the power of using `.pivot_table()` with `aggfunc`: it can achieve quite a lot of the features/functions you want. Try the new code and let me know what you think of it. :-) – SeaBean Oct 08 '21 at 15:45
  • Indeed your solution is much better than mine. I'm going to accept it. Thank you again for taking your time to help me. :) – Guilherme Noronha Oct 08 '21 at 16:58
  • @GuilhermeNoronha You're welcome! You could have asked me whether I could fine-tune my solution for your new finding of duplicate entries; that would have saved you time devising the code. Anyway, it's still good that you took the chance to think it through yourself. Overall, you should have learned more by doing this exercise! Happy programming and have a nice day! :-) – SeaBean Oct 08 '21 at 17:02
  • Of course I learned! And I learned much more from your solution as well... :) – Guilherme Noronha Oct 08 '21 at 17:14

Although the SeaBean answer worked on my data, I took a look into my data and noticed it really contained duplicated entries (as in the example in my question, which I edited later). To deal with this, the best solution is to join those duplicate entries.

1- Before the join, I needed to remove the NaNs from my dataset. Otherwise the join would raise another error:

df["description"].fillna("", inplace=True)

2- Then I executed the groupby function, joining the duplicate entries:

df = df.groupby(["protocol_no", "activity"], as_index=False).agg({"description": " ".join})

3- Last but not least, I executed the pivot as I had intended in my question:

df.pivot(index="protocol_no", columns="activity", values="description")

4- Voilà, the result:

activity                    drive                        eat              sleep         walk
protocol_no                                                                                 
1586212      5 km At least 30 min              1500 calories                     twice a day
2547852                           3200 calories Avoid pasta  At least 10 hours             
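Steps 1–3 above can be run end-to-end as follows (a sketch; the starting frame is assumed to be the sample dataset from the question):

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the question
df = pd.DataFrame({
    "protocol_no": [1586212] * 5 + [2547852] * 5,
    "activity": ["walk", "drive", "drive", "sleep", "eat",
                 "walk", "drive", "eat", "eat", "sleep"],
    "description": ["twice a day", "5 km", "At least 30 min", np.nan,
                    "1500 calories", np.nan, np.nan, "3200 calories",
                    "Avoid pasta", "At least 10 hours"],
})

# 1- Replace NaN with empty strings so str.join does not fail on floats
df["description"] = df["description"].fillna("")

# 2- Join duplicate (protocol_no, activity) descriptions into one row each
df = df.groupby(["protocol_no", "activity"], as_index=False).agg(
    {"description": " ".join})

# 3- Every (protocol_no, activity) pair is now unique, so pivot() works
result = df.pivot(index="protocol_no", columns="activity",
                  values="description")
print(result)
```

After the groupby, the frame has exactly one row per (protocol_no, activity) pair, which is the precondition `pivot` needs.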

5- The info of my dataset using df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1586212 to 2547852
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   drive   2 non-null      object
 1   eat     2 non-null      object
 2   sleep   2 non-null      object
 3   walk    2 non-null      object
dtypes: object(4)
memory usage: 80.0+ bytes   

Hope it helps someone, and many thanks to SeaBean and DYZ for the insights. :)