0

I am trying to do something like an SQL window function in Python 3.6. I have created the following code which I found here, but I'm getting the following error:

"ValueError: cannot reindex from a duplicate axis"

df = pd.DataFrame({'id' : ['daeb21718d5a','daeb21718d5a','daeb21718d5a'],
           'product_id' : [123,456,789],
           'probability' : [0.076838,0.053384, 0.843900 ]})

df['rank'] = df.sort_values(['probability'], ascending=False) \
               .groupby(['id']) \
               .cumcount() + 1

Weirdly if I add .reset_index(drop=True) before grouping, the error is fixed.

Mewtwo
  • 1,231
  • 2
  • 18
  • 38
  • Possible duplicate of [What does \`ValueError: cannot reindex from a duplicate axis\` mean?](https://stackoverflow.com/questions/27236275/what-does-valueerror-cannot-reindex-from-a-duplicate-axis-mean) – sophros Mar 29 '19 at 16:46
  • What is your expected output? Which column are you trying to give a rank? – Erfan Mar 29 '19 at 16:47
  • what Pandas version are you using? Your code works just fine for me (Pandas 0.24.2) – MaxU - stand with Ukraine Mar 29 '19 at 16:55
  • 1
    I cannot reproduce your error, your code works fine on my machine. (`pandas.__version__` is 0.23.4) – Niklas Mertsch Mar 29 '19 at 16:55
  • @MaxU I am using exactly the same version as you. – Mewtwo Mar 29 '19 at 17:01
  • @lacrima, and you are getting mentioned error when executing the code from your question?? Try to open a new Python terminal/iPython/Jupyter and execute the code from your question... – MaxU - stand with Ukraine Mar 29 '19 at 17:03
  • Still getting the same error. The weird thing is that if I remove the sort_values and use rank instead, like@Erfan suggested, it works. So I guess there is something weird with the sort_values method – Mewtwo Mar 30 '19 at 13:43
  • @lacrima, no wonder you didn't get really helpful answer - you didn't provide a __reproducible__ sample data set. – MaxU - stand with Ukraine Mar 30 '19 at 14:52

1 Answers1

2

If I understand you correctly, you are trying to rank probability descending based on each group of id.

You were almost there, the following will solve your problem:

df['rank'] = df.sort_values(['probability'], ascending=False) \
             .groupby(['id'])['probability'] \
             .cumcount() + 1

print(df)
             id  product_id  probability  rank
0  daeb21718d5a         123     0.076838     2
1  daeb21718d5a         456     0.053384     3
2  daeb21718d5a         789     0.843900     1

Or we can use pandas .rank method for this:

df['rank'] = df.groupby('id')['probability'].rank(ascending=False)

print(df)
             id  product_id  probability  rank
0  daeb21718d5a         123     0.076838   2.0
1  daeb21718d5a         456     0.053384   3.0
2  daeb21718d5a         789     0.843900   1.0
Erfan
  • 40,971
  • 8
  • 66
  • 78