
I do not understand why this works

df[(df['Gold']>0) & (df['Gold.1']>0)].loc[((df['Gold'] - df['Gold.1'])/(df['Gold'])).abs().idxmax()]

but when I divide by (df['Gold'] + df['Gold.1'] + df['Gold.2']) instead, it stops working and raises the error shown below.

Interestingly, the following line works

df.loc[((df['Gold'] - df['Gold.1'])/(df['Gold'] + df['Gold.1'] + df['Gold.2'])).abs().idxmax()]

I have only just started learning Python and pandas, so I do not understand what is happening. Why does this happen, and how can I fix it?

ERROR

KeyError: 'the label [Algeria] is not in the [index]'

DataFrame snapshot (image)

Julien Marrec
YohanRoth
  • Try `print(df.index.tolist())`, you might have some spaces in there. – IanS Jan 02 '17 at 13:55
  • @MaharajaX: in the future please post a text sample of your dataframe so that we can play with it (or code to produce it), not a picture. See [How to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for example. Thanks, and good luck with your course ;) – Julien Marrec Jan 02 '17 at 19:52
  • The sample dataframe doesn't help much because the Winter medal counts (`Gold.1,Silver.1,Bronze.1,Total.1`) are zero for all countries. By the way, I would have named those series `Gold.S, Gold.W, Gold` just to be clear. – smci Feb 22 '18 at 10:25
  • If you post reproducible code and a dataset (or URL), we could reply. It's a nice question for practising good idiom on. The cause of your bug is chained indexing, i.e. `df[...][...]`: the LHS expression gives you a new, filtered object, which the RHS expression then tries to process, instead of working directly on the source df. `df.filter` might be a better way to go... – smci Feb 22 '18 at 10:28

1 Answer


Your problem is boolean indexing:

df[(df['Gold']>0) & (df['Gold.1']>0)]

returns a filtered DataFrame, and that DataFrame does not contain the index label of the maximum of the Series you computed with this:

((df['Gold'] - df['Gold.1'])/(df['Gold'] + df['Gold.1'] + df['Gold.2'])).abs().idxmax()

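In your data that label is Algeria, whose Gold.1 is 0, so the filter removes it.

So loc is asked for a label that is not in the filtered index, and it raises a KeyError.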
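The mismatch can be reproduced on a tiny frame. The sketch below uses made-up medal counts (the values and the three-row index are assumptions for illustration, not your actual data) and shows that the label returned by idxmax on the unfiltered df is missing from the filtered frame:

```python
import pandas as pd

# Hypothetical medal counts, invented for illustration (not the asker's data).
df = pd.DataFrame(
    {"Gold": [5, 10, 4], "Gold.1": [0, 8, 1], "Gold.2": [5, 18, 5]},
    index=["Algeria", "Norway", "Canada"],
)

# The boolean filter drops Algeria because its Gold.1 is 0.
filtered = df[(df["Gold"] > 0) & (df["Gold.1"] > 0)]

# The ratio is computed on the unfiltered df, so idxmax can return
# a label that the filter has removed.
total = df["Gold"] + df["Gold.1"] + df["Gold.2"]
ratio = ((df["Gold"] - df["Gold.1"]) / total).abs()
label = ratio.idxmax()

print(label)                    # Algeria
print(label in filtered.index)  # False

# Looking that label up in the filtered frame fails.
try:
    filtered.loc[label]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```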

One possible solution is to assign the filtered DataFrame to df1 and then compute the Series, and its idxmax, from df1 as well, so that the returned label is guaranteed to be in df1's index:

df1 = df[(df['Gold']>0) & (df['Gold.1']>0)]
df2 = df1.loc[((df1['Gold']-df1['Gold.1'])/(df1['Gold']+df1['Gold.1']+df1['Gold.2'])).abs().idxmax()]
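One way to make this harder to get wrong is to put the ratio and the lookup in a single function that only ever sees one frame. This is a minimal sketch under assumptions: the helper name `row_with_largest_ratio` and the sample values are hypothetical, and only the column names come from the question.

```python
import pandas as pd


def row_with_largest_ratio(frame):
    """Return the row of `frame` with the largest |Gold - Gold.1| / total gold.

    The ratio and the .loc lookup both use `frame`, so the label found by
    idxmax is always present in the index being searched.
    """
    total = frame["Gold"] + frame["Gold.1"] + frame["Gold.2"]
    ratio = ((frame["Gold"] - frame["Gold.1"]) / total).abs()
    return frame.loc[ratio.idxmax()]


# Hypothetical medal counts, invented for illustration.
df = pd.DataFrame(
    {"Gold": [5, 10, 4], "Gold.1": [0, 8, 1], "Gold.2": [5, 18, 5]},
    index=["Algeria", "Norway", "Canada"],
)

df1 = df[(df["Gold"] > 0) & (df["Gold.1"] > 0)]
best = row_with_largest_ratio(df1)
print(best.name)  # Canada
```

Calling the same helper on the unfiltered `df` returns the Algeria row instead, which matches the working `df.loc[...]` line in the question.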
Julien Marrec
jezrael
  • I did not really get this: "return df which not contains index of max value of Series". So you are saying the max value is not in the DataFrame that is returned after the boolean operation? I thought we first apply the boolean filter and then find the max value in what was filtered. Isn't that how it works? – YohanRoth Jan 02 '17 at 15:48
  • No, because although you filter it, you don't use the filtered values in `((df['Gold'] - df['Gold.1'])/(df['Gold'] + df['Gold.1'] + df['Gold.2'])).abs().idxmax()` but the original, unfiltered ones. Btw, this is a very hard error to debug, because sometimes it works fine (if the filtered DataFrame contains the idxmax label) and sometimes it fails when the values change. If `((df['Gold'] - df['Gold.1'])/(df['Gold'] + df['Gold.1'] + df['Gold.2'])).abs().idxmax()` returns `Algeria`, you can see that `Gold.1==0` for it, so it does not satisfy `(df['Gold.1']>0)` – jezrael Jan 02 '17 at 15:54
  • Hmm, thanks. That is so weird. What is even the point of allowing code like this when it causes such subtle errors and does not work the way one expects? I expected it to be evaluated from left to right. Instead it works so weirdly :( Anyway, thanks! – YohanRoth Jan 02 '17 at 15:58
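The left-to-right reading expected in the last comment can be obtained by passing a callable to `.loc`: pandas calls it with the object being indexed, so the ratio is computed on the already filtered frame. This is a sketch under assumptions; the function name `largest_ratio_label` and the sample values are hypothetical.

```python
import pandas as pd


def largest_ratio_label(frame):
    """Return the index label with the largest |Gold - Gold.1| / total gold."""
    total = frame["Gold"] + frame["Gold.1"] + frame["Gold.2"]
    return ((frame["Gold"] - frame["Gold.1"]) / total).abs().idxmax()


# Hypothetical medal counts, invented for illustration.
df = pd.DataFrame(
    {"Gold": [5, 10, 4], "Gold.1": [0, 8, 1], "Gold.2": [5, 18, 5]},
    index=["Algeria", "Norway", "Canada"],
)

# .loc receives the function itself; pandas calls it with the filtered frame,
# so the label it returns is always in that frame's index.
row = df[(df["Gold"] > 0) & (df["Gold.1"] > 0)].loc[largest_ratio_label]
print(row.name)  # Canada
```

In the original one-liner the expression inside `.loc[...]` names `df` explicitly, so it is computed from the unfiltered frame no matter what stands to its left.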