1

I have a df with two columns x and y . Column y is cum count of x values. x values have different counts. How do I get a result df of top two y counts for each x without iterating through rows.

Example df:

df = pd.DataFrame({"x": [101, 101, 101, 101, 201, 201, 201, 405, 405], "y": [1, 2, 3, 4, 1, 2, 3, 1, 2]})

     x  y
0  101  1
1  101  2
2  101  3
3  101  4
4  201  1
5  201  2
6  201  3
7  405  1
8  405  2

Desired result:

x      y
101    3
101    4
201    2
201    3
405    1
405    2
ayhan
  • 70,170
  • 20
  • 182
  • 203
obabs
  • 8,671
  • 4
  • 12
  • 17
  • http://stackoverflow.com/a/32397818/2285236 – ayhan Apr 28 '16 at 14:15
  • 1
    do you really need a separate `y` column? can't you just use `df.x.count` to find out what the greatest y value would be. And then do some kind of masked duplication to get n-1. – dan-man Apr 28 '16 at 14:20

2 Answers2

1

You can do it this way:

In [35]:
df.loc[df.groupby(['x'])['y'].apply(lambda x: x.iloc[-2:]).index.get_level_values(1)]

Out[35]:
     x  y
2  101  3
3  101  4
5  201  2
6  201  3
7  405  1
8  405  2

So this groupby on 'x' column and returns the last 2 values, assuming that the df is already sorted as you've shown. This produces a df with a multindex and the second level values can be used to index back into the original df by using get_level_values

EDIT

To answer your comment you can groupby again and use transform with rank to reset the values to 1 and 2:

In [51]:
df1 = df.loc[df.groupby(['x'])['y'].apply(lambda x: x.iloc[-2:]).index.get_level_values(1)]
df1['y'] = df1.groupby('x')['y'].transform(lambda x: x.rank(method='first'))
df1
​
Out[51]:
     x  y
2  101  1
3  101  2
5  201  1
6  201  2
7  405  1
8  405  2
EdChum
  • 376,765
  • 198
  • 813
  • 562
0

Here is a solution if your dataframe is not sorted:

In [1]: df.groupby('x')['y'].nlargest(2)
Out[1]: 
x     
101  3    4
     2    3
201  6    3
     5    2
405  8    2
     7    1
dtype: int64

Unfortunately nlargest cannot be applied to a grouped-by dataframe, so there is some reformatting to do.

IanS
  • 15,771
  • 9
  • 60
  • 84
  • You can do the same thing as I proposed, call `get_level_values(1)` on the index and use this to index back into the original df – EdChum Apr 28 '16 at 14:26