0

My question is very similar to one posted here: Determine mean value of ‘data’ where the highest number of CONTINUOUS cond=True

which has been beautifully solved by @Divakar. However, i have slightly different need. Instead of the highest maximum value, how can i get -say the 2nd highest, 3rd highest, etc. ranking?

One workaround is to reduce the original array successively with values that have been ranked and run @Divakar solution iteratively, but i was wondering if anybody has a more efficient solution

Community
  • 1
  • 1
Siraj S.
  • 3,481
  • 3
  • 34
  • 48

2 Answers2

0

Instead of

res[res['count'] == res['count'].max()]

You can use

res[res['count'] == res['count'].sort_values().unique()[-n]]

For whatever n you want.

maxymoo
  • 35,286
  • 11
  • 92
  • 119
  • This doesn't answer the question specifics. It is asking for only those groups of rows that have continuous straks of True values. – Ted Petrou Dec 16 '16 at 03:58
  • @TedPetrou I was being lazy and just telling him what he needs to change in the linked question – maxymoo Dec 16 '16 at 04:51
0

You need to attach a group number to each continuous set of rows. Some tricky logic will show that the following is accomplished with the following statement. The logic is based on using diff to find each occurrence of a 1 preceded by a 0.

# create fake data
import pandas as pd
import numpy as np

np.random.seed(2)
df = pd.DataFrame({'cond':np.random.choice([True, False], 100), 
                   'data':np.random.rand(100)})

print(df.head(15))

     cond      data
0    True  0.544208
1   False  0.082095
2   False  0.366342
3    True  0.850851
4    True  0.406275
5   False  0.027202
6    True  0.247177
7   False  0.067144
8    True  0.993852
9   False  0.970580
10   True  0.800258
11  False  0.601817
12  False  0.764960
13  False  0.169225
14  False  0.293023

# add a group column
df['group'] = ((np.where(df.cond.diff() > 0, 1, 0) * df.cond).cumsum() +
                df.cond.iloc[0]) * df.cond

print(df.head())

     cond      data  group
0    True  0.544208      1
1   False  0.082095      0
2   False  0.366342      0
3    True  0.850851      2
4    True  0.406275      2
5   False  0.027202      0
6    True  0.247177      3
7   False  0.067144      0
8    True  0.993852      4
9   False  0.970580      0
10   True  0.800258      5
11  False  0.601817      0
12  False  0.764960      0
13  False  0.169225      0
14  False  0.293023      0

# get answer with grouping and sorting
df_final = df.query('group != 0').groupby('group')['data']\
                                 .agg(['count', 'mean'])\
                                 .sort_values('count', ascending=False)

print(df_final.head())
       count      mean
group                 
18        10  0.529819
11         8  0.630232
10         4  0.301558
16         4  0.215376
6          4  0.563013

# get result with .iloc
# if you care about ties you can select those rows that all have
# the third highest count
df.iloc[[2]]
       count      mean
group                 
10         4  0.301558
Ted Petrou
  • 59,042
  • 19
  • 131
  • 136