
I have a df of the form (just a sample):

Name     Event    Score
Joe      1        0.9
Tim      8        0.4
Ben      3        0.7
Ben      9        0.1
Tim      2        0.6
Ben      5        1.1
Joe      2        1.2
Tim      6        0.9
Tim      3        0.7
Joe      7        0.6
Joe      6        0.8
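For reference, the sample frame above can be rebuilt as follows (a minimal sketch; values copied from the table):

```python
import pandas as pd

# rebuild the sample frame from the table above
df = pd.DataFrame({
    'Name':  ['Joe', 'Tim', 'Ben', 'Ben', 'Tim', 'Ben', 'Joe', 'Tim', 'Tim', 'Joe', 'Joe'],
    'Event': [1, 8, 3, 9, 2, 5, 2, 6, 3, 7, 6],
    'Score': [0.9, 0.4, 0.7, 0.1, 0.6, 1.1, 1.2, 0.9, 0.7, 0.6, 0.8],
})
```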

My first step was to groupby and select the 2 smallest scores for each person. I was able to do this with the following code:

# two smallest scores per name; resetting the 'Name' level keeps the
# original row index, which the filter below relies on
top_2 = df.groupby('Name')['Score'].apply(lambda x: x.nsmallest(2)).reset_index('Name')
df = df[df.index.isin(top_2.index)].reset_index(drop=True)

Which gives the results:

Name     Event    Score
Tim      8        0.4
Ben      3        0.7
Ben      9        0.1
Tim      2        0.6
Joe      7        0.6
Joe      6        0.8

However, now I would like to make sure that each person does not have consecutive events. For example, Joe has events 6 and 7. Instead, I would like to keep event 7 (its score of 0.6 is lower than event 6's 0.8) and then event 1 (Joe's next-smallest score, so that he still has 2 events).

I believe I will need to do this as a step prior to my current code but I'm unsure how to approach it.

A couple of notes:

  • Events are not repeated for a person
  • Names are unique
  • The number of events per person in the raw data can be any number greater than 1

2 Answers


A little bit of work here: create subgroups with diff and cumsum (consecutive events fall into the same group), select the minimum score within each subgroup, then keep the two smallest of those per name.

df = df.sort_values(['Name', 'Event'])
# label runs of consecutive events: a gap larger than 1 starts a new subgroup
s = df.groupby('Name')['Event'].apply(lambda x: x.diff().ne(1).cumsum())
# lowest score within each run
newdf = df.groupby([df.Name, s]).apply(lambda x: x.nsmallest(1, 'Score'))
# two lowest of those per name
newdf = newdf.sort_values('Score').groupby(newdf['Name']).head(2).reset_index(drop=True)
newdf
  Name  Event  Score
0  Ben      9    0.1
1  Tim      8    0.4
2  Joe      7    0.6
3  Tim      2    0.6
4  Ben      3    0.7
5  Joe      1    0.9
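The core of this is the `diff().ne(1).cumsum()` labelling: after sorting, any gap larger than 1 between successive event numbers starts a new run. A standalone illustration on hypothetical values:

```python
import pandas as pd

events = pd.Series([1, 2, 6, 7, 9])
# diff() -> NaN, 1, 4, 1, 2; ne(1) flags every break (and the first row);
# cumsum() turns the flags into run labels
labels = events.diff().ne(1).cumsum()
print(labels.tolist())  # [1, 1, 2, 2, 3]
```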

Drop consecutive duplicates as explained here: Pandas: Drop consecutive duplicates. This is the preprocessing step.

import numpy as np

a = df['Name']

# mask rows whose neighbour (above or below) has the same name
mask = (a.shift(-1) == a) | (a.shift(1) == a)
# mask the head of each duplicated group, including 1-element groups
mask_head = (mask & (a.shift(1) != a)) | ~mask
df['dupl_id'] = np.nan
# set the id of each group to the index of its head
df.loc[mask_head, 'dupl_id'] = df[mask_head].index
# propagate the same id through the rest of the group
df.loc[mask, 'dupl_id'] = df.loc[mask, 'dupl_id'].ffill()

# within each group keep the row with the lowest score
df = df.sort_values(['Name', 'Score']).groupby('dupl_id').apply(lambda g: g.drop_duplicates('Name', keep='first'))
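After this preprocessing, the top-2 selection from the question applies unchanged. A self-contained sketch, assuming a hypothetical frame in which each run of consecutive events has already been reduced to its best-scoring row:

```python
import pandas as pd

# hypothetical post-preprocessing frame: one best-scoring row per run
df = pd.DataFrame({
    'Name':  ['Ben', 'Ben', 'Ben', 'Joe', 'Joe', 'Tim', 'Tim', 'Tim'],
    'Event': [3, 5, 9, 1, 7, 2, 6, 8],
    'Score': [0.7, 1.1, 0.1, 0.9, 0.6, 0.6, 0.9, 0.4],
})

# the question's original two-smallest-per-name selection
top_2 = df.groupby('Name')['Score'].apply(lambda x: x.nsmallest(2)).reset_index('Name')
out = df[df.index.isin(top_2.index)].reset_index(drop=True)
print(out)
```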