
Consider the following DataFrame

>>> df
   Start  End  Tiebreak
0      1    6  0.376600
1      5    7  0.050042
2     15   20  0.628266
3     10   15  0.984022
4     11   12  0.909033
5      4    8  0.531054

Whenever the [Start, End] intervals of two rows overlap, I want the row with the lower tiebreak value to be removed. The result for the example would be

>>> df
   Start  End  Tiebreak
2     15   20  0.628266
3     10   15  0.984022
5      4    8  0.531054

I have a double-loop which does the job inefficiently and was wondering whether there exists an approach which exploits built-ins and works columnwise.

import pandas as pd

# initial data (tiebreak values fixed so the example output is reproducible)
df = pd.DataFrame({
    'Start': [1, 5, 15, 10, 11, 4],
    'End': [6, 7, 20, 15, 12, 8],
    'Tiebreak': [0.376600, 0.050042, 0.628266, 0.984022, 0.909033, 0.531054]
})

# checking for overlaps
list_idx_drop = []
for i in range(len(df) - 1):
    for j in range(i + 1, len(df)):
        idx_1 = df.index[i]
        idx_2 = df.index[j]
        if idx_1 in list_idx_drop or idx_2 in list_idx_drop:
            continue

        cond_1 = (df.loc[idx_1, 'Start'] < df.loc[idx_2, 'End'])
        cond_2 = (df.loc[idx_2, 'Start'] < df.loc[idx_1, 'End'])

        # if the rows overlap, mark the one with the lower tiebreak for removal
        if cond_1 and cond_2:
            if df.loc[idx_1, 'Tiebreak'] < df.loc[idx_2, 'Tiebreak']:
                list_idx_drop.append(idx_1)
            else:
                list_idx_drop.append(idx_2)

# delete the marked rows all at once (dropping inside the loop shifts the
# positions that df.index[i] relies on)
df = df.drop(list_idx_drop)
clueless

2 Answers


You could sort by End in descending order and flag rows whose End is greater than the previous row's Start. A False in that flag marks the start of a new group, so a cumulative sum of the negated flag gives group labels. Then sort by Tiebreak and drop duplicates on the group column.

import pandas as pd

df = pd.DataFrame({'Start': {0: 1, 1: 5, 2: 15, 3: 10, 4: 11, 5: 4},
                   'End': {0: 6, 1: 7, 2: 20, 3: 15, 4: 12, 5: 8},
                   'Tiebreak': {0: 0.3766, 1: 0.050042, 2: 0.628266,
                                3: 0.984022, 4: 0.909033, 5: 0.531054}})

df = df.sort_values(by='End', ascending=False)

df['overlap'] = df['End'].gt(df['Start'].shift(fill_value=0))
df['group'] = df['overlap'].eq(False).cumsum()

df = df.sort_values(by='Tiebreak', ascending=False)
df = df.drop_duplicates(subset='group').drop(columns=['overlap', 'group']).sort_index()

print(df)

Output

   Start  End  Tiebreak
2     15   20  0.628266
3     10   15  0.984022
5      4    8  0.531054
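If it helps to see how the groups form, this sketch (same data) stops just before the `drop_duplicates` step and prints the intermediate `overlap` and `group` columns. Rows 3 and 5 each start a new group because their End does not exceed the previous row's Start:

```python
import pandas as pd

df = pd.DataFrame({
    'Start': [1, 5, 15, 10, 11, 4],
    'End': [6, 7, 20, 15, 12, 8],
    'Tiebreak': [0.3766, 0.050042, 0.628266, 0.984022, 0.909033, 0.531054],
})

# same steps as above, but keeping the helper columns for inspection
tmp = df.sort_values(by='End', ascending=False)
tmp['overlap'] = tmp['End'].gt(tmp['Start'].shift(fill_value=0))
tmp['group'] = tmp['overlap'].eq(False).cumsum()
print(tmp[['Start', 'End', 'overlap', 'group']])
```

This yields groups {2}, {3, 4}, and {5, 1, 0}; the subsequent Tiebreak sort then keeps rows 2, 3, and 5.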
Chris
1

You can sort the values by Start, compute a cumulative max of End, form groups of overlapping intervals (a new group starts whenever a Start is at or beyond the running max End), and keep the row with the highest Tiebreak per group with groupby.idxmax:

keep = (df
   .sort_values(by=['Start', 'End'])
   .assign(max_End=lambda d: d['End'].cummax(),
           group=lambda d: d['Start'].ge(d['max_End'].shift()).cumsum())
   .groupby('group', sort=False)['Tiebreak'].idxmax()
)

out = df[df.index.isin(keep)]

Output:

   Start  End  Tiebreak
2     15   20  0.628266
3     10   15  0.984022
5      4    8  0.531054
Logic as image

The logic is to move left to right and start a new group whenever there is a "jump" (no overlap). In the figure below, solid lines are the intervals (bold marks the greatest Tiebreak in each group) and dotted lines are the cummax of End.

[figure: intervals drawn left to right, split into groups at each jump]

Intermediates:

   Start  End  Tiebreak  max_End  group
0      1    6  0.376600        6      0
5      4    8  0.531054        8      0
1      5    7  0.050042        8      0
3     10   15  0.984022       15      1  # 10 ≥ 8
4     11   12  0.909033       15      1
2     15   20  0.628266       20      2  # 15 ≥ 15
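As a sanity check, the sketch below reruns the cummax approach on the same data and asserts that no two surviving intervals satisfy the question's strict-inequality overlap test:

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'Start': [1, 5, 15, 10, 11, 4],
    'End': [6, 7, 20, 15, 12, 8],
    'Tiebreak': [0.3766, 0.050042, 0.628266, 0.984022, 0.909033, 0.531054],
})

keep = (df
    .sort_values(by=['Start', 'End'])
    .assign(max_End=lambda d: d['End'].cummax(),
            group=lambda d: d['Start'].ge(d['max_End'].shift()).cumsum())
    .groupby('group', sort=False)['Tiebreak'].idxmax()
)
out = df[df.index.isin(keep)]

# every surviving pair must fail the overlap test
# (Start_i < End_j and Start_j < End_i)
for (s1, e1), (s2, e2) in combinations(zip(out['Start'], out['End']), 2):
    assert not (s1 < e2 and s2 < e1)

print(out.index.tolist())  # [2, 3, 5]
```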
mozway