3

I have a dataframe with start_dtm and end_dtm.

I need to flag all rows that have a [start_dtm,end_dtm] overlapping with any other row, and for each of the rows that have an overlap, also the index of the overlapping rows.

I do that with a for loop, but that is extremely inefficient.

def do_overlap(b,a):
    return a['start_dtm']<=b['end_dtm'] and a['end_dtm']>=b['start_dtm']

for _,r1 in df.iterrows():
    for _,r2 in df.iterrows():
          do_overlap(r1,r2)

Sample data:

      start_dtm     end_dtm
1   2018-02-01  2018-02-01
8   2018-02-14  2018-02-15
9   2018-03-01  2018-03-01
10  2018-04-13  2018-04-13
11  2018-04-20  2018-04-20
... ... ...
944676  2018-08-09  2018-08-10
944677  2018-09-06  2018-09-11
944678  2018-10-16  2018-10-17
944679  2018-10-31  2018-11-02
944680  2018-11-28  2018-11-29
00__00__00
  • 4,834
  • 9
  • 41
  • 89
  • Would this help? https://stackoverflow.com/questions/35459316/compare-each-row-with-all-rows-in-data-frame-and-save-results-in-list-for-each-r – Tõnis Piip Nov 01 '21 at 09:33
  • yes I could use that, but how to generate a dataframe with all row pairs to run an apply on? – 00__00__00 Nov 01 '21 at 09:55

1 Answers1

3

test data

df = pd.DataFrame({
    "start":[0,3,4,8,9],
    "end":[4,6,7,9,10],
})

solution

melt the dataframe and sort

 melted=df.reset_index().melt(id_vars="index").sort_values(["value", "variable"])

melted looks like this

   index variable  value
0      0    start      0
1      1    start      3
5      0      end      4
2      2    start      4
6      1      end      6
7      2      end      7
3      3    start      8
8      3      end      9
4      4    start      9
9      4      end     10

The index column corresponds to original row number, variable column indicates the type of endpoint and value column is the value of the endpoint

Now take a look at this calculation

    (pd.get_dummies(melted["index"]).transpose()*melted["variable"].map({"start":1, "end":-1})).transpose()

the result is

   0  1  2  3  4
0  1  0  0  0  0
1  0  1  0  0  0
5 -1  0  0  0  0
2  0  0  1  0  0
6  0 -1  0  0  0
7  0  0 -1  0  0
3  0  0  0  1  0
8  0  0  0 -1  0
4  0  0  0  0  1
9  0  0  0  0 -1

Each column corresponds to an interval (row in the original dataframe). Values of 1 correspond to a start point, -1 corresponds to an endpoint. Combine with a cumulative sum and we get a 0-1 matrix where the 1s correspond to the span of the interval.

interval_matrix=(pd.get_dummies(melted["index"]).transpose()*melted["variable"].map({"start":1, "end":-1})).transpose().cumsum()

interval_matrix looks like this

     0  1  2  3  4
  0  1  0  0  0  0
  1  1  1  0  0  0
  5  0  1  0  0  0
  2  0  1  1  0  0
  6  0  0  1  0  0
  7  0  0  0  0  0
  3  0  0  0  1  0
  8  0  0  0  0  0
  4  0  0  0  0  1
  9  0  0  0  0  0

Now two intervals overlap if, and only if, one of the start points is contained within the other interval. We can use this to get subset this dataframe by start points only. If we imagine each interval as a node in a graph with edges between intervals that overlap then the next piece of code is calculating directed edges.

start_overlaps = interval_matrix.loc[melted["variable"] == "start"]

start_overlaps looks like this

   0  1  2  3  4
0  1  0  0  0  0
1  1  1  0  0  0
2  0  1  1  0  0
3  0  0  0  1  0
4  0  0  0  0  1

Lastly add this to its transpose to 'make the arrows bidirectional' and convert to boolean

overlaps = (start_overlaps + start_overlaps.transpose()) > 0

overlaps looks like this

       0      1      2      3      4
0   True   True  False  False  False
1   True   True   True  False  False
2  False   True   True  False  False
3  False  False  False   True  False
4  False  False  False  False   True

summary

melted=df.reset_index().melt(id_vars="index").sort_values(["value", "variable"])
interval_matrix=(pd.get_dummies(melted["index"]).transpose()*melted["variable"].map({"start":1, "end":-1})).transpose().cumsum()
start_overlaps = interval_matrix.loc[melted["variable"] == "start"]
overlaps = (start_overlaps + start_overlaps.transpose()) > 0

A slightly faster solution with numpy (np)

melted=df.melt(ignore_index=False).sort_values(["value", "variable"])
interval_matrix=np.multiply(pd.get_dummies(melted.index).values, np.expand_dims(melted["variable"].map({"start":1, "end":-1}).values,1)).cumsum(axis=0)
start_overlaps = interval_matrix[melted["variable"] == "start", :]
pd.DataFrame(start_overlaps + start_overlaps.transpose() > 0)
Riley
  • 2,153
  • 1
  • 6
  • 16
  • I like this answer a lot, very fast. – 00__00__00 Nov 01 '21 at 13:04
  • I just have one issue here: overlaps = (start_overlaps + start_overlaps.transpose()) > 0 For me,.start_overlaps has different column and index names; when I sum I get a all NaN matrix – 00__00__00 Nov 01 '21 at 13:05
  • Can you post code for reconstructing your data? The solution might assume your original dataframe has a range index. I'll also update with another solution that might avoid the issue altogether. – Riley Nov 01 '21 at 13:20
  • I believe I have solved by changing melted=df.reset_index(drop=True).reset_index() – 00__00__00 Nov 01 '21 at 13:23
  • and: (start_overlaps.reset_index(drop=True) + start_overlaps.reset_index(drop=True).transpose()) – 00__00__00 Nov 01 '21 at 13:23
  • I believe this makes the answer robust to non-range index cases – 00__00__00 Nov 01 '21 at 13:24