test data
df = pd.DataFrame({
"start":[0,3,4,8,9],
"end":[4,6,7,9,10],
})
solution
melt the dataframe and sort
melted=df.reset_index().melt(id_vars="index").sort_values(["value", "variable"])
melted
looks like this
index variable value
0 0 start 0
1 1 start 3
5 0 end 4
2 2 start 4
6 1 end 6
7 2 end 7
3 3 start 8
8 3 end 9
4 4 start 9
9 4 end 10
The index
column corresponds to original row number, variable
column indicates the type of endpoint and value
column is the value of the endpoint
Now take a look at this calculation
(pd.get_dummies(melted["index"]).transpose()*melted["variable"].map({"start":1, "end":-1})).transpose()
the result is
0 1 2 3 4
0 1 0 0 0 0
1 0 1 0 0 0
5 -1 0 0 0 0
2 0 0 1 0 0
6 0 -1 0 0 0
7 0 0 -1 0 0
3 0 0 0 1 0
8 0 0 0 -1 0
4 0 0 0 0 1
9 0 0 0 0 -1
Each column corresponds to an interval (row in the original dataframe). Values of 1 correspond to a start point, -1 corresponds to an endpoint. Combine with a cumulative sum and we get a 0-1 matrix where the 1s correspond to the span of the interval.
interval_matrix=(pd.get_dummies(melted["index"]).transpose()*melted["variable"].map({"start":1, "end":-1})).transpose().cumsum()
interval_matrix
looks like this
0 1 2 3 4
0 1 0 0 0 0
1 1 1 0 0 0
5 0 1 0 0 0
2 0 1 1 0 0
6 0 0 1 0 0
7 0 0 0 0 0
3 0 0 0 1 0
8 0 0 0 0 0
4 0 0 0 0 1
9 0 0 0 0 0
Now two intervals overlap if, and only if, one of the start points is contained within the other interval. We can use this to get subset this dataframe by start points only. If we imagine each interval as a node in a graph with edges between intervals that overlap then the next piece of code is calculating directed edges.
start_overlaps = interval_matrix.loc[melted["variable"] == "start"]
start_overlaps
looks like this
0 1 2 3 4
0 1 0 0 0 0
1 1 1 0 0 0
2 0 1 1 0 0
3 0 0 0 1 0
4 0 0 0 0 1
Lastly add this to its transpose to 'make the arrows bidirectional' and convert to boolean
overlaps = (start_overlaps + start_overlaps.transpose()) > 0
overlaps
looks like this
0 1 2 3 4
0 True True False False False
1 True True True False False
2 False True True False False
3 False False False True False
4 False False False False True
summary
melted=df.reset_index().melt(id_vars="index").sort_values(["value", "variable"])
interval_matrix=(pd.get_dummies(melted["index"]).transpose()*melted["variable"].map({"start":1, "end":-1})).transpose().cumsum()
start_overlaps = interval_matrix.loc[melted["variable"] == "start"]
overlaps = (start_overlaps + start_overlaps.transpose()) > 0
A slightly faster solution with numpy
(np)
melted=df.melt(ignore_index=False).sort_values(["value", "variable"])
interval_matrix=np.multiply(pd.get_dummies(melted.index).values, np.expand_dims(melted["variable"].map({"start":1, "end":-1}).values,1)).cumsum(axis=0)
start_overlaps = interval_matrix[melted["variable"] == "start", :]
pd.DataFrame(start_overlaps + start_overlaps.transpose() > 0)