Finding intervals of a set that are overlapping

Question

So I have a set containing the endpoints of intervals. For example,

Set s = {(1,4),(3,7),(5,8),(14,17),(0,2),(11,14)}

I need a way to find out how many overlapping intervals are there. In the above case, the answer would be 5 since

(1,4) --> (3,7), (0,2)
(3,7) --> (5,8),(0,2)
(5,8) --> 
(14,17) --> (11,14)

I need an algorithm that takes O(N log N) time to find out the sum. Now if I sort all the starting points and apply the answer suggested here on every point Find number range intersection I get an O(N^2) solution. Any clue on what kind of data structure I might need in addition to a set? Thanks!

This is related to [interval tree](http://en.wikipedia.org/wiki/Interval_tree) — qxixp, Sep 30 '13 at 03:34

score 4 · Accepted Answer · answered Sep 30 '13 at 07:18

4

Build a list of values (a, -1), (b, 1) for every interval [a, b]. Now sorting these lets you run through the array, counting how many intervals are currently open at each endpoint, just by aggregating the +1 and -1s.

It's important that (b, 1) comes after (b, -1) in the sorted list; because we want to count the intersection even if the intersection is at a single point.

Complete code here.

def overlaps(intervals):
    es = []
    for a, b in intervals:
        es.append((a, -1))
        es.append((b, 1))
    es.sort()
    result = 0
    n = 0
    for a, s in es:
        if s == -1: result += n
        n -= s
    return result

Note, this is trivially O(n log n), and doesn't need any fancy data-structures.

answered Sep 30 '13 at 07:18

Paul Hankin

54,811
11
92
118

This is great!! Sorry, my previous comment was confused because you use `-1` to begin and `1` to end an interval. You should have used the other way around. – justhalf Sep 30 '13 at 07:26
Great, the simplest and the fastest one. But it should be `result += n - 1` because no need to count the interval with itself. EDIT: oh, sorry too. I was confused by -1/+1 too. `result += n` is correct – kasitan Sep 30 '13 at 07:31
1

I think the +1 and -1 are the right way round, and result += n is right. n is updated after result, so an interval never counts against itself. If you run it on the intervals given in the question it produces the correct answer 4. – Paul Hankin Sep 30 '13 at 07:33
@PaulHankin: Consider the intervals: `intervals = []; intervals.append((1, 10)); intervals.append((5, 20)); intervals.append((15, 25))`. Your code returns 2, but it should be 3 since all three intervals have some overlap, just not all at the same time. If you change the last interval to be `(12, 25)`, then your code returns 3 instead. – stackoverflowuser2010 Aug 03 '18 at 01:05

justhalf · Answer 2 · 2013-09-30T04:22:27.723

First I assume from your example that (0,1) and (1,2) overlap.

Btw, your example is incorrect, (3,7) does not overlap with (0,2)

One way to solve this is:

Sort the intervals based on the starting point
Iterate from lowest starting point to highest point
2a. Count the number of previous endpoints larger or equal than current starting point.
2b. Add one to the count of current end point.

Step 1 can be done in O(n log n)
Step 2 is iterating over all the intervals while doing the count. So it's O(n * X) where X is the complexity to do the counting. With Fenwick Tree, we can do that in O(log n)¹, so step 2 can be completed in O(n log n) also.

So the overall complexity is O(n log n).

Pseudocode:

def fenwick_tree.get(num):
    return the sum from counter[0] to counter[num]

def fenwick_tree.add(num, val):
    add one to counter[num]

intervals = [...]
sort(intervals) # using the starting point as the key
init_fenwick_tree()
sum = 0
count = 0
for (starting_point, end_point) in intervals:
    sum = sum + (count - fenwick_tree.get(starting_point-1))
    fenwick_tree.add(end_point,1)
return sum

Python code (only works when the interval points are non-negative integers):

MAX_VALUE = 2**20-1
f_arr = [0]*MAX_VALUE

def reset():
    global f_arr, MAX_VALUE
    f_arr[:] = [0]*MAX_VALUE

def update(idx,val):
    global f_arr
    while idx<MAX_VALUE:
        f_arr[idx]+=val
        idx += (idx & -idx)

def read(idx):
    global f_arr
    if idx <= 0:
        return 0
    result = 0
    while idx > 0:
        result += f_arr[idx]
        idx -= (idx & -idx)
    return result

intervals = [(1,4),(3,7),(5,8),(14,17),(0,2),(11,14)]
intervals = sorted(intervals, key=lambda x: x[0])
reset()
total = 0
for processed, interval in enumerate(intervals):
    (start, end) = interval
    total += processed - read(start-1)
    update(end, 1)
print total

Will print 4, which comes from these overlaps:

(0,2) - (1,4)
(1,4) - (3,7)
(3,7) - (5,8)
(11,14) - (14,17)

Note that we can't get the overlapping intervals, since in the worst case there will be O(n^2) overlaps, which cannot be printed in O(n log n) time.

¹_{Actually Fenwick tree does the counting in O(log M) time, where M is the largest possible number of distinct values in the interval. Note that M <= 2n since there can only be that many distinct values. So it's also correct to say that Fenwick Tree does the counting in O(log n) time.}

jacobm · Answer 3 · 2013-09-30T04:21:14.390

1

Quick idea: First sort your intervals. Now go through your intervals, adding each to a min-heap ordered by endpoint. For each interval, remove everything from the heap that is smaller than that interval's start point. Every endpoint remaining in the heap represents an interval that starts before this interval and that overlaps it, so increment an overlaps by the size of the heap. Now add the current interval to the heap and continue.

In pseudocode:

Sort(intervals)
(firstStart, firstEnd) = intervals[0]
currentIntervals = MinHeap()
overlaps = 0

for each (start, end) in intervals[1:]:
  remove from currentIntervals all numbers < start
  overlaps += Size(currentIntervals)
  HeapPush(end, currentIntervals)
return overlaps

I haven't tested this, but it seems at least plausible.

edited Sep 30 '13 at 04:21

answered Sep 30 '13 at 04:03

jacobm

13,790
1
25
27

It will only count at most one overlap per interval, where the OP wants to have all possible interval counts. Your algorithm will produce output 2 to the input `{(0,3),(0,2),(1,3)}`, where it should be 3. – justhalf Sep 30 '13 at 04:13
1

Ah, okay. Updated to use a heap rather than a simple counter. – jacobm Sep 30 '13 at 04:21
Looks good, although it is `O(n^2)` in worst case due to imbalanced heap. Say for this input: `(0,10), (1,11), (2,12), ..., (10, 20)` the heap will be linear, so the `HeapPush` operation will be `O(n)`, resulting in `O(n^2)` total. You can use self-balancing binary tree such as AVL tree. – justhalf Sep 30 '13 at 04:25
1

I don't think that's right. Binomial heaps can't be imbalanced, they're always as bushy as possible. – jacobm Sep 30 '13 at 04:27
Oops, yes, sorry, I mixed that up with binary search tree. – justhalf Sep 30 '13 at 05:45
Forgot to vote up. Your answer is the best. Every element is put into the heap exactly once, and removed at most once. So the maximum number of operations is `2*n`, each taking at most `log n` steps, so it's indeed `O(n log n)`. – justhalf Sep 30 '13 at 06:55

score 0 · Answer 4 · answered Sep 30 '13 at 07:00

This can simply be done using Greedy technique. just follow the following steps:

Sort the intervals based on the starting point Iterate from lowest starting point to highest point Count the number of previous endpoints larger or equal than current starting point. Increment the count of current end point.

score 0 · Answer 5 · answered Nov 03 '17 at 04:56

Paul's answer is really smart. I don't think I could come up with that idea if it is an interview. Here I have another version which is also O(nlog(n)).

import heapq

def countIntervals(s):
    s.sort()
    end = [s[0][1]]
    res, cur = 0, 0
    for l in s[1:]:
        if l[0]>heapq.nsmallest(1, end)[0]:
            heapq.heappop(end)
        cur = len(end)
        res += cur
        end.append(l[1])
    return res

We maintain a min heap that stores upcoming endpoints. Every time when a new interval comes in, we should compare its start point with the smallest end point so far.

If the start point is bigger, it means the smallest end point (represents that interval) will never cause more overlaps. Hence we pop it out.

If the start point is smaller, it means all the end points (corresponding intervals) are overlapping with the new coming interval.

Then the number of end points ("cur") is the number of overlaps that brings by this new coming interval. After we updated the result, we add the new coming interval's endpoint to the heap.

Finding intervals of a set that are overlapping

5 Answers5

Linked