25

Problem

Suppose I have two collections of intervals, named A and B. How would I find a difference (a relative complement) in a most time- and memory-efficient way?

Picture for illustration: enter image description here

Interval endpoints are integers ( ≤ 2128-1 ) and they are always both 2n long and aligned on the m×2n lattice (so you can make a binary tree out of them).

Intervals can overlap in the input but this does not affect the output (the result if flattened would be the same).

The problem is because there are MANY intervals in both collections (up to 100,000,000), so naïve implementations will probably be slow.

The input is read from two files and it is sorted in such a way that smaller sub-intervals (if overlapping) come immediately after their parents in order of size. For example:

[0,7]
[0,3]
[4,7]
[4,5]
[8,15]
...

What have I tried?

So far, I've been working on a implementation that generates a binary search tree while doing so aggregates neighbouring intervals ( [0,3],[4,7] => [0,7] ) from both collections, then traverses the second tree and "bumps out" the intervals that are present in both (subdividing the larger intervals in the first tree if necessary).

While this appears to be working for small collections, it requires more and more RAM to hold the tree itself, not to mention the time it needs to complete the insertion and removal from the tree.

I figured that since intervals come pre-sorted, I could use some dynamic algorithm and finish in one pass. I am not sure if this is possible, however.


So, how would I go about solving this problem in a efficient way?

Disclaimer: This is not a homework but a modification/generalization of an actual real-life problem I am facing. I am programming in C++ but I can accept an algorithm in any [imperative] language.

  • 3
    How'bout a simple linear one-pass search? Seems to need less memory, especially if you store only one pair of intervals at a time. –  Aug 09 '12 at 20:16
  • +1, @H2CO3. I feel like this problem is somewhat similar to [finding the intersection of two sorted arrays](http://stackoverflow.com/a/2400236/953482), so the linear search could possibly be applied here. – Kevin Aug 09 '12 at 20:23
  • @Kevin yes, basically I'm just thinking KISS and a b-tree seems a bit of an overhead. –  Aug 09 '12 at 20:24
  • @H2CO3 Yes, that's what I figured but the problem is that the intervals need not be exact matches but can merely be a sub-interval of one another. –  Aug 09 '12 at 20:25
  • Perhaps you could do a first pass where you discard or merge all sub-intervals. – Kevin Aug 09 '12 at 20:27
  • In the example you gave the blue intervals are always a subset of the red interval. Is that true always? Also it seems that if the intervals overlap then the smaller interval is a subset of the bigger interval. Is that always true? – VSOverFlow Aug 09 '12 at 20:28
  • @Tibor sorry then. Your example suggests that every interval's starting position matches. –  Aug 09 '12 at 20:29
  • @H2CO3 Updated a picture to better reflect that. –  Aug 09 '12 at 20:32

4 Answers4

10

Recall one of the first programming exercises we all had back in school - writing a calculator program. Taking an arithmetic expression from the input line, parsing it and evaluating. Remember keeping track of the parentheses depth? So here we go.

Analogy: interval start points are opening parentheses, end points - closing parentheses. We keep track of the parentheses depth (nesting). The depth of two - intersection of intervals, the depth of one - difference of intervals

Algorithm:

  • No need to distinguish between A and B, just sort all start points and end points in the ascending order
  • Set the parentheses depth counter to zero
  • Iterate through the end points, starting from the smallest one. If it is a starting point increment the depth counter, if it is an ending point decrement the counter
  • Keep track of intervals where the depth is 1, those are intervals of A and B difference. The intervals where the depth is two are AB intersections
Arik G
  • 484
  • 2
  • 5
  • This is a great solution! I will have to keep two counters for A and for B, because multiple intervals can overlap in both and I want them to count for one but otherwise I can calculate A ∩ B, A - B, B - A and A ∪ B, all in one pass! –  Aug 10 '12 at 06:10
  • I'd just like to add that depth != 0 gives the union. – Roman Reiner Jun 18 '14 at 07:49
  • 6
    Unless I'm misunderstanding this, this isn't the difference, it's just XOR. If there's an interval that doesn't exist in A, but does exist in B, depth of 1 will yield it. – Alex Beals Nov 13 '15 at 06:56
  • Yes, line segments where "A and not B" is not given by a simple depth = 1 depth. – MattArmstrong Dec 17 '22 at 00:18
8

Your intervals are sorted which is great. You can do this in linear time with almost no memory.

Start by "flattening" your two sets. That is for set A, start from the lowest interval, and combine any overlapping intervals until you have an interval set that has no overlaps. Then do that for B.

Now take your two sets and start with the first two intervals. We'll call these the interval indices for A and B, Ai and Bi.

Ai indexes the first interval in A, Bi the first interval in B.

While there are intervals to process do the following:

Consider the start points of both intervals, are the start points the same? If so advance the start point of both intervals to the end point of the smaller interval, emit nothing to your output. Advance the index of the smaller interval to the next interval. (That is if Ai ends before Bi, then Ai advances to the next interval.) If both intervals end in the same place, advance both Ai and Bi and emit nothing.

Is the one start point earlier than the other start point? If so emit the interval from the earlier start point to either a) the start of the later endpoint, or b) the end of the earlier end point. If you chose option b, advance the index of the eariler interval.

So for example if the interval at Ai starts first, you emit the interval from start of Ai to start of Bi, or the end of Ai whichever is smaller. If Ai ended before the start of Bi, you advance Ai.

Repeat until all intervals are consumed.

Ps. I assume you don't have spare memory to flatten the two interval sets into separate buffers. Do this in two functions. A "get next interval" function that advances the interval indices, which does the flattening as necessary, and feed flattened data to the differencing function.

Rafael Baptista
  • 11,181
  • 5
  • 39
  • 59
5

What you are looking for is a Sweep line algorithm. A simple logic should tell you when the Sweep line is intersecting an interval in both A and B and where it intersects only one set.

This is very similar to this problem. Just consider that you have a set of vertical lines passing through the end points of the B's segments.

This algorithm complexity is O((m+n) log (m+n)) which is the cost of the initial sort. The sweep line algorithm on a sorted set takes O(m+n)

Community
  • 1
  • 1
Samy Arous
  • 6,794
  • 13
  • 20
2

I think you should use boost.icl (Interval Container Library) http://www.boost.org/doc/libs/1_50_0/libs/icl/doc/html/index.html

#include <iostream>
#include <boost/icl/interval_set.hpp>

using namespace boost::icl;

int main()
{
  typedef interval_set<int> TIntervalSet;
  TIntervalSet intSetA;
  TIntervalSet intSetB;

  intSetA += discrete_interval<int>::closed( 0, 2);
  intSetA += discrete_interval<int>::closed( 9,15);
  intSetA += discrete_interval<int>::closed(12,15);

  intSetB += discrete_interval<int>::closed( 1, 2);
  intSetB += discrete_interval<int>::closed( 4, 7);
  intSetB += discrete_interval<int>::closed( 9,10);
  intSetB += discrete_interval<int>::closed(12,13);

  std::cout << intSetA << std::endl;
  std::cout << intSetB << std::endl;

  std::cout << intSetA - intSetB << std::endl;

  return 0;
}

this prints

{[0,2][9,15]}
{[1,2][4,7][9,10][12,13]}
{[0,1)(10,12)(13,15]}
  • 1
    Nice solution, I didn't know of the ICL. However, this seems to be operating in RAM, so it doesn't really scale to hundreds of millions of intervals. –  Aug 12 '12 at 12:42