3

I have a huge database table with n integer intervals (for instance {1-5}, {4-16}, {6434-114343}) and need to find out which intervals overlap each other. There's a wealth of similar questions on SO, but the difference is I need returned, for each interval respectively, the set of overlapping intervals.

      ------------------ A -------------------
    ------ B -------               ----- D -----
          --------- C --------- 

For this example, output would be A:{B,C,D} B:{A,C} C:{A,B} D:{A}

Worst case, all intervals could overlap each other, producing an output of size O(n2). This is no better than the naïve solution (compare every pair of intervals). In practice however, I know few of my intervals will overlap other intervals, and when they do, only up to 5 other intervals.

Given this info, how should I solve the problem? (Optimally, I would like a SQL query solution since the data is in the database, but I think only a regular algorithmic solution is possible.)

Community
  • 1
  • 1
Gruber
  • 4,478
  • 6
  • 47
  • 74
  • 1
    Maybe you should specify what "huge" means in this context. Thousands? Millions? Billions? And if there really exists a pure SQL solution (I have my doubts), you may want to tell us how the data is stored in the database, e.g. do you have separate columns for range name/ID, start and end of the interval, or are start and end stored as a string value "x-y" for example. It may also be interesting to know the range of the numeric values, e.g. what is the smallest/biggest interval start/end to expect? – Mecki Jan 08 '13 at 10:25
  • @Mecki: In this case "huge" means n=100,000. In the database, each interval has a unique primary key integer value, a start integer, as well as an end integer. The numbers range from 0 to 4*10^9. – Gruber Jan 08 '13 at 10:28

2 Answers2

8

The typical programmatic solution for your problem is building an interval tree out of all ranges and then performing one lookup per interval, which gives you a list of all intersecting intervals in O(log n) time. Here's a sample what such an interval tree looks like:

Interval Tree Sample

In you case, though, you would store the primary keys in the tree nodes as well, so given the following dates (finding overlapping dates is a common problem that can be solved with interval trees)

Sample Date Intervals

your tree would look like this

Sample Tree for Date Intervals

So if I want to know which intervals overlap with C, I search for the start of C, 1843, and the tree tells me, that this value is only within the interval C, which is the interval I'm testing for, so I can ignore it. Then I search for the end of C, 1907, and the tree tells me, that it is in the intervals A, B, and C, again I can ignore C, and thus my result set is A and B.

I admit, the lookup in such a tree is not that intuitive as one might expect. I'll try to explain it here as good as I can: You start at the top root node and at each node decide to either go left or go right until you hit a leave node (a node that has no children any more). If the node value is bigger than the value you are searching for, you go left. If the node value is smaller than the value you are searching for, you go to the right. What if you the node value equals exactly the value you are searching for? It depends! If you are searching for the start of an interval, equal value means you go right, if you search for the end of an interval, equal value means you go left. This is very important. Once you reached a leave node, you are done and all intervals you found in any node on your way to that leave node, including the interval stored in the leave node itself (if any) make up your result set, not only the interval stored in the leave node. That means you must collect any intervals you come across while performing your search.

Now back to your initial question: Can all this be done in SQL? Yes, it can be done. I'm not sure if you really want to do it, though. You can transform your current SQL table data into a SQL table that represents an interval tree and then directly perform the lookups in that interval tree table. At least I found someone who did exactly that. He tries to find all date ranges the cover a given date without having to compare the date against all existing ranges in the database:

A Static Relational Interval Tree

He even uses a nifty trick to optimize lookups for speed, cutting down CPU usage dramatically for both, building the lookup table and performing the actual lookups (which makes the whole thing quite complicated).

Mecki
  • 125,244
  • 33
  • 244
  • 253
  • Thanks for a great answer. I suspected an interval tree would be involved in the solution somehow, and to me it seems like your solution is correct. Implementation-wise, it is more complex than maxim1000's solution however and I'm not sure how they compare in practice concerning performance. Whereas SQL is concerned, it sounds like a really complex thing to attempt! – Gruber Jan 08 '13 at 13:22
  • 1
    @Gruber Well, to be honest I was wondering anyway if a brute force solution in SQL would be really that bad to begin with. You need to run 100'000 SQL queries for such a solution, however if you need to only check these intervals once a day (or even less often), that is really not a problem. Also your SQL server may be able to perform more than 1000 such queries within a single second if it has "enough power" ;-) – Mecki Jan 08 '13 at 17:10
  • Brute-forcing in general is counter to my sense of aesthetics :) but as I think about it, it probably is the least complex and failsafe approach. Probably fast too, since I know overlaps are rare. Very little code and easy to understand; I only need to check each interval endpoint value to find intersections: `SELECT * FROM myTable where beginPoint <= @myValue and endPoint >= @myValue` – Gruber Jan 09 '13 at 07:07
  • I've implemented the brute force algorithm. On my input, it took 1 hour. I also implemented the solution of maxim1000; while more complex it took only about 1 minute. – Gruber Jan 10 '13 at 13:55
  • @Mecki: You write that we can find all intersections in `O(log n)` time. Shouldn't it be `m*O(log n)` time where `m` is the number of intersections? I think getting the first intersection is a `O(log n)` operation. – rookie May 19 '15 at 13:42
  • 1
    @rookie It's not `m*O(log n)`, it's actually `O(log m+n)` (n is the number of intervals and m the number of reported results, see http://tinyurl.com/pc5zmsx), but that is basically the same as `O(log n)` when you look at it in a coarse context. Further the big-O notation tries to ignore factors that depend on the **kind** of data, big-O tries to express complexity in regards to the **amount** of data (number of entries in list/tree, not how many of these may overlap). `O(1)` is `O(1)` and not `O(100)`, even if the operation can take 100 times longer for certain kind of data than for other – Mecki May 19 '15 at 20:05
  • @Mecki: I see, thank you. Can you explain what you mean by coarse context (do you mean large `n`)? – rookie May 20 '15 at 14:48
  • @rookie You don't have to use "@Mecki", answer owner is always informed about new comments :) With "coarse" I mean that Big-O notation only cares for something being constant, logarithmic, linear, quadratic and so on. See also http://stackoverflow.com/a/5872270/15809 – Mecki May 21 '15 at 15:58
  • Didn't know that -- my bad! I guess I had thought that in the worst case you'd find all `m` intersections at `log n` depth. Do you have an intuitive explanation for the `log m+n`? Maybe it's best if I look at the proof. – rookie May 21 '15 at 16:24
2

Build a sorted sequence of interval starts and ends, then traverse it, every time updating list of current intervals, report any new found intersection.

Something like this:

std::vector<TBorder> borders;
for(auto i=intervals.begin();i!=intervals.end();++i)
{
    borders.push_back(TBorder(i.Start(),Start));
    borders.push_back(TBorder(i.End(),End));
}
std::sort(borders.begin(),borders.end());
std::set<int> currentIntervals;
for(auto b=borders.begin();b!=borders.end();++b)
{
    if(b.IsEnd())
        currentIntervals.erase(b.IntervalIndex());
    else
    {
        currentIntervals.insert(b.IntervalIndex());
        if(currentIntervals.size()>1)
            ReportIntersection(currentIntervals);
    }
}

Generally it's O(n*log n) (assuming number of intersections is O(1) ).

But if you already have intervals sorted by e.g. start, likely sorting can be done in O(n) (again assuming number of intersection is O(1)).

maxim1000
  • 6,297
  • 1
  • 23
  • 19
  • I've implemented your solution in a database stored procedure, using temporary tables and cursors for looping. The algorithm runs really fast, only 50s for n=100,000. – Gruber Jan 10 '13 at 14:03