4

I'm going to ask this question using Scala syntax, even though the question is really language independent.

Suppose I have two lists

val groundtruth:List[Range]
val testresult:List[Range]

And I want to find all of the elements of testresult that overlap some element in groundtruth.

I can do this as follows:

def overlaps(x:Range,y:Range) = (x contains y.start) || (y contains x.start)
val result = testresult.filter{ tr => groundtruth.exists{gt => overlaps(gt,tr)}}

But this takes O(testresult.size * groundtruth.size) time to run.

Is there a faster algorithm for computing this result, or a data structure that can make the exists test more efficient?


P.S. The algorithm should work on groundtruth and testresult generated with an expression like the following. In other words, there are no guarantees about the relationships between the ranges within a list, the Ranges have an average size of 100 or larger.

(1 to 1000).map{x =>
   val midPt = r.nextInt(100000);
   ((midPt - r.nextInt(100)) to (midPt + r.nextInt(100)));
}.toList
Ken Bloom
  • 57,498
  • 14
  • 111
  • 168

3 Answers3

9

Try an interval tree. Cormen, Leiserson, Rivest and Stein discuss these in (IIRC) chapter 14.

Alternatively, if your lists of intervals are both sorted and the intervals within a list are non-overlapping, then the following algorithm solves your problem in linear time and in a single pass over both lists:

(define interval cons)
(define lower car)
(define upper cdr)

(define (overlap a b)
  (cond ((or (null? a) (null? b)) '())
        ((< (upper a) (lower b))
         (overlap (cdr a) b))
        ((> (lower a) (upper b))
         (overlap a (cdr b)))
        (#t  ;; (car a) and (car b) overlap
             ;; EDIT: there's a bug in the following part.
             ;; The code shouldn't skip over both cars at once,
             ;; since they may also overlap with further intervals.
             ;; However, I'm too tired to fix this now.
         (cons (interval (max (lower a) (lower b))
                         (min (upper a) (upper b)))
               (overlap a b)))))

(I hope you can read Scheme :)

Fred Foo
  • 355,277
  • 75
  • 744
  • 836
  • The interval tree looks perfect. I'll go implement it when I get home. – Ken Bloom Apr 14 '11 at 19:26
  • @Ken: note that the linear time algorithm works on interval trees with some minor adjustments, since those do the sorting for you, but it might be easier to sort the intervals in a smart way than to produce a balanced interval search tree. – Fred Foo Apr 14 '11 at 22:02
0

If you could sort the groundtruth list by range beginning value, then for each range in testresult you can do a binary search to get the subset of ranges whose low bound is less than or equal to the range in question. Then you need to sequentially search that subset for those whose high bound is greater than or equal to the high bound of the range you're testing.

The worst case is still O(n^2), since it's possible that all of the groundtruth ranges would have a low bound that meets the criteria, but the run time with actual data would likely be much less.

Jim Mischel
  • 131,090
  • 20
  • 188
  • 351
0

If groundtruth were stored in a hashed set, checking the presence of members of test result in it would be O(n).

Edit: I didn't realise you were working solely with ranges represented by their endpoints. D'oh!

Some sort of tree-based structure has got to be your best bet.

Marcin
  • 48,559
  • 18
  • 128
  • 201
  • 1
    The "overlaps" relation isn't an [equivalence relation](http://en.wikipedia.org/wiki/Equivalence_relation), so it doesn't lend itself constant time lookup in a hash table. – Ken Bloom Apr 15 '11 at 02:38
  • Correct, but equality of integers is, so you can look up the individual integers in testresult in the groundtruth hashset. The total cost of the operation would be linear in the size of testresult. – Marcin Apr 15 '11 at 12:00
  • So you're suggesting `val ranges=new scala.collection.mutable.HashSet[Int]; for ( gt <- groundtruth; point <- gt) ranges += point; val result = testresult.filter{ tr => tr.exists{point => ranges(point)}}` That can work well when the ranges are short, but it wastes a lot of time when the ranges are long (i.e average size is 100). – Ken Bloom Apr 15 '11 at 15:47
  • I'm pretty foggy on what your scala code is supposed to mean. I hadn't realised that your test results were stored as ranges. – Marcin Apr 15 '11 at 16:34