3

What sort of algorithm/solution could be used to indicate the similarity (overlap/precision/recall/...) of two sets of ranges.

I can think of (or find online) hundreds of similar problems but never exact, but surely this "wheel" must have been invented already...

Lets say that the input data is something like:

Real      [ ## ###  #     ] or [(1,2),(4,6),(9,10)]  
Predicted [ ## #          ] or [(1,2),(4,4)]  

Output should be ~50 %

Should I for example AND bitmaps, use interval trees or what? Is there a nice functional or simple to write algorithm? Any meaningful measure of similarity will do, and so will any reasonable input format.

Thank you.

(realistic length ~4000 with < 50 intervals in each set)

  • Fascinating. A couple of days ago, played a bit with this question, which more-or-less produces the _dissimilarity_. Maybe it will provide ides. http://stackoverflow.com/questions/40367461/intersection-of-two-lists-of-ranges-in-python/40371246 – Gene Nov 04 '16 at 22:20
  • I've seen that one. The solutions seem unreasonably complex and only get me halfway there. Since I have no input, output or time constraints I was hoping for a kind of "obviously correct" implementation. –  Nov 04 '16 at 22:48

3 Answers3

1

You could break the segments to individual points, and tag each point as real/predicted and start/end.

Then sort the points, go over the sorted list and keep track of overlaps.

You don't even need to track whether the interval was originally from Real or Predicted - you just need to keep track whether there's one or two intervals at each point.

Example:

Real      [(1,2),(4,6),(9,10)]  
Predicted [(1,2),(4,4)]  

Broken down into points and sorted (S for Start, E for End):

[(1,S),(1,S),(2,E),(2,E),(4,S),(4,S),(4,E),(6,E),(9,S),(10,E)]

Then going through the array - keep track of how many segments "are open" and count the length of total open and 2 segments open.

The result is 2 segments open/total open.

Alex Libov
  • 1,481
  • 2
  • 11
  • 20
  • How well would that work for a realistic case with intervals such as [(2000,3000)...]? –  Nov 04 '16 at 18:22
  • What matters is the number of intervals not the actual values. You just go over the starts and ends of ranges - not through the whole range... – Alex Libov Nov 04 '16 at 18:51
  • Would adding my solution code improve this answer? Or should I just explain what I ended up doing? –  Nov 05 '16 at 17:50
  • You can add the code. It might be helpful to someone on the future – Alex Libov Nov 05 '16 at 17:54
1

Despite your worry stated in comments that the interval intersection algorithm is complex, it's not really. Here's mine adapted to determine similarity by computing the size of the intersection rather than the actual intervals in it. It has a nice symmetry.

This algorithm is O(|a| + |b|) assuming the input intervals are already sorted.

def similarity(a, b):
  ia = ib = prevParity = unionLen = isectLen = 0
  while True:
    aVal = a[ia / 2][ia % 2] if ia < 2 * len(a) else None
    bVal = b[ib / 2][ib % 2] if ib < 2 * len(b) else None
    if not aVal and not bVal: break
    if not bVal or aVal < bVal or (aVal == bVal and ia % 2 == 0):
      parity = prevParity ^ 1
      val = aVal
      ia += 1
    else:
      parity = prevParity ^ 2
      val = bVal
      ib += 1
    if prevParity == 0: unionStart = val
    elif parity == 0: unionLen += val - unionStart + 1
    if parity == 3: isectStart = val
    elif prevParity == 3: isectLen += val - isectStart + 1
    prevParity = parity
  return (0.0 + unionLen - isectLen) / unionLen

print similarity(a, b)

Note this is computing the Jaccard index as proposed by @TimothyShields, but its runtime and space depend on the number of intervals, where his depends on the total size of the intervals.

Gene
  • 46,253
  • 4
  • 58
  • 96
0

You could use the Jaccard index to measure the similarity, also known as the "intersection over union." It is a number between 0 and 1 where 0 means "these two sets do not overlap at all" and 1 means "these two sets are identical."

In Python 3, it's easy to implement:

def jaccard(A, B):
    if A or B:
        return len(A & B) / len(A | B)
    else:
        return 1.0

The A and B are two sets of values. Although not theoretically optimal, the following approach might be plenty fast enough for your needs.

real = [(1,2), (4,6), (9,10)]  
predicted = [(1,2), (4,4)]
real_set = set(x for a, b in real for x in range(a, b + 1))
predicted_set = set(x for a, b in predicted for x in range(a, b + 1))
print(jaccard(real_set, predicted_set))

This will give you 0.5.

A more efficient algorithm for computing the intersection and union of the line segments does exist, wherein there is no intermediate conversion to an enumeration of the integer elements, but I would stick to this simpler approach unless you will have line segments (a,b) where b - a is a very large number.

Timothy Shields
  • 75,459
  • 18
  • 120
  • 173