1

I am having a problem finding all overlapping ranges in two lists efficiently.
This problem is similar to This question, but with different inputs.

I have 2 input files, one that contains many lines of range and data pairs, and another that contains a list of ranges to find the intersections to.

I already wrote a file reader class that reads from the data file, returning objects, one at a time, that hold a list of range and data pairs, but am running into trouble when I try to find the overlaps of the two range lists.

Currently what I am doing is brute forcing it, comparing every range in the data list to every other range in the intersection list, but because the data file is very large, it is taking a long time.

Sample Objects:
This is the object in the data list:

public DataModel {
    private int start; {set; get;}
    private int end; {set; get;}
    //Other Data
}

The range Model is just a list of paired integers (start, end).

while (fileParser.hasNext()) {
    dataList = fileParser.next();
    for (DataModel data : dataList)
        for (RangeModel range : rangeList)
            if(overlaps(data, range))
                print(range.getString + " " + data.getString);
}

Edit for clarity:

The DataModel is given in smaller packets of similar ranges of varying length, but they are mostly under 20, so the comparison will be run repeatedly on the same RangeModel and each new DataModel. The total ranges in all the data is around 2 billion, but it doesn't really matter. Thanks for the help.

Community
  • 1
  • 1
EAKAE
  • 153
  • 1
  • 11
  • this may help http://stackoverflow.com/questions/5283047/intersection-union-of-arraylists-in-java – RNJ Jan 16 '13 at 20:55
  • Just mix all the ranges together, with some marker to indicate whether it is data or the range to find overlap. – nhahtdh Jan 16 '13 at 20:57
  • 1
    Is there any guarantee of logical order for each record in the file? For example, a pair of logically sorted Lists could be more easily and efficiently merged at intersections by iterating both inputs at the same time. – jerluc Jan 16 '13 at 20:58
  • There is no guarantee of order, but they will all be in relative order. The fileParser can return two lists one after the other that overlap, but the lists are sorted by start range of the union of all ranges in the list. – EAKAE Jan 16 '13 at 21:11
  • 1
    So DataModel and RangeModel are the same? How many ranges are there in BOTH files in total? – nhahtdh Jan 16 '13 at 21:17
  • the file has over 2 million ranges, but the parser returns approximately 15-30 at a time in a list. The range data will probably have less that 25 ranges, but there is no fixed limit. – EAKAE Jan 16 '13 at 21:21
  • Have you profiled your application to see where most of the time is being spent. If the range list is extremely small, and your dataList is large, its possible that you are spending most of your time on disk access and I/O, rather than on computation. – JohnnyO Jan 16 '13 at 21:33
  • Have you considered an [IntervalTree](http://stackoverflow.com/a/11189080/823393)? – OldCurmudgeon Jan 16 '13 at 22:00
  • I'm embarrassed to say that for a test file that contains about 1/20 of the data, the times it takes for varying lengths of the range list is almost negligible, for test ranges less than 500, which is much higher than it should be. Thanks for the advice. I will put it to use in some of my other projects. – EAKAE Jan 16 '13 at 23:18
  • @OldCurmudgeon: I know it can do stabbing query, but in this case, there is a whole range. Can it test on range? – nhahtdh Jan 16 '13 at 23:31
  • Couldn't you just stab both ends of the range? Hmm ... perhaps not ... well the code at the link is a working IntervalTree. I feel you could write a query to stab with an interval. If you feel it worthwhile I'll see what I can come up with. – OldCurmudgeon Jan 16 '13 at 23:45
  • Adding a query to the IntervalTree for intervals would work well, and it would reduce the number of comparisons greatly. – EAKAE Jan 16 '13 at 23:56

4 Answers4

1

I can think of different optimizations, but they depend on what kind of data you want available after the check.

Sorting both the data and the ranges and processing them in order provides an instant performance improvement, since it makes no sense to test a range starting in 100 against another one ending in 50.

Another improvement would be to 'compress' the ranges. If you have ranges like (1-10), (10-20), (20-30), then you could easily replace them with a single (1-30) range, and reduce the number of tests. You can create an appropiate AggregateRange class that keeps track of the identities of its composing ranges in case you still want to know which original range is causing the overlap.

Yet another improvement would be to smartly use the previous results as you process the data list. For example: Suppose you test data range (1-10) and it happens to not overlap. Were the next test data range be (2-8), you should not need to test it against the ranges, since your previous result guarantees that it will not overlap.

The basic idea behind this improvement would be to advance the start of any untested data ranges up to and including the end of the last non-overlaping data range. If the new start surpasses its own end, then no testing is required as it does not overlap. This means non-overlaping (1-20) should transform an untested (10-100) into an untested (20-100). This may be trickier to implement, so be careful not to overdo it.

  • If the ranges can each list can overlap then it's harder to sort them. – Neil Jan 16 '13 at 21:05
  • Not at all. You just need to define a proper way to compare ranges. My _natural_ way would be to test first for the start agains start, and then end against end. It's not the only way possible of sorting ranges, but its the one I think would be most useful in this scenario. This means that (1-10) < (1-100) < (2-10) < (2-1000). – Martín Valdés de León Jan 16 '13 at 21:07
  • @MartínValdésdeLeón: If I add (12,99) to your example, then it will be quite inefficient to scan through the list and check whether the end point of the previous ranges have ended or not. I'd consider a range as 2 entities (2 endpoints) and sort them instead. – nhahtdh Jan 16 '13 at 21:12
  • Sorting the data ranges is not reasonably possible due to the file format, but I can sort the test ranges. I'll try compressing the test ranges, but the test ranges have no guarantee of overlapping. There could be any number of test ranges. – EAKAE Jan 16 '13 at 21:16
  • I believe what you're suggesting is to sort first by end and then by start. That way the final ordering would be (1-10) < (2-10) < (12-99), (1-100) < (2-1000). And there are two more _natural_ alternatives together with this one and the first one I suggested. They are sure to affect the performance of the improvements, but I do not know just by looking which one would be best. It probably will depend on the specific implementation of the improvements. – Martín Valdés de León Jan 16 '13 at 21:20
1

Check whether my understanding is correct:

  • DataModel and RangeModel represent ranges. (DataModel may contain more data, but it is irrelevant).
  • There are approx. 2 million DataModels, and a small number of RangeModels. (My solution doesn't take advantage of this asymmetry, though)
  • It is necessary to retain the ranges in DataModel as different entities, even if they are overlapping. (If you are interested in the intersection only, you can collapse the ranges when they are near to each other as an optimization).

The method I'm going to describe can do range intersection between 2 list of ranges, regardless of how the ranges look like (overlapping, big range, etc.). The limit is on the sum of the sizes of the 2 list of ranges (sorting is bottleneck), and the number of ranges found (iterating through is a bottleneck).

Split the ranges into 2 EndPoints objects, which indicates: value (int), start or end of range (boolean), the starting EndPoint object (null in start of range; points to the EndPoint object that represents the start of range if end of range), tag (int, which mark whether it is a data, or the range to query).

Put all the EndPoints from both list of ranges together, sort them by value, tie break by putting start in front of end endpoint (if you consider touching being intersection). The complexity of sorting step is O((m + n)log(m + n)).

Loop through the sorted EndPoints according to this pseudo-code:

open_data = HashSet()
open_range = HashSet()

for e in endpoints:
  if e is start of range:
    if e is data:
      print e intersect with all in open_range
      open_data.add(e)
    else: // e is range to test
      print e intersect with all in open_data
      open_range.add(e)
  else: // e is end of range
    if e is data:
      open_data.remove(e.startPoint)
    else: // e is range to test
      open_range.remove(e.startPoint)

Adding and removing from the HashSet is amortized O(1). The problem is with printing intersection, which is O(k), where k is the number of intersections, and can be up to O(m * n) in worst case.

Combined, the complexity is O((m + n)log(m + n) + m * n) in worst case. You may be able to do better based on the property of the data. This is a very general solution.

nhahtdh
  • 55,989
  • 15
  • 126
  • 162
  • Great answer. Last line should be `open_range.remove(e.startPoint)`, I think. – Martín Valdés de León Jan 16 '13 at 23:27
  • Sorry for the lack of clarity, but the DataModel is given in smaller packets of similar ranges of varying length, but they are mostly under 20, so the comparison will be run repeatedly on the same RangeModel and each new DataModel. The total ranges in **all** the data is around 2 billion, but it doesn't really matter. Thanks for the help. – EAKAE Jan 16 '13 at 23:52
0

The ideal solution will depend on the specific characteristics of your data, but sorting your two input sets would be a good first step, which will allow you to reduce the amount of comparisons you will need.

One option would be to create an array from min(startTime) to max(endTime), and at each position, store a reference to an input value which covers this range.

Thus, if your input was
A: [1-5] and B:[3-7], you could have an data structure that looks like

1: A
2: A
3: A,B
4: A,B
5: A,B
6: B
7: B

Then to test for an intersection of [2-4] with the data set, you would simply lookup 2,3,4 in your array list and concatenate the results.

Further speed improvements could be made if you only care about IF there is an intersection, as opposed to exactly who the intersection is with. Or if you only care about AN intersection, rather than ALL intersections.

JohnnyO
  • 3,018
  • 18
  • 30
  • 1
    The main problem with this type of approach is that there is a large number of possible positions to create an array for. The input positions go from 0 to Integer.max. Unfortunately, I'm looking for all intersections between the range list and the data lists. – EAKAE Jan 16 '13 at 21:24
  • If the total range of data is sparsely populated, then a SparseArray implementation seems like a good fit in that scenario. – JohnnyO Jan 16 '13 at 21:28
0

You can sort each list of ranges o(N ln N) and perform a merge sort of those ranges O(N)

This will show up any overlapping ranges with a minimum of CPU time.

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130