2

I am trying to iterate over two loops, the first segment_arr is a list of small (e.g. 0.1m) segments of a line (sorted by distance along the line), and the second fail_sections is a list of larger sections of that line (e.g. 5m) which "fail" in some way (also sorted by distance along the line). Ultimately I am trying to merge float ranges - there are some answers here, but all based on integers and with some caveats about overlapping.

I have a very naive version, which seems to be performant enough for my purposes, but it got me thinking, how could this be made more efficient:

new_seg_array = []
for seg in segment_arr:
        segfail = False
        for fail_sec in fail_sections:
            if seg.start_dist >= fail_sec.start_dist and seg.end_dist <= fail_sec.end_dist:
                segfail = True
        seg_data = Segment(start_dist=seg.start_dist,end_dist=seg.end_dist, does_fail=segfail)
        new_seg_array.append(seg_data)

The main problem is that the second loop is wasting iterations there is no chance that the condition can be true as the second range is well outside where we are in the first range. I considered using a generator expression e.g.

filtered_fail_sections = (x for x in fail_sections where x.start_distance > seg.start_distance and x.end_distance < seg.end_distance)
for fail_sec in filtered_fail_sections:

to filter the relevant fail segments, but it struck me that this was just doing the filtering work in the generator. Is there any way in Python of gradually reducing the scope of the second loop, so elements that are no longer relevant because they are above the range of the first loop are no longer iterated over? So over time the second loop would get smaller, until it was nothing, which would presumably help performance on much larger datasets. Or any other major efficiency improvements possible?

Stev_k
  • 2,118
  • 3
  • 22
  • 36

1 Answers1

1

You can store the fail-segments that are currently "active" on a heap, sorted by their end-value, then pop them from the heap when that end-value is smaller than the current segment's end. Then, if there are any elements on the heap, the current segment fails.

import heapq
segments = [(1,2), (3,4), (5,7), (9,10)]
fail = [(0,4), (8,11)]

idx = 0
covering = []
for (start, end) in segments:
    while idx < len(fail) and fail[idx][0] <= start:
        heapq.heappush(covering, fail[idx][1])
        idx += 1
    while covering and covering[0] < end:
        heapq.heappop(covering)
    print(start, end, covering)    

Example output:

1 2 [4]
3 4 [4]
5 7 []
9 10 [11]

This way, you not only know if the current segment fails, but also which and how many failing segments it intersects. Note that if you want to put the full failing segment onto the heap, you should use a tuple (end, f_seg) to ensure that the one with the lowest end is sorted first on the heap.

Of course, if you do not need to know the which or how many failing segments are intersected, you can just keep track of the max end value of any of the already started failing segments, without using a heap.

idx = 0
last_fail = -1
for (start, end) in segments:
    while idx < len(fail) and fail[idx][0] <= start:
        last_fail = max(last_fail, fail[idx][1])
        idx += 1
    print(start, end, last_fail >= end)

Complexity for the heap-based solution would be O(n+mlogm) for n segments and m fail-segments, and just O(n+m) without heap. (Assuming both lists are already sorted.)

tobias_k
  • 81,265
  • 12
  • 120
  • 179