This Python script I wrote to split overlapping ranges into unique ranges (last iteration). It produces correct output and outperforms the version given in the answer. I tested output against known correct method's output and output of another brute force approach. I confirmed correctness of all methods and my code is the most efficient.
An infinite number of boxes arranged in a line are numbered. Each can only hold one object, whatever was last put into the box. They are initially empty. Now imagine a list of triplets, first two elements of each triplet are integers (the first no greater than the second) and each triplet represents an instruction to put the third element into the boxes. Triplet (0, 10, 'A')
means "put 'A'
into boxes 0 to 10 (inclusive)" and after execution of the instruction boxes 0 to 10 each contain an instance of 'A'
. Empty boxes are to be ignored. The task is to describe the state of boxes after executing all instructions, using the least number of triplets. The rules are simple, there are narrow cases and general cases:
Narrow case: given triplets (s1, e1, d1)
and (s2, e2, d2)
, s1 < s2 < e1 < e2
is always False
(all pairings in the input conform to this). There are four sub cases:
Case 1: s1 = s2 and e2 < e1:
Whichever ends first wins. Boxes always hold the value from the winner, given
[(0, 10, 'A'), (0, 5, 'B')]
, after execution the state of the boxes is[(0, 5, 'B'), (6, 10, 'a')]
.Case 2: s1 < s2 and e2 < e1:
Whichever ends first wins, example:
[(0, 10, 'A'), (5, 7, 'B')]
->[(0, 4, 'A'), (5, 7, 'B'), (8, 10, 'A')]
.Case 3: s1 < s2 and e2 = e1:
Same rule as above, but here it is a tie. In case of ties, whichever starts later wins:
[(0, 10, 'A'), (6, 10, 'B')]
->[(0, 5, 'A'), (6, 10, 'B')]
.Case 4: s1 = s2 and e1 = e2:
This is special. In case of true ties, whichever comes later in the input wins. My code doesn't guarantee the objects in the boxes are from the latest instructions in cases like this (but it guarantees all boxes have only one object).
Additional rules:
If there are no updates, do nothing.
[(0, 10, 'A'), (5, 6, 'A')]
->[(0, 10, 'A')]
If there are gaps, leave them as is (those boxes are empty):
[(0, 10, 'A'), (15, 20, 'B')]
->[(0, 10, 'A'), (15, 20, 'B')]
If e1 + 1 = s2 and d1 = d2, join them (least amount of triplets).
[(0, 10, 'A'), (11, 20, 'A')]
->[(0, 20, 'A')]
The general case is when the primary condition of the narrow case doesn't hold, in case of true intersections. Whichever starts later wins.
[(0, 10, 'A'), (5, 20, 'B')]
->[(0, 4, 'A'), (5, 20, 'B')]
The brute force implementation is correct but slow and can handle the general cases:
def brute_force_discretize(ranges):
numbers = {}
ranges.sort(key=lambda x: (x[0], -x[1]))
for start, end, data in ranges:
numbers |= {n: data for n in range(start, end + 1)}
numbers = list(numbers.items())
l = len(numbers)
i = 0
output = []
while i < l:
di = 0
curn, curv = numbers[i]
while i < l and curn + di == numbers[i][0] and curv == numbers[i][1]:
i += 1
di += 1
output.append((curn, numbers[i-1][0], curv))
return output
Smart implementation that is performant but only handles the narrow case:
from typing import Any, List, Tuple
def get_nodes(ranges: List[Tuple[int, int, Any]]) -> List[Tuple[int, int, Any]]:
nodes = []
for ini, fin, data in ranges:
nodes.extend([(ini, False, data), (fin, True, data)])
return sorted(nodes)
def merge_ranges(data: List[List[int | Any]], range: List[int | Any]) -> None:
if not data or range[2] != (last := data[-1])[2] or range[0] > last[1] + 1:
data.append(range)
else:
last[1] = range[1]
def discretize_narrow(ranges):
nodes = get_nodes(ranges)
output = []
stack = []
actions = []
for node, end, data in nodes:
if not end:
action = False
if not stack or data != stack[-1]:
if stack and start < node:
merge_ranges(output, [start, node - 1, stack[-1]])
stack.append(data)
start = node
action = True
actions.append(action)
elif actions.pop(-1):
if start <= node:
merge_ranges(output, [start, node, stack.pop(-1)])
start = node + 1
else:
stack.pop(-1)
return output
Full script that generates narrow cases. Performance:
In [518]: sample = make_sample(2048, 65536, 16)
In [519]: %timeit descretize(sample)
4.46 ms ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [520]: %timeit discretize_narrow(sample)
3.13 ms ± 34.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [521]: list(map(tuple, discretize_narrow(sample))) == descretize(sample)
Out[521]: True
How can I make my code faster? Inputs are sorted in ascending order, but my code assumes it isn't. If I split the top level data into non-overlapping ranges (taking advantage of input being sorted), accumulate triplets to a stack while there are overlaps and push the stack to discretize and clear the stack when overlaps are over, then join intermediate results, the code can be faster. I can't get this working.
And how can I handle the general case? I think I need to know when ranges end (four states) but can't handle it correctly:
def get_quadruples(ranges):
nodes = []
for ini, fin, data in ranges:
nodes.extend([(ini, False, -fin, data), (fin, True, ini, data)])
return sorted(nodes)
I can't post this on Code Review because this is a "how" question (I haven't achieved my goals). To demonstrate Interval Tree's slowness, this answer on Code Review uses it:
In [531]: %timeit discretize_narrow([(0, 10, 'A'), (0, 1, 'B'), (2, 5, 'C'), (3, 4, 'C'), (6, 7, 'C'), (8, 8, 'D'), (110, 150, 'E'), (250, 300, 'C'), (256, 270, 'D'), (295, 300, 'E'), (500, 600, 'F')])
14.3 µs ± 42.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [532]: %timeit merge_rows([(0, 10, 'A'), (0, 1, 'B'), (2, 5, 'C'), (3, 4, 'C'), (6, 7, 'C'), (8, 8, 'D'), (110, 150, 'E'), (250, 300, 'C'), (256, 270, 'D'), (295, 300, 'E'), (500, 600, 'F')])
891 µs ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [533]: data = [(0, 10, 'A'), (0, 1, 'B'), (2, 5, 'C'), (3, 4, 'C'), (6, 7, 'C'), (8, 8, 'D'), (110, 150, 'E'), (250, 300, 'C'), (256, 270, 'D'), (295, 300, 'E'), (500, 600, 'F')]
In [534]: merge_rows(data) == discretize_narrow(data)
Out[534]: True
In [535]: sample = make_sample(256, 65536, 16)
In [536]: merge_rows(sample) == discretize_narrow(sample)
Out[536]: True
In [537]: %timeit discretize_narrow(sample)
401 µs ± 3.33 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [538]: %time result = merge_rows(sample)
CPU times: total: 78.1 ms
Wall time: 56 ms
It is not efficient but I don't think I can implement a more efficient version. Test cases can be generated using code in the linked GitHub page. My logic is the same as dictionary update and my brute force implementation is just that (using dictionary update to process the ranges, it is by definition correct). Correctness of my smart approach was verified using output of the brute force one.
Manual test cases:
In [539]: discretize_narrow([(0, 10, 'A'), (0, 1, 'B'), (2, 5, 'C'), (3, 4, 'C'), (6, 7, 'C'), (8, 8, 'D'), (110, 150, 'E'), (250, 300, 'C'), (256, 270, 'D'), (295, 300, 'E'), (500, 600, 'F')])
Out[539]:
[[0, 1, 'B'],
[2, 7, 'C'],
[8, 8, 'D'],
[9, 10, 'A'],
[110, 150, 'E'],
[250, 255, 'C'],
[256, 270, 'D'],
[271, 294, 'C'],
[295, 300, 'E'],
[500, 600, 'F']]
In [540]: discretize_narrow([(0, 100, 'A'), (10, 25, 'B'), (15, 25, 'C'), (20, 25, 'D'), (30, 50, 'E'), (40, 50, 'F'), (60, 80, 'G'), (150, 180, 'H')])
Out[540]:
[[0, 9, 'A'],
[10, 14, 'B'],
[15, 19, 'C'],
[20, 25, 'D'],
[26, 29, 'A'],
[30, 39, 'E'],
[40, 50, 'F'],
[51, 59, 'A'],
[60, 80, 'G'],
[81, 100, 'A'],
[150, 180, 'H']]
Machine generated general case (and correct output):
In [542]: ranges = []
In [543]: for _ in range(20):
...: start = random.randrange(100)
...: end = random.randrange(100)
...: if start > end:
...: start, end = end, start
...: ranges.append([start, end, random.randrange(5)])
In [544]: ranges.sort()
In [545]: ranges
Out[545]:
[[0, 31, 0],
[1, 47, 1],
[1, 67, 0],
[10, 68, 0],
[15, 17, 2],
[18, 39, 0],
[19, 73, 3],
[25, 32, 0],
[26, 33, 1],
[26, 72, 2],
[26, 80, 2],
[28, 28, 1],
[29, 31, 4],
[30, 78, 2],
[36, 47, 0],
[36, 59, 4],
[44, 67, 3],
[52, 61, 4],
[58, 88, 1],
[64, 92, 1]]
In [546]: brute_force_discretize(ranges)
Out[546]:
[(0, 0, 0),
(1, 9, 1),
(10, 14, 0),
(15, 17, 2),
(18, 18, 0),
(19, 24, 3),
(25, 25, 0),
(26, 28, 1),
(29, 29, 4),
(30, 35, 2),
(36, 43, 0),
(44, 51, 3),
(52, 57, 4),
(58, 92, 1)]
Edit
Function that makes generic cases:
def make_generic_case(num, lim, dat):
ranges = []
for _ in range(num):
start = random.randrange(lim)
end = random.randrange(lim)
if start > end:
start, end = end, start
ranges.append([start, end, random.randrange(dat)])
ranges.sort()
return ranges
Sample inputs that code from existing answer disagrees with the correct result:
Test case:
[
[0, 31, 0],
[1, 47, 1],
[1, 67, 0],
[10, 68, 0],
[15, 17, 2],
[18, 39, 0],
[19, 73, 3],
[25, 32, 0],
[26, 33, 1],
[26, 72, 2],
[26, 80, 2],
[28, 28, 1],
[29, 31, 4],
[30, 78, 2],
[36, 47, 0],
[36, 59, 4],
[44, 67, 3],
[52, 61, 4],
[58, 88, 1],
[64, 92, 1]
]
Output:
[
(0, 0, 0),
(1, 9, 1),
(10, 14, 0),
(15, 17, 2),
(18, 18, 0),
(19, 24, 3),
(25, 25, 0),
(26, 28, 1),
(29, 29, 4),
(30, 35, 2),
(36, 43, 0),
(44, 51, 3),
(52, 57, 4),
(58, 88, 1)
]
The output is almost the same as the correct one, except for the last range, it should be (58, 92, 1)
instead of (58, 88, 1)
.
And another case:
[
[4, 104, 4],
[22, 463, 2],
[24, 947, 2],
[36, 710, 1],
[37, 183, 1],
[39, 698, 7],
[51, 438, 4],
[60, 450, 7],
[120, 383, 2],
[130, 193, 7],
[160, 562, 5],
[179, 443, 6],
[186, 559, 6],
[217, 765, 2],
[221, 635, 2],
[240, 515, 3],
[263, 843, 3],
[274, 759, 6],
[288, 389, 5],
[296, 298, 6],
[333, 1007, 1],
[345, 386, 5],
[356, 885, 3],
[377, 435, 5],
[407, 942, 7],
[423, 436, 1],
[484, 926, 5],
[496, 829, 0],
[559, 870, 5],
[610, 628, 1],
[651, 787, 4],
[735, 927, 1],
[765, 1002, 1]
]
Output:
[(4, 21, 4),
(22, 35, 2),
(36, 38, 1),
(39, 50, 7),
(51, 59, 4),
(60, 119, 7),
(120, 129, 2),
(130, 159, 7),
(160, 178, 5),
(179, 216, 6),
(217, 239, 2),
(240, 273, 3),
(274, 287, 6),
(288, 295, 5),
(296, 298, 6),
(299, 332, 5),
(333, 344, 1),
(345, 355, 5),
(356, 376, 3),
(377, 406, 5),
(407, 422, 7),
(423, 436, 1),
(437, 483, 7),
(484, 495, 5),
(496, 558, 0),
(559, 609, 5),
(610, 628, 1),
(629, 650, 5),
(651, 734, 4),
(735, 927, 1),
(928, 942, 7),
(943, 1007, 1)]
Again, it is almost correct, but the last three ranges are wrong.
They are (735, 927, 1), (928, 942, 7), (943, 1007, 1)
here but they should be (735, 1007, 1)
instead.
I have tested a few other inputs and the output of the proposed solution is correct, but I have identified some edge cases where it isn't.
I have just implemented the algorithm found here, but it isn't working correctly and fails the narrow case:
def discretize_gen(ranges):
nodes = get_nodes(ranges)
stack = []
for (n1, e1, d1), (n2, e2, _) in zip(nodes, nodes[1:]):
if e1:
stack.remove(d1)
else:
stack.append(d1)
start = n1 + e1
end = n2 - (not e2)
if start <= end and stack:
yield start, end, stack[-1]
Use like this: list(merge(discretize_gen(ranges)))
, merge
function can be found in the answer below.
It fails for many inputs, I have just wrote an ugly function to compare the output against the correct output:
def compare_results(ranges):
correct = brute_force_discretize(ranges)
correct_set = set(correct)
output = list(merge(discretize_gen(ranges)))
output_set = set(output)
errors = output_set ^ correct_set
indices = [(i, correct.index(e)) for i, e in enumerate(output) if e not in errors]
indices.append((len(output), None))
comparison = [(a, b) if c - a == 1 else (slice(a, c), slice(b, d)) for (a, b), (c, d) in zip(indices, indices[1:])]
result = {}
for a, b in comparison:
key = output[a]
val = correct[b]
if isinstance(key, list):
key = tuple(key)
result[key] = val
return result
A test case where it fails:
[(73, 104, 3), (75, 98, 0), (78, 79, 3), (83, 85, 3), (88, 90, 2)]
Comparison:
{(73, 74, 3): (73, 74, 3),
((75, 77, 0), (78, 87, 3)): [(75, 77, 0),
(78, 79, 3),
(80, 82, 0),
(83, 85, 3),
(86, 87, 0)],
((88, 90, 2), (91, 104, 3)): [(88, 90, 2), (91, 98, 0), (99, 104, 3)]}
I don't know why it fails, the answer has a score of 11 and is accepted, yet it doesn't work somehow. How to fix it?
Edit
There was an indentation error in the function make_generic_case
that causes the appending action to be unintentionally part of a if condition block, so that a new range will only be appended if start is greater than end, the start and end variables will be swapped and the range will be appended.
My intention was to append a new range for all generated pairs, I fixed the problem by unindenting the line. I only noticed the error now, the error was caused by copypasting, I pasted the code into Visual Studio Code and the auto indentation fixing broke the indentation.