I have data from multiple CSV files and I need to merge the tables into one table. The data in question is a text dump of the GeoLite2 database; there are literally millions of rows, and simply loading the data into lists takes 2927 MiB.
The tables contain information about IP networks: some tables describe ASN, some describe city, and others describe country. They are keyed by IP network, their key sets differ but may overlap, and I intend to merge them into one table giving the ASN, country and city of every network listed.
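For context, each network key is a CIDR block, so every CSV row boils down to an inclusive integer range plus a payload. A minimal sketch of that conversion, assuming a network column holding CIDR notation and a single payload column (the column names here are illustrative, not the exact GeoLite2 schema):
import csv
import ipaddress

def rows_to_triplets(path, value_column):
    # Yield (start, end, label) triplets, one per CSV row, where start and
    # end are the first and last address of the block as integers.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            net = ipaddress.ip_network(row["network"])
            yield int(net.network_address), int(net.broadcast_address), row[value_column]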
The question is related to my previous question, but it is different.
Imagine infinitely many boxes arranged in a line, numbered with unique integers, all initially empty. This time each box can hold infinitely many values, but only unique values: if you put A into box 0, box 0 contains A, and no matter how many more times you put A into box 0, it still contains exactly one instance of A. If you then put B into box 0, it contains A and B, and putting B in again still leaves exactly one instance of A and one instance of B.
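A tiny sketch of these box semantics as a dict of sets (the same defaultdict(set) model my brute-force solution further down is built on):
from collections import defaultdict

boxes = defaultdict(set)   # box number -> set of unique values
boxes[0].add('A')
boxes[0].add('A')          # adding 'A' again changes nothing
boxes[0].add('B')
assert boxes[0] == {'A', 'B'}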
Now there are many triplets. The first two elements are integers giving the start and end of an inclusive integer range; each triplet says that every box in that contiguous range (the number of every box is the number of the previous box plus one) receives the same object.
For example, (0, 10, 'A') means boxes 0 to 10 contain an instance of 'A'.
The task is to combine the information from the triplets and describe the state of the boxes using the smallest number of triplets; in the output, the third elements are sets.
Input (0, 10, 'A') -> Output (0, 10, {'A'}). Explanation: boxes 0 to 10 contain an instance of 'A'.
Input (0, 10, 'A'), (11, 20, 'A') -> Output (0, 20, {'A'}). Explanation: boxes 0 to 10 contain an instance of 'A', boxes 11 to 20 also contain an instance of 'A', and 11 is 10 + 1, so boxes 0 to 20 contain an instance of 'A'.
Input (0, 10, 'A'), (20, 30, 'A') -> Output (0, 10, {'A'}), (20, 30, {'A'}). Explanation: boxes 0 to 10 contain an instance of 'A', boxes 20 to 30 also contain an instance of 'A', all other boxes are empty, and 20 is not adjacent to 10, so don't merge.
Input (0, 10, 'A'), (11, 20, 'B') -> Output (0, 10, {'A'}), (11, 20, {'B'})
Input (0, 10, 'A'), (2, 8, 'B') -> Output (0, 1, {'A'}), (2, 8, {'A', 'B'}), (9, 10, {'A'}). Explanation: boxes 0 to 10 have 'A', while boxes 2 to 8 also have 'B', so boxes 2 to 8 have {'A', 'B'}.
Input (0, 10, 'A'), (5, 20, 'B') -> Output (0, 4, {'A'}), (5, 10, {'A', 'B'}), (11, 20, {'B'}). Explanation: same as above.
Input (0, 10, 'A'), (5, 10, 'A') -> Output (0, 10, {'A'}). Explanation: boxes 0 to 10 have 'A'; the second triplet adds no new information, so discard it.
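For reference, here are the examples above written as assertions against a hypothetical combine(ranges) that returns a sorted list of (start, end, set-of-labels) triplets:
assert combine([(0, 10, 'A')]) == [(0, 10, {'A'})]
assert combine([(0, 10, 'A'), (11, 20, 'A')]) == [(0, 20, {'A'})]
assert combine([(0, 10, 'A'), (20, 30, 'A')]) == [(0, 10, {'A'}), (20, 30, {'A'})]
assert combine([(0, 10, 'A'), (11, 20, 'B')]) == [(0, 10, {'A'}), (11, 20, {'B'})]
assert combine([(0, 10, 'A'), (2, 8, 'B')]) == [(0, 1, {'A'}), (2, 8, {'A', 'B'}), (9, 10, {'A'})]
assert combine([(0, 10, 'A'), (5, 20, 'B')]) == [(0, 4, {'A'}), (5, 10, {'A', 'B'}), (11, 20, {'B'})]
assert combine([(0, 10, 'A'), (5, 10, 'A')]) == [(0, 10, {'A'})]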
My current code, which produces correct output for some test cases but raises KeyError for others:
import random
from collections import defaultdict
from typing import Any, List, Tuple

def get_nodes(ranges: List[Tuple[int, int, Any]]) -> List[Tuple[int, int, Any]]:
    # Turn each range into a start event (False) and an end event (True),
    # sorted by position.
    nodes = []
    for ini, fin, data in ranges:
        nodes.extend([(ini, False, data), (fin, True, data)])
    return sorted(nodes)

def combine_gen(ranges):
    # Sweep over the sorted events, tracking the currently active labels in
    # `stack`; stack.remove(data) is where the KeyError comes from when a
    # label that is not present gets removed.
    nodes = get_nodes(ranges)
    stack = set()
    actions = []
    for node, end, data in nodes:
        if not end:
            if (action := (data not in stack)):
                if stack and start < node:
                    yield start, node - 1, stack.copy()
                stack.add(data)
                start = node
            actions.append(action)
        elif actions.pop(-1):
            if start <= node:
                yield start, node, stack.copy()
            start = node + 1
            stack.remove(data)

def merge(segments):
    # Coalesce adjacent segments that carry the same set of labels.
    start, end, data = next(segments)
    for start2, end2, data2 in segments:
        if end + 1 == start2 and data == data2:
            end = end2
        else:
            yield start, end, data
            start, end, data = start2, end2, data2
    yield start, end, data

def combine(ranges):
    return list(merge(combine_gen(ranges)))
It produces correct output for the following test cases:
sample1 = [(0, 20, 'A'), (10, 40, 'B'), (32, 50, 'C'), (40, 50, 'D'), (45, 50, 'E'), (70, 80, 'F'), (90, 100, 'G'), (95, 120, 'H'), (131, 140, 'I'), (140, 150, 'J')]
sample2 = [(0, 10, 'A'), (0, 1, 'B'), (2, 5, 'C'), (3, 4, 'C'), (6, 7, 'C'), (8, 8, 'D'), (110, 150, 'E'), (250, 300, 'C'), (256, 270, 'D'), (295, 300, 'E'), (500, 600, 'F')]
sample3 = [(0, 100, 'A'), (10, 25, 'B'), (15, 25, 'C'), (20, 25, 'D'), (30, 50, 'E'), (40, 50, 'F'), (60, 80, 'G'), (150, 180, 'H')]
sample4 = [(0, 16, 'red'), (0, 4, 'green'), (2, 9, 'blue'), (2, 7, 'cyan'), (4, 9, 'purple'), (6, 8, 'magenta'), (9, 14, 'yellow'), (11, 13, 'orange'), (18, 21, 'green'), (22, 25, 'green')]
I won't include the expected outputs for them here; run my code and you will see what they are, and they are correct.
I have written a function to make test cases, plus a guaranteed-correct but inefficient reference solution, and my efficient code raises KeyError when fed the machine-generated inputs.
def make_generic_case(num, lim, dat):
    # Generate `num` random ranges with bounds below `lim` and integer
    # labels below `dat`, sorted by start ascending and end descending.
    ranges = []
    for _ in range(num):
        start = random.randrange(lim)
        end = random.randrange(lim)
        if start > end:
            start, end = end, start
        ranges.append([start, end, random.randrange(dat)])
    ranges.sort(key=lambda x: (x[0], -x[1]))
    return ranges

def bruteforce_combine(ranges):
    # Fill every box individually, then coalesce adjacent boxes that hold
    # identical sets. Correct by construction, but O(total range length).
    boxes = defaultdict(set)
    for start, end, data in ranges:
        for n in range(start, end + 1):
            boxes[n].add(data)
    boxes = sorted(boxes.items())
    output = []
    lo, cur = boxes.pop(0)
    hi = lo
    for n, data in boxes:
        if cur == data and n - hi == 1:
            hi = n
        else:
            output.append((lo, hi, cur))
            lo = hi = n
            cur = data
    output.append((lo, hi, cur))
    return output
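A minimal sketch of the kind of harness I use to reproduce the KeyError and to check a candidate against the reference (the iteration count and the reporting are arbitrary choices, not my exact script):
def check(candidate, iterations=256):
    # Compare a candidate combine() against the brute-force reference on
    # random cases; surface the first case that fails or raises KeyError.
    for _ in range(iterations):
        case = make_generic_case(512, 4096, 16)
        try:
            assert candidate(case) == bruteforce_combine(case), case
        except KeyError:
            print("KeyError for:", case)
            raise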
Because my code isn't working properly I CAN'T post it on Code Review; Code Review only reviews working code, and mine isn't working yet.
Answers are required to use make_generic_case(512, 4096, 16) to get test cases and to verify the proposed solution's correctness against the output of bruteforce_combine; bruteforce_combine is correct by definition (my logic is the defaultdict(set) one shown above).
What is a more efficient way to combine the overlapping ranges?
Neither of the existing answers is ideal: the first gives the correct result but is very inefficient and will never finish processing my millions of rows:
In [5]: for _ in range(256):
...: case = make_generic_case(512, 4096, 16)
...: assert bruteforce_combine(case) == combine(case)
In [6]: case = make_generic_case(512, 4096, 16)
In [7]: %timeit combine(case)
9.3 ms ± 35 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The second answer is much more efficient. I have since confirmed the correctness of its code and rewritten it as follows:
from collections import Counter

def get_nodes(ranges):
    # Start events are tagged 0 and placed at `start`, end events are
    # tagged 1 and placed at `end + 1`, so ranges become half-open.
    nodes = []
    for start, end, label in ranges:
        nodes.extend(((start, 0, label), (end + 1, 1, label)))
    return sorted(nodes)

def combine(ranges):
    if not ranges:
        return []
    nodes = get_nodes(ranges)
    labels = set()
    state = Counter()
    result = []
    start = nodes[0][0]
    for node, is_end, label in nodes:
        # +1 for a start event, -1 for an end event.
        state[label] += [1, -1][is_end]
        count = state[label]
        # The active label set only changes when a label's count goes
        # from 0 to 1 (first occurrence) or back to 0 (last occurrence).
        if (is_end, count) in {(0, 1), (1, 0)}:
            if start < node:
                # Only emit a segment if some label is active, i.e. skip
                # the empty gaps between disjoint ranges.
                if not count or labels:
                    result.append((start, node - 1, labels.copy()))
                start = node
            # count == 1: add the label; count == 0: remove it.
            (labels.remove, labels.add)[count](label)
    return result
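A quick sanity check against one of the worked examples above:
assert combine([(0, 10, 'A'), (2, 8, 'B')]) == [(0, 1, {'A'}), (2, 8, {'A', 'B'}), (9, 10, {'A'})]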
And it is still too inefficient; I need to process literally millions of rows:
In [2]: for _ in range(128):
...: case = make_generic_case(256, 4096, 16)
...: assert bruteforce_combine(case) == combine(case)
In [3]: for _ in range(2048):
...: case = make_generic_case(512, 2048, 16)
...: assert bruteforce_combine(case) == combine(case)
In [4]: case = make_generic_case(2048, 2**64, 32)
In [5]: %timeit combine(case)
4.19 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: case = make_generic_case(32768, 2**64, 32)
In [7]: %timeit combine(case)
116 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: case = make_generic_case(1048576, 2**64, 32)
In [9]: %timeit combine(case)
5.12 s ± 30.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I have data from 6 gigantic CSV files; the total number of rows is:
In [74]: 495209+129884+3748518+1277097+429639+278661
Out[74]: 6359008
That is well over 6 million rows; merely loading the data into RAM takes 2.9 GiB, and I have only 16 GiB of RAM. I need a solution that is much more efficient, in both time complexity and space complexity.