
I have data from multiple CSV files and I need to merge the tables into one table. The data in question is a text dump of the GeoLite2 database; there are literally millions of rows, and simply loading the data into lists takes 2927 MiB.

The tables contain information about IP networks: some contain ASN information, some city information, and some country information. These tables have different keys (IP networks), and they may share common keys. I intend to merge them into one table containing the ASN, country, and city of every network listed.
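For context, each CIDR network in those CSVs can be converted into an inclusive integer range with the standard-library ipaddress module; a minimal sketch (the label value here is hypothetical, not an actual GeoLite2 column):

```python
import ipaddress

def network_to_range(cidr, label):
    # Convert a CIDR string such as '1.0.0.0/24' into the
    # (start, end, label) triplet form used in this question.
    net = ipaddress.ip_network(cidr)
    return (int(net.network_address), int(net.broadcast_address), label)

# Example: a /24 spans 256 consecutive addresses
triplet = network_to_range('1.0.0.0/24', 'AS13335')
```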

The question is related to my previous question, but it is different.

Imagine infinitely many boxes arranged in a line, numbered with unique integers, all initially empty. This time, every box can hold infinitely many values, but only unique ones. Meaning, if you put A into box 0, box 0 contains A, and after that, no matter how many times you put A into box 0, box 0 always contains exactly one instance of A. If you then put B into box 0, box 0 contains A and B; put B in again and box 0 still contains one instance of A and one instance of B.

Now there are many triplets. The first two elements of each are integers marking the start and end of an inclusive integer range, so each triplet describes a contiguous run of boxes (the number of every box is the number of the previous box plus one) that all contain the same object.

For example, (0, 10, 'A') means boxes 0 to 10 contain an instance of 'A'.

The task is to combine the information from the triplets and describe the state of the boxes in the fewest possible triplets; in the output, the third elements are sets.

Input (0, 10, 'A') -> Output (0, 10, {'A'}), explanation: boxes 0 to 10 contain an instance of 'A'.

Input (0, 10, 'A'), (11, 20, 'A') -> Output (0, 20, {'A'}), explanation: boxes 0 to 10 contain an instance of 'A', and boxes 11 to 20 also contain an instance of 'A', 11 is 10 + 1, so boxes 0 to 20 contain an instance of 'A'.

Input (0, 10, 'A'), (20, 30, 'A') -> Output (0, 10, {'A'}), (20, 30, {'A'}), explanation: boxes 0 to 10 contain an instance of 'A', and boxes 20 to 30 also contain an instance of 'A', all other boxes are empty, and 20 is not adjacent to 10, don't merge.

Input (0, 10, 'A'), (11, 20, 'B') -> Output (0, 10, {'A'}), (11, 20, {'B'})

Input (0, 10, 'A'), (2, 8, 'B') -> Output (0, 1, {'A'}), (2, 8, {'A', 'B'}), (9, 10, {'A'}), explanation: boxes 0 to 10 have 'A', while boxes 2 to 8 have 'B', so boxes 2 to 8 have {'A', 'B'}.

Input (0, 10, 'A'), (5, 20, 'B') -> Output (0, 4, {'A'}), (5, 10, {'A', 'B'}), (11, 20, {'B'}) explanation: same as above.

Input (0, 10, 'A'), (5, 10, 'A') -> Output (0, 10, {'A'}), explanation: boxes 0 to 10 have 'A', the second triplet adds no new information and is garbage, discard it.
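The worked examples above can be checked mechanically with a tiny box-by-box reference implementation (a sketch of the semantics only, far too slow for real data; `naive_combine` is a name I'm introducing here):

```python
from collections import defaultdict

def naive_combine(ranges):
    # Fill every box individually, then coalesce adjacent boxes
    # that hold identical sets -- a direct reading of the spec.
    boxes = defaultdict(set)
    for start, end, data in ranges:
        for n in range(start, end + 1):
            boxes[n].add(data)
    out = []
    for n, labels in sorted(boxes.items()):
        if out and out[-1][1] + 1 == n and out[-1][2] == labels:
            out[-1] = (out[-1][0], n, labels)  # extend previous run
        else:
            out.append((n, n, labels))  # start a new run
    return out

assert naive_combine([(0, 10, 'A')]) == [(0, 10, {'A'})]
assert naive_combine([(0, 10, 'A'), (11, 20, 'A')]) == [(0, 20, {'A'})]
assert naive_combine([(0, 10, 'A'), (2, 8, 'B')]) == \
    [(0, 1, {'A'}), (2, 8, {'A', 'B'}), (9, 10, {'A'})]
```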

My current code that produces correct output for some test cases but raises KeyError for others:

import random
from collections import defaultdict
from typing import Any, List, Tuple

def get_nodes(ranges: List[Tuple[int, int, Any]]) -> List[Tuple[int, int, Any]]:
    nodes = []
    for ini, fin, data in ranges:
        nodes.extend([(ini, False, data), (fin, True, data)])
    return sorted(nodes)


def combine_gen(ranges):
    nodes = get_nodes(ranges)
    stack = set()
    actions = []
    for node, end, data in nodes:
        if not end:
            if (action := (data not in stack)):
                if stack and start < node:
                    yield start, node - 1, stack.copy()
                stack.add(data)
                start = node
            actions.append(action)
        elif actions.pop(-1):
            if start <= node:
                yield start, node, stack.copy()
                start = node + 1
            stack.remove(data)


def merge(segments):
    start, end, data = next(segments)
    for start2, end2, data2 in segments:
        if end + 1 == start2 and data == data2:
            end = end2
        else:
            yield start, end, data
            start, end, data = start2, end2, data2
    yield start, end, data


def combine(ranges):
    return list(merge(combine_gen(ranges)))

It produces correct output for the following test cases:

sample1 = [(0, 20, 'A'), (10, 40, 'B'), (32, 50, 'C'), (40, 50, 'D'), (45, 50, 'E'), (70, 80, 'F'), (90, 100, 'G'), (95, 120, 'H'), (131, 140, 'I'), (140, 150, 'J')]
sample2 = [(0, 10, 'A'), (0, 1, 'B'), (2, 5, 'C'), (3, 4, 'C'), (6, 7, 'C'), (8, 8, 'D'), (110, 150, 'E'), (250, 300, 'C'), (256, 270, 'D'), (295, 300, 'E'), (500, 600, 'F')]
sample3 = [(0, 100, 'A'), (10, 25, 'B'), (15, 25, 'C'), (20, 25, 'D'), (30, 50, 'E'), (40, 50, 'F'), (60, 80, 'G'), (150, 180, 'H')]
sample4 = [(0, 16, 'red'), (0, 4, 'green'), (2, 9, 'blue'), (2, 7, 'cyan'), (4, 9, 'purple'), (6, 8, 'magenta'), (9, 14, 'yellow'), (11, 13, 'orange'), (18, 21, 'green'), (22, 25, 'green')]

I won't include the expected outputs here; run my code and you will find out what they are. The outputs are correct.

I have written a function to generate test cases, plus a guaranteed-correct but inefficient solution, and my efficient code raises KeyError when fed the machine-generated inputs.

def make_generic_case(num, lim, dat):
    ranges = []

    for _ in range(num):
        start = random.randrange(lim)
        end = random.randrange(lim)
        if start > end:
            start, end = end, start
        ranges.append([start, end, random.randrange(dat)])

    ranges.sort(key=lambda x: (x[0], -x[1]))
    return ranges


def bruteforce_combine(ranges):
    boxes = defaultdict(set)
    for start, end, data in ranges:
        for n in range(start, end + 1):
            boxes[n].add(data)
    
    boxes = sorted(boxes.items())
    output = []
    lo, cur = boxes.pop(0)
    hi = lo

    for n, data in boxes:
        if cur == data and n - hi == 1:
            hi = n
        else:
            output.append((lo, hi, cur))
            lo = hi = n
            cur = data

    output.append((lo, hi, cur))
    return output

Because my code isn't working properly, I CAN'T post it on Code Review: Code Review only reviews working code, and mine isn't.

Answers are required to use make_generic_case(512, 4096, 16) to get test cases and verify the proposed solution's correctness against the output of bruteforce_combine; bruteforce_combine is correct by definition (my logic is defaultdict(set)).

What is a more efficient way to combine the overlapping ranges?


Neither of the existing answers is ideal. The first gives correct results but is very inefficient and will never finish processing my millions of rows:

In [5]: for _ in range(256):
   ...:     case = make_generic_case(512, 4096, 16)
   ...:     assert bruteforce_combine(case) == combine(case)

In [6]: case = make_generic_case(512, 4096, 16)

In [7]: %timeit combine(case)
9.3 ms ± 35 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The second is much more efficient, but I haven't tested it thoroughly yet.


I have confirmed the correctness of the code from the second answer, and I have rewritten it to the following:

from collections import Counter

def get_nodes(ranges):
    nodes = []
    for start, end, label in ranges:
        nodes.extend(((start, 0, label), (end + 1, 1, label)))

    return sorted(nodes)


def combine(ranges):
    if not ranges:
        return []
    nodes = get_nodes(ranges)
    labels = set()
    state = Counter()
    result = []
    start = nodes[0][0]
    for node, is_end, label in nodes:
        # +1 for a start event, -1 for an end event
        state[label] += [1, -1][is_end]
        count = state[label]
        # Only react when a label first appears or fully disappears
        if (is_end, count) in {(0, 1), (1, 0)}:
            if start < node:
                if not count or labels:
                    result.append((start, node - 1, labels.copy()))

                start = node

            # count == 1: add the label; count == 0: remove it
            (labels.remove, labels.add)[count](label)

    return result

And it is still too inefficient; I need to process literally millions of rows:

In [2]: for _ in range(128):
   ...:     case = make_generic_case(256, 4096, 16)
   ...:     assert bruteforce_combine(case) == combine(case)

In [3]: for _ in range(2048):
   ...:     case = make_generic_case(512, 2048, 16)
   ...:     assert bruteforce_combine(case) == combine(case)

In [4]: case = make_generic_case(2048, 2**64, 32)

In [5]: %timeit combine(case)
4.19 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: case = make_generic_case(32768, 2**64, 32)

In [7]: %timeit combine(case)
116 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: case = make_generic_case(1048576, 2**64, 32)

In [9]: %timeit combine(case)
5.12 s ± 30.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I have data from 6 gigantic CSV files; the total number of rows is:

In [74]: 495209+129884+3748518+1277097+429639+278661
Out[74]: 6359008

That is well over 6 million rows; merely loading the data into RAM takes 2.9 GiB, and I have only 16 GiB of RAM. I need a solution that is much more efficient in both time and space complexity.

Ξένη Γήινος
    This sounds like [the skyline problem](https://www.educative.io/answers/the-skyline-problem-in-cpp), only that you wouldn't need the vertical lines. – InSync Jul 30 '23 at 15:44
  • Also worth checking out is the [interval tree](https://en.wikipedia.org/wiki/Interval_tree); for example, Python module [intervaltree](https://pypi.org/project/intervaltree/). – Neil Jul 30 '23 at 17:31
  • @Neil I have checked that, and sadly it is highly inefficient. I have benchmarked it thoroughly. – Ξένη Γήινος Jul 30 '23 at 17:45

3 Answers


Based on @InSync's hint, I created a solution that collects the critical points in a dictionary and processes them:

import random
from collections import defaultdict, Counter

# Original code

def make_generic_case(num, lim, dat):
    ranges = []

    for _ in range(num):
        start = random.randrange(lim)
        end = random.randrange(lim)
        if start > end:
            start, end = end, start
        ranges.append([start, end, random.randrange(dat)])

    ranges.sort(key=lambda x: (x[0], -x[1]))
    return ranges


def bruteforce_combine(ranges):
    boxes = defaultdict(set)
    for start, end, data in ranges:
        for n in range(start, end + 1):
            boxes[n].add(data)
    
    boxes = sorted(boxes.items())
    output = []
    lo, cur = boxes.pop(0)
    hi = lo

    for n, data in boxes:
        if cur == data and n - hi == 1:
            hi = n
        else:
            output.append((lo, hi, cur))
            lo = hi = n
            cur = data

    output.append((lo, hi, cur))
    return output


# My code

def crit_points(ranges):
    # Map each critical point to the (is_start, label) changes that
    # happen there; end events are recorded at end + 1 so that the
    # ranges stay inclusive.
    result = defaultdict(list)

    for r in ranges:
        result[r[1] + 1].append((False, r[2]))

    for r in ranges:
        result[r[0]].append((True, r[2]))

    return sorted(result.items())


def combine(ranges):
    if len(ranges) == 0:
        return []

    active_count = Counter()
    active_idx = 0
    result = []

    for idx, changes in crit_points(ranges):
        # Unary + on a Counter keeps only entries with positive
        # counts, i.e. the currently active labels.
        setcopy = set(+active_count)

        for s, v in changes:
            if s:
                active_count[v] += 1
            else:
                active_count[v] -= 1

        active_count = +active_count

        if idx > active_idx and setcopy != set(active_count):
            if setcopy:
                result.append((active_idx, idx - 1, setcopy))
            active_idx = idx

    if idx > active_idx:
        result.append((active_idx, idx - 1, setcopy))

    return result



# Simple test and benchmark

from time import perf_counter

case = make_generic_case(512, 4096, 16)

t = perf_counter()
solution1 = bruteforce_combine(case)
print(f"Brute force: {perf_counter() - t}")

t = perf_counter()
solution2 = combine(case)
print(f"Mine: {perf_counter() - t}")

print(f"Valid: {solution1 == solution2}")
Michael Butscher
    `time.time()` is not as precise as [`time.perf_counter()`](https://docs.python.org/3/library/time.html#time.perf_counter)/`.perf_counter_ns()` when it comes to performance benchmarking. Also, Python has a built-in library for timing small functions: [`timeit`](https://docs.python.org/3/library/timeit.html). – InSync Jul 30 '23 at 17:50
  • @InSync Changed to `perf_counter`, thanks. – Michael Butscher Jul 30 '23 at 18:06

Here is a simple strategy.

You start with lists of tuples of the form (start, end, label).

Convert each tuple into two: (start, 0, label) and (end+1, 1, label).

Sort the tuples. Because the end event is recorded at end+1 with flag 1, an end event for the previous box sorts after a start event for the following box at the same coordinate, so adjacent ranges can be merged.

Run through them and produce the answer. Note that we are always emitting ranges that end on the box previous to the one that we are currently processing.

For a large data set, I would actually put the intermediate form into a file, sort that outside of Python, then load from there. But that is an implementation detail. Here is some sample code.

def add_sample (sample, merged=None):
    if merged is None:
        merged = []
    for (start, end, label) in sample:
        merged.append((start, 0, label))
        merged.append((end+1, 1, label))
    return merged

def extract_result (merged):
    merged = sorted(merged)
    if 0 == len(merged):
        return []
    labels = set()
    state = {}
    answer = []
    start_time = merged[0][0]
    for (time, is_end, label) in merged:
        if is_end == 0:
            new_count = state.get(label, 0) + 1
            if new_count == 1:
                if start_time < time:
                    if 0 < len(labels):
                        answer.append((start_time, time-1, list(sorted(labels))))
                    start_time = time
                labels.add(label)
            state[label] = new_count
        else:
            new_count = state.get(label, 0) - 1
            if new_count == 0:
                if start_time < time:
                    answer.append((start_time, time-1, list(sorted(labels))))
                    start_time = time
                labels.remove(label)
            state[label] = new_count

    return answer

sample1 = [(0, 20, 'A'), (10, 40, 'B'), (32, 50, 'C'), (40, 50, 'D'), (45, 50, 'E'), (70, 80, 'F'), (90, 100, 'G'), (95, 120, 'H'), (131, 140, 'I'), (140, 150, 'J')]
sample2 = [(0, 10, 'A'), (0, 1, 'B'), (2, 5, 'C'), (3, 4, 'C'), (6, 7, 'C'), (8, 8, 'D'), (110, 150, 'E'), (250, 300, 'C'), (256, 270, 'D'), (295, 300, 'E'), (500, 600, 'F')]
sample3 = [(0, 100, 'A'), (10, 25, 'B'), (15, 25, 'C'), (20, 25, 'D'), (30, 50, 'E'), (40, 50, 'F'), (60, 80, 'G'), (150, 180, 'H')]
sample4 = [(0, 16, 'red'), (0, 4, 'green'), (2, 9, 'blue'), (2, 7, 'cyan'), (4, 9, 'purple'), (6, 8, 'magenta'), (9, 14, 'yellow'), (11, 13, 'orange'), (18, 21, 'green'), (22, 25, 'green')]

merged = add_sample(sample1)
merged = add_sample(sample2, merged)
merged = add_sample(sample3, merged)
merged = add_sample(sample4, merged)
print(extract_result(merged))
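The "sort outside of Python" idea can also be approximated in-process: write pre-sorted chunks of events to disk and stream them back with heapq.merge, so only one event per chunk needs to be in memory at a time. A minimal in-memory sketch of that k-way merge (the chunking scheme here is hypothetical):

```python
import heapq

def merge_event_chunks(chunks):
    # k-way merge of already-sorted event iterables (stand-ins for
    # sorted chunk files on disk); yields one globally sorted
    # stream without materializing all events at once.
    return heapq.merge(*chunks)

# Two pre-sorted chunks of (coordinate, is_end, label) events
chunk_a = [(0, 0, 'A'), (21, 1, 'A')]
chunk_b = [(10, 0, 'B'), (41, 1, 'B')]
events = list(merge_event_chunks([chunk_a, chunk_b]))
```

In a real pipeline each chunk would be a generator reading one sorted temporary file, so peak memory stays proportional to the number of chunks rather than the number of rows.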
btilly

You could reuse a genomic interval arithmetic package like BEDOPS. What you are trying to do has been solved with low memory overhead by BEDOPS, which was published around 2012.

You would think about your intervals on a "pseudo-chromosome" or fake set name. This is not a big deal, in that putting all data onto one fake "chromosome" is the equivalent of putting all of your intervals into one set, for the purposes of set operations.

You'd likely use a combination of BEDOPS bedops and bedmap tools, i.e. something like:

$ bedops --merge setA.txt setB.txt ... setN.txt > merge.txt
$ bedops --everything setA.txt setB.txt ... setN.txt > union.txt
$ bedmap --echo --echo-map --delim '\t' merge.txt union.txt > answer.txt

Trying to reinvent the wheel in Python when there is eleven-year-old software that has solved this is probably not a good use of your time. Consider looking at genomic interval set operation toolkits.

Alex Reynolds
  • I have just checked BEDOPS and it only has Linux and MacOS distributions, I am using Windows 10 and there are no precompiled packages for Windows. I never succeeded in compiling anything (other than Visual Basic 6.0 way back if that counts) and that is why I learned Python. – Ξένη Γήινος Jul 31 '23 at 10:12
  • @ΞένηΓήινος You could use the Linux version in WSL2 – OneCricketeer Aug 06 '23 at 15:12