Bisection search in the sorted lines of an opened file (not loaded in memory)

Question

I have a text file of multiple gigabytes, and the millions of lines are sorted:

aaaaa
bcijkjf
dfsdf
gdfgdfhqiuzhf
zzdiszfhj

How is it possible, without loading the whole file in memory, to search if a line is existent, with a bisection search? (Possibly in O(log n) in the number of lines)

Is there a function like bisect.bisect_left among the lines of a file f = open('file.txt', 'r') in the Python library?

The window would initially be [a, b] = [0, file_size]. Then it would seek in the file at position m=(a+b)/2, look for the next \n, and read the following line l. If the pattern to search is smaller or greater than l (with lexicographic order), then we continue on [m, b] or [a, m]. Before rolling my own, does this exist in Python?

@timgeb I don't exactly understand your last comment "surprised if there is ... that wouldn't ...", too complex logically for me ;) because I'm not a native english speaker. Do you think that *there is* better way than bisection, or that there *isn't*? — Basj, Feb 22 '22 at 13:49
Do you think if you can collect all line offsets using `mmap` and then seek to the line based on the algorithm to check the line? It might not be straight forward use of `bisect` module any way, but still. — Kris, Feb 22 '22 at 14:04

score 3 · Accepted Answer · edited Feb 22 '22 at 15:45

You can use the mmap built-in module. It provides random access to files (i.e., a file behaves like a large array of bytes stored in the file system). You can find more info here.

import mmap

def bisect_search(file_path, line):
    line = line.encode()
    with open(file_path, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 0)
        lo = 0
        hi = mm.size()
        while lo < hi:
            mid = (lo + hi) // 2
            left_endl_idx = mm.rfind(b'\n', lo, mid)
            right_endl_idx = mm.find(b'\n', mid, hi)
            if left_endl_idx == -1:
                left_endl_idx = lo - 1
            if right_endl_idx == -1:
                right_endl_idx = hi
            mid_line = mm[left_endl_idx + 1: right_endl_idx]
            if mid_line == line:
                return True
            if mid_line < line:
                lo = right_endl_idx + 1
            else:
                hi = left_endl_idx
    return False

The function returns True if line exists within the file, False otherwise. Let's use the following myfile.txt file to run a few examples:

aaaaa
bcijkjf
dfsdf
gdfgdfhqiuzhf
zzdiszfhj

>>> bisect_search('myfile.txt', 'hello')
False
>>> bisect_search('myfile.txt', 'aaaaa')
True
>>> bisect_search('myfile.txt', 'aaaa')
False
>>> bisect_search('myfile.txt', 'dfsdf')
True
>>> bisect_search('myfile.txt', 'zzdiszfhjj')
False

This function should be much faster than a linear search on a big file.

Note: this code works with \n endings, and not currently with \r\n Windows-style endings (not necessary for OP).

In my case (I want to do thousands of queries), I modified your code to do the `open(...) as f` and `mm = mmap...` only once, and then I pass `mm` as argument of `bisect_search` instead of passing the filename. Thanks again! — Basj, Feb 22 '22 at 15:41
@Basj Just want to mention that the complexity of this function is still linear in the worst case (i.e., if the file contains only one line). And of course you can't improve it: the input line and the line in the file have to be compared character by character. But this should not be your case :) So the complexity is linear in the number of characters (worst case, not the average one), and log(n) in the number of lines — Riccardo Bucco, Feb 22 '22 at 15:41
O(Log(n)) in the number of lines is what I was looking for, perfect! — Basj, Feb 22 '22 at 15:44
Note: it is even faster than another solution I tried with Sqlite: https://stackoverflow.com/a/71223300/1422096 — Basj, Feb 22 '22 at 18:53

Bisection search in the sorted lines of an opened file (not loaded in memory)

1 Answers1

Linked