
I have a file whose lines are sorted alphabetically by their first word. The file is 12 GB, so scanning it line by line for every query is not practical. The data looks like this:

brown    0    1    0    1    2
fox      3    5    0    0    1
jumped   2    0    6    1    0

The words at the beginning of each line are unique. The word and the numbers on each line are separated by tabs. I want to be able to query the file for specific keywords. For example, if I query "fox", the program should return "fox 3 5 0 0 1".

It seems that a good candidate for this would be the bisect module: https://docs.python.org/3.0/library/bisect.html

I found a post that uses bisect to find the line number of a keyword: How do I perform binary search on a text file to search a keyword in python?

This is what the code looks like:

import bisect
import os

class Query(object):
    def __init__(self, query, index=5):
        self.query = query
        self.index = index

    def __lt__(self, comparable):
        # compare the query against the first `index` characters of a record
        return self.query < comparable[:self.index]

class FileSearcher(object):
    def __init__(self, file_pointer, record_size=35):
        self.file_pointer = file_pointer
        self.file_pointer.seek(0, os.SEEK_END)
        self.record_size = record_size + len(os.linesep)
        self.num_bytes = self.file_pointer.tell()
        self.file_size = (self.num_bytes // self.record_size)

    def __len__(self):
        return self.file_size

    def __getitem__(self, item):
        self.file_pointer.seek(item * self.record_size)
        return self.file_pointer.read(self.record_size)

with open('myfile') as file_to_search:
    query = 'fox\t' #token to query
    wrapped_query = Query(query)
    searchable_file = FileSearcher(file_to_search)
    linepos = bisect.bisect(searchable_file, wrapped_query)
    print "Located @ line: ", linepos
    #print content of line?

However, I can't figure out how to actually print the content of the line. I should at least add a read statement somewhere, but I don't know where.

Is it possible to print the content of the line with the bisect module?

kormak
    I don't think the referenced approach using `bisect` will work for your data file because it has variable length records (based on the information you've provided in your question). If the fields were separated by spaces and defined to always occupy certain fixed character positions in each record, then it would work. – martineau Dec 01 '14 at 19:43
  • 1
  • you have a 12 GB file composed of lines all starting with different words? – njzk2 Dec 01 '14 at 19:44
  • from your code, doesn't `searchable_file[linepos]` print the searched line? – njzk2 Dec 01 '14 at 19:47
  • (that's ~42 million different words, which is quite a lot) – njzk2 Dec 01 '14 at 19:48
  • Yes, as @martineau says, it doesn't work. – BartoszKP Dec 01 '14 at 19:49
  • Also, your records appear to be 30 chars long, not 35 (at least in your example). – njzk2 Dec 01 '14 at 19:58
  • @njzk2 Yes, each line starts with a different word. From what I can tell, searchable_file[linepos] only prints the very beginning of the line. – kormak Dec 01 '14 at 20:06
  • @kormak: `searchable_file[linepos]` calls `__getitem__`, so it should print the complete line. However, because of how bisect works, it is likely to print the next line (bisect gives you the insert position to keep the order; often it follows the matching line). If this does not work, elaborate on `searchable_file[linepos] only prints the very beginning of the line.` What does it print exactly, in the case of your 3-line example? – njzk2 Dec 01 '14 at 20:09
  • With the example I gave in my question, searchable_file[linepos] returns the following: "Located @ line: 2  0 6 1 0". So it gets that 'fox' is on line 2, but it prints the end of line 3... – kormak Dec 01 '14 at 20:14
  • @kormak It gets the correct line number by accident. Also I even doubt it's correct, as it's probably 0-based. – BartoszKP Dec 01 '14 at 20:16
  • So I'm guessing the consensus is that bisect cannot be used on data of variable length? Any other alternatives? – kormak Dec 01 '14 at 20:19
  • @kormak If you could build an index (map: record no. -> record position in the file) then you could use bisect on that index. – BartoszKP Dec 01 '14 at 20:22
  • @BartoszKP What would be the difference between the record number and the record position? I'm a bit confused. – kormak Dec 01 '14 at 20:30
  • @kormak Sorry, I've introduced unneeded terms: record number = line number, record position = record's position in the file (in bytes). – BartoszKP Dec 01 '14 at 23:21
  • @kormak: The issue with building an index that maps record number to file offset is that it will require you to read the entire file in order to build it. Afterwards it will be possible to use `bisect` to locate the offset of the beginning of each individual record very quickly. The end of each record will be the newline character each has at its end. Because of the overhead of reading the whole file and building the index, this approach would only be feasible if the file doesn't change, or doesn't change very often (because you can save the index in a separate file and reuse it). – martineau Dec 02 '14 at 00:43
  • @kormak: Another alternative would be to change the format of the information file so that it had fixed-size records. However this would likely make it even longer (or you may have no control over its format). It would also be possible (and easy) to build the index as the file was created, which would allow you to keep its current variable-length record format. – martineau Dec 02 '14 at 00:51
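
A minimal sketch of the index idea from the two comments above — one full pass over the file records each leading word and its byte offset, after which every lookup is a bisect plus a single seek. The names build_index and lookup and the .idx filename are illustrative, not from the thread:

import bisect
import pickle

def build_index(filename):
    # One full pass: remember each line's first word and its byte offset.
    keys, offsets = [], []
    with open(filename, 'rb') as f:
        pos = 0
        for line in f:
            keys.append(line.split(b'\t', 1)[0].decode())
            offsets.append(pos)
            pos += len(line)   # offsets stay exact because we count raw bytes
    return keys, offsets

def lookup(filename, keys, offsets, word):
    # The keys are sorted (the file is), so bisect_left finds the candidate.
    i = bisect.bisect_left(keys, word)
    if i == len(keys) or keys[i] != word:
        return None
    with open(filename, 'rb') as f:
        f.seek(offsets[i])
        return f.readline().decode().rstrip('\n')

# Build once, pickle to a side file, reuse on later runs (per the comments).
keys, offsets = build_index('myfile')
with open('myfile.idx', 'wb') as f:
    pickle.dump((keys, offsets), f)
print(lookup('myfile', keys, offsets, 'fox'))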

3 Answers


If you want to go with a Python solution, you can do the following:

  • Read the file in small chunks of MAX_LINE bytes, each time moving forward by a fixed offset
  • That offset determines the block size
  • For each such read, determine the key (the first word of a line)
  • These keys serve as delimiters of blocks
  • Construct the list of such keys. The list will be sorted, since the keys are ordered
  • You may persist such a list somewhere via pickle/json.dumps/... (see the sketch after the code below)
  • When querying, use bisect to find the index of the block where your key is located
  • Read that block entirely and find the key with its data

Here is the example file bigfile:

abc 4
bar 2
baz 3
egg 6
foo 1
god 8
ham 5
sex 7

The code:

import os
from bisect import bisect

MAX_LINE = 7    # upper bound on line length in bytes; longer lines break key parsing
BLOCK_SIZE = 10

def parse_chunks(filename):

    size = os.path.getsize(filename)
    chunks = []

    with open(filename, 'rb') as file:
        # the key of the first block is simply the first word of the file
        block = file.read(MAX_LINE * 2).decode()
        first_line = block[:block.find('\n') + 1]
        chunks.append(first_line.split()[0])

        pos = BLOCK_SIZE
        while pos < size:
            file.seek(pos)
            block = file.read(MAX_LINE * 2).decode()
            # we probably landed mid-line, so the block's key is the first
            # word of the first *complete* line after the offset
            first_eol = block.find('\n')
            second_eol = block.find('\n', first_eol + 1)
            if first_eol == -1 or second_eol == -1:
                break

            line = block[first_eol + 1:second_eol]

            key = line.split()[0]
            chunks.append(key)

            pos += BLOCK_SIZE

    return chunks


if __name__ == '__main__':
    filename = 'bigfile'
    chunks = parse_chunks(filename)

    query = 'abc'
    # bisect returns the insertion point, so the block that can contain
    # the query starts one index earlier
    pos_before = bisect(chunks, query) - 1

    with open(filename, 'rb') as file:
        file.seek(pos_before * BLOCK_SIZE)
        block = file.read(BLOCK_SIZE + MAX_LINE).decode()
        line_start = block.find(query)
        line_end = block.find('\n', line_start + 1)
        line = block[line_start:line_end]

        print(line)

In this toy example I use a block size of 10 bytes; in your case of a 12 GB file, I'd suggest starting with 1 MB.
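
The key list can also be persisted and reused between runs, as the pickle bullet above suggests — a minimal sketch that relies on parse_chunks from the code above (load_or_build_chunks and the .idx filename are illustrative):

import os
import pickle

def load_or_build_chunks(filename, index_file='bigfile.idx'):
    # Reuse a previously pickled key list if it is newer than the data file;
    # otherwise rebuild it with parse_chunks() and save it for next time.
    if (os.path.exists(index_file)
            and os.path.getmtime(index_file) >= os.path.getmtime(filename)):
        with open(index_file, 'rb') as f:
            return pickle.load(f)
    chunks = parse_chunks(filename)
    with open(index_file, 'wb') as f:
        pickle.dump(chunks, f)
    return chunks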

Anton Zuenko
  • I tried your code on the actual data using different values for BLOCK_SIZE, but I always get an IOError: [Errno 22] Invalid argument on the following line: file.seek(pos_before*BLOCK_SIZE) – kormak Dec 01 '14 at 21:27
  • Try again, I've fixed a bug when querying from first block – Anton Zuenko Dec 01 '14 at 21:44
  • Now it returns the error IndexError: list index out of range on the following line: chunks.append(first_line.split()[0]) – kormak Dec 02 '14 at 15:09

The following recursive function should be able to narrow the search interval. I'm not sure that you can modify it so that it returns a match or None for no match.

def bisearch(f, word, i, j):
    # interval is small enough: hand it back to the caller for a linear scan
    if (j - i) < 1E6:
        return i, j

    k = (i + j) // 2
    f.seek(k)

    # advance to the start of the next line
    while k < j:
        c = f.read(1)
        k = k + 1
        if c == '\n':
            break
    else:
        # no newline between the midpoint and j: let the caller scan the interval
        return i, j

    # read the first word of the line we landed on
    w = []
    while 1:
        c = f.read(1)
        if not c or c == '\t':  # the tab after the word (or EOF) ends it
            break
        w.append(c)

    w = "".join(w)
    if w == word:
         return k, k
    if w < word:
         return bisearch(f, word, k, j)
    else:
         return bisearch(f, word, i, k)

and here is an example of usage:

word = ...
f = open(...)

i, j = bisearch(f, word, 0, len_f)

f.seek(i)
if i == j:
    # exact match: i is the start of the matching line
    line = f.readline()
else:
#################### EDIT ################
#   OLD
#   buffer = f.read(1E6)
#   NEW
    buffer = f.read(j - i)
    lenw = len(word)
    for line in buffer.split('\n'):
        if line[:lenw] == word:
            break
    else:
        raise ValueError("no matches, SOS")

result = process(line)
gboffi
  • Your code returned TypeError: integer argument expected, got float on the line buffer = f.read(1E6). I solved it with buffer = f.read(int(1E6)). The main problem is how to define len_f. To test your code on my fox example, I used len_f = sum(1 for line in f), and it worked fine. But on the 12 GB file it would probably take hours to compute the length of the file in that way. – kormak Dec 02 '14 at 15:20
  • (1) sorry for the type error, I didn't test my code... (2) [The OS knows the length of the file](http://stackoverflow.com/a/2104107/2749397)... see the linked answer for the solution (3) Does it really work? – gboffi Dec 03 '14 at 11:36
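
As the linked answer points out, the file length can come straight from the OS instead of counting lines — a minimal sketch, assuming the question's myfile and the bisearch function above:

import os

len_f = os.path.getsize('myfile')   # size in bytes, no need to read the file

with open('myfile') as f:
    i, j = bisearch(f, 'fox', 0, len_f)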

Try seeking to the line in question and using readline.

print "Located @ line: ", linepos
file_to_search.seek(linepos)
line = file_to_search.readline()

This is assuming linepos is the position of the line, counted in bytes from the beginning of the file. If it's the position counted in line numbers, you'll need to multiply by the number of bytes per line before seeking.

print "Located @ line: ", linepos
file_to_search.seek(linepos * searchable_file.record_size)
line = file_to_search.readline()
Kevin