
I have a huge .txt.gz file (~800 kB) of words and score values, formatted as "value word" on each line and sorted by word. What would be the fastest way to look up the value of a given word?

I can read a file using:

import gzip

f = gzip.open('somefile.txt.gz', 'rb')
file_content = f.read()  # reads and decompresses the whole file into memory

Would the quickest method be a binary search of some sort?

sample data:

0.090909 chevre#n#1
0.058824 chevron#n#1
0.071429 chevron#n#2
0.071429 chevrotain#n#1
0.166667 chewa#n#1

  • Do you only need to look up the value of one word? If so it may be quickest just to scan for that word as you read the file in - time complexity O(n). Otherwise it will be a race between binary search (O(log n) - about 15 compares needed in your case) and a dict (O(1) it is a hash table). If this is critical, you will need to time the two methods. – Bull May 14 '13 at 23:22
  • I can't think of a way you could do this with a binary search without loading all of the data into another data structure. It really depends on how much you will be looking up. If you don't have to do a lot of lookups then a linear search won't be too bad. If you don't want to load the data into a new data structure like a dictionary, then a linear search is the only feasible option I can think of. – minhaz1 May 14 '13 at 23:31
  • I'd start with a basic function to open the file and read linearly until the word is found. After that, you're going to have to investigate searching through compressed files; e.g., http://stackoverflow.com/questions/429987/compression-formats-with-good-support-for-random-access-within-archives – dan May 14 '13 at 23:35
  • I'm not sure how to read the file line by line. I don't think this is possible because they are .txt.gz files and everything decompresses at once. –  May 14 '13 at 23:40
  • Oh okay, you must be new at this :) The object returned by gzip.open behaves like a regular file object, so repeatedly calling f.readline() (or iterating over it line by line) should work; see the sketch after these comments. – dan May 14 '13 at 23:47
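
A minimal sketch of that line-by-line scan, assuming the file really is laid out as "value word" per line (the file name and target word below are just placeholders):

import gzip

def find_value_linear(path, word):
    # Scan a gzipped "value word" file line by line; return the value for word, or None.
    with gzip.open(path, 'rt') as f:  # 'rt' is Python 3 text mode; use 'rb' and decode on Python 2
        for line in f:
            parts = line.split()
            if len(parts) == 2 and parts[1] == word:
                return parts[0]
    return None

# e.g. find_value_linear('somefile.txt.gz', 'chevron#n#2') would return '0.071429'
# with the sample data above.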

1 Answer


A binary search would be an effective way to do something like this, but you're still going to have to move the data from text (just a bunch of bytes, after all) into some other data structure like a list. If you have a program with a short lifetime, or no long-term memory restrictions, it will (probably) be even faster to just load the whole thing into a Python dict at startup (or whenever is appropriate):

# This may not work exactly right for your file format, but you get the idea.
lookup = {}
for line in f:
    line = line.strip()
    if line:
        value, key = line.split()  # each line is "value word"
        lookup[key] = value

Then you access it using Python's builtin dictionaries, which are nice and fast:

def get_value(word):
    return lookup.get(word)

EDIT

If your only option is to read in the whole file for each word, and you're searching for many words, then the time you save by implementing a clever search algorithm is probably going to be somewhat marginal compared to the time you spend opening and reading files over and over. What you really want is a database, which could actually handle this sort of thing quickly. That said, given these parameters, I'd probably do something like this if I had to use the filesystem:

import bisect

# Build a searchable, sorted list of (word, value) tuples for every line in the file.
# I'm assuming your file is sorted by word; if not, sort this list (SLOW!!).
words = [(w[1], w[0]) for w in (line.strip().split() for line in f) if len(w) == 2]

# Binary search for the word and return its associated value.
def get_value(word):
    idx = bisect.bisect_left(words, (word,))  # tuples compare element-wise; (word,) sorts before (word, value)
    if idx != len(words) and words[idx][0] == word:
        return words[idx][1]
    raise ValueError('word not found')
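
For example, with the sample data from the question loaded this way, lookups would behave like this (values come back as strings, since nothing converts them to float; the second word is one I made up that is not in the sample):

get_value('chevrotain#n#1')   # -> '0.071429'
get_value('chevron#n#9')      # raises ValueError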

Finally, I notice you're using gzipped files, which is sensible if storage space is an issue, but it's going to slow your process down even more. Once again I have to suggest a database. In any case, I don't know if you're even having trouble here, but just in case, reading gzipped files is not really any "harder" than reading normal files. Just take a look at the gzip module. Basically, gzip files work just like regular files, so you can still write for line in file and so on.
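
To make the database suggestion concrete, here is a minimal sketch using Python's built-in sqlite3 module; the database file name, table name, and column names are my own placeholders, not anything from your setup:

import gzip
import sqlite3

# One-time load: copy (word, value) pairs into an indexed table.
conn = sqlite3.connect('scores.db')
conn.execute('CREATE TABLE IF NOT EXISTS scores (word TEXT PRIMARY KEY, value REAL)')
with conn:  # commits the inserts as one transaction
    conn.executemany(
        'INSERT OR REPLACE INTO scores (word, value) VALUES (?, ?)',
        ((w[1], float(w[0])) for w in
         (line.split() for line in gzip.open('somefile.txt.gz', 'rt')) if len(w) == 2))

# After that, each lookup is a single indexed query.
def get_value(word):
    row = conn.execute('SELECT value FROM scores WHERE word = ?', (word,)).fetchone()
    return row[0] if row else None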

Henry Keiter
  • I have hundreds of thousands of these large text files. (Each one links a word to every other word and gives a score of relatedness.) It is not possible to load them all into Python dicts. I am looking for the fastest way to find the correct line containing the word so I can read its score value. I would have thought a binary search would be better than going through line by line, so I am looking for a method for that. But I'm not sure if bisect can be used, because the value precedes the sorted word on each line. –  May 14 '13 at 23:20
  • It sounds like you have a large amount of data. Would you consider loading it into a database engine such as MySQL? – Miles May 14 '13 at 23:30
  • @user759885 So there is 80GB of data, not just 800kB? If you want a good answer you will need to provide more information about what you are trying to do. – Bull May 14 '13 at 23:36
  • @user759885 I've edited my answer to include my thoughts on your situation and an example implementation, but Miles basically beat me to the punch: you're going to be spending a ton of time opening and reading files this way; what you really want is a database to query. – Henry Keiter May 14 '13 at 23:40
  • 100 GB of data in total across many text files. Each file is named after a word. I already have that word and that file. I want to then find the relatedness score of that word with a second word, which is stored in the file I have decompressed. So the problem is to search that one file for the second word. Each file is 800 kB. –  May 14 '13 at 23:44
  • I will not be opening many files each time. I just have many files available because it is a large dataset. –  May 14 '13 at 23:46
  • @user759885 You don't have to open many files each time for it to be slow. Reading a single 800k file into memory is slow, especially if you have to unzip it first. – Henry Keiter May 14 '13 at 23:47
  • I don't have the hard disk space to index and decompress into SQL; it is already 100 GB+ compressed. But it is quite fast to find the file I want because it is just in a folder by letter and then named by word. I then decompress the one I want and then need to find 4 or 5 words in that file and get their values (see the sketch after these comments). The alternative might be 100 thousand tables in MySQL. –  May 14 '13 at 23:51
  • @user759885 Have you read my edits? I know what you're trying to do and I've already explained how to do it. I'm just warning you that it's going to be slow. 800k is a lot of data to physically read from the disk whenever you need to change what file you're looking at, and decompressing takes time as well. – Henry Keiter May 14 '13 at 23:56
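
Putting the pieces together for the workflow described above (decompress one gzipped file, then look up 4 or 5 words in it), a rough sketch might look like the following; the file path and word list are placeholders, and it assumes each line really is "value word" sorted by word:

import bisect
import gzip

def load_pairs(path):
    # Read the gzipped "value word" file once into a sorted list of (word, value) tuples.
    with gzip.open(path, 'rt') as f:  # 'rt' is Python 3 text mode
        pairs = [(w[1], w[0]) for w in (line.split() for line in f) if len(w) == 2]
    pairs.sort()  # cheap if the file is already sorted by word
    return pairs

def get_value(pairs, word):
    # Binary search over the sorted (word, value) list.
    idx = bisect.bisect_left(pairs, (word,))
    if idx != len(pairs) and pairs[idx][0] == word:
        return pairs[idx][1]
    return None

# Hypothetical usage: one file, a handful of lookups.
pairs = load_pairs('c/chevre.txt.gz')
for w in ['chevre#n#1', 'chevron#n#2', 'chewa#n#1']:
    print(w, get_value(pairs, w))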