A binary search would be an effective way to do this, but you still have to move the data from text (just a bunch of bytes, after all) into some other data structure, such as a list. If your program is short-lived, or memory isn't a long-term constraint, it will (probably) be even faster to just load the whole thing into a Python dict at startup (or whenever is appropriate):
    # This may not work exactly right for your file format, but you get the idea.
    # `f` is assumed to be an already-open file of "value key" lines.
    lookup = {}
    for line in f:
        line = line.strip()
        if line:  # skip blank lines
            value, key = line.split()
            lookup[key] = value
Then you look values up in that dict. Python's built-in dictionaries are nice and fast:
    def get_value(word):
        return lookup.get(word)  # returns None if the word isn't present
EDIT
If your only option is to read the whole file for each word, and you're searching for many words, then the time you save with a clever search algorithm is probably going to be marginal compared to the time you spend opening and reading files over and over. What you really want is a database, which can handle this sort of thing quickly. That said, given these constraints, I'd probably do something like this if I had to use the filesystem:
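    import bisect

    # Define searchable (word, value) tuples for every word in the file.
    # I'm assuming your files are sorted, but if not, sort this list (SLOW!!)
    # `f` is assumed to be an open file of "value key" lines; blank lines are skipped.
    words = [(w[1], w[0]) for w in (line.split() for line in f) if w]

    # Binary search for the word and return its associated value.
    def get_value(word):
        # Tuples compare element-wise, and the 1-tuple (word,) sorts before any
        # (word, value), so this finds the first entry for word. (Using
        # (word, None) can raise TypeError on Python 3 when the word is present.)
        idx = bisect.bisect_left(words, (word,))
        if idx != len(words) and words[idx][0] == word:
            return words[idx][1]
        raise ValueError('word not found')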
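For comparison, here's a rough sketch of what the database route could look like using Python's built-in sqlite3 module. The table name, the column names, and the `build_index` helper are all made up for illustration, and I'm again assuming each line is "value key". It uses an in-memory database so it runs as-is; you'd pass a real file path to keep the index between runs.

```python
import sqlite3

def build_index(lines, db_path=":memory:"):
    """Load 'value key' lines into a SQLite table keyed on the word.

    The table and column names here are hypothetical.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lookup (word TEXT PRIMARY KEY, value TEXT)"
    )
    # Split each line and keep only well-formed "value key" pairs.
    rows = (
        (parts[1], parts[0])
        for parts in (line.split() for line in lines)
        if len(parts) == 2
    )
    conn.executemany(
        "INSERT OR REPLACE INTO lookup (word, value) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn

def get_value(conn, word):
    """Return the value stored for word, or None if it isn't present."""
    row = conn.execute(
        "SELECT value FROM lookup WHERE word = ?", (word,)
    ).fetchone()
    return row[0] if row else None

# Any iterable of lines works, including an open file object.
conn = build_index(["1 apple\n", "2 banana\n", "\n", "3 cherry\n"])
print(get_value(conn, "banana"))  # prints 2
print(get_value(conn, "durian"))  # prints None
```

The PRIMARY KEY on the word column gives you an index, so lookups don't scan the whole table, and you only pay the cost of parsing the text file once.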
Finally, I notice you're using gzipped files, which is sensible if storage space is an issue, but it's going to slow your process down even more. Once again, I have to suggest a database. In any case, I don't know if you're even having trouble here, but just in case: reading gzipped files isn't really any "harder" than reading normal files. Take a look at the gzip module. Basically, gzip files work just like regular files, so you can still write `for line in file` and so on.
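As a minimal sketch of that, here's the dict-loading approach reading straight from a gzipped file. The `load_lookup` helper and the sample file name are made up for illustration, and the "value key" line layout is the same assumption as before. The sample file is written first only so the snippet runs on its own.

```python
import gzip
import os
import tempfile

def load_lookup(path):
    """Build a {key: value} dict from a gzipped file of 'value key' lines."""
    lookup = {}
    # "rt" opens the file in text mode, so iterating yields str lines
    # just like a regular file opened with open().
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:  # skip blank or malformed lines
                value, key = parts
                lookup[key] = value
    return lookup

# Write a tiny gzipped sample file (hypothetical name) to read back.
sample_path = os.path.join(tempfile.mkdtemp(), "words.txt.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write("1 apple\n2 banana\n\n3 cherry\n")

lookup = load_lookup(sample_path)
print(lookup.get("banana"))  # prints 2
```

Note that without the "t" in the mode, gzip.open gives you bytes rather than str, so you'd have to decode each line yourself.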