
I have a huge .txt.gz file (~800 kB) of words and score values, formatted as "value word" on each line and sorted by word. What would be the fastest way to look up the value of a given word?

I can read a file using:

import gzip

f = gzip.open('somefile.txt.gz', 'rb')
file_content = f.read()  # reads and decompresses the whole file into memory

Would the quickest method be a binary search of some sort?

sample data:

0.090909 chevre#n#1
0.058824 chevron#n#1
0.071429 chevron#n#2
0.071429 chevrotain#n#1
0.166667 chewa#n#1

  • Do you only need to look up the value of one word? If so it may be quickest just to scan for that word as you read the file in - time complexity O(n). Otherwise it will be a race between binary search (O(log n) - about 15 compares needed in your case) and a dict (O(1) it is a hash table). If this is critical, you will need to time the two methods. – Bull May 14 '13 at 23:22
  • I can't think of a way you could do this with a binary search without loading all of the data into another data structure. It really depends on how much you will be looking up. If you don't have to do a lot of lookups then a linear search won't be too bad. If you don't want to load the data into a new data structure like a dictionary, then a linear search is the only feasible option I can think of. – minhaz1 May 14 '13 at 23:31
  • I'd start with a basic function to open the file and read linearly until the word is found. After that, you're going to have to investigate searching through compressed files; e.g., http://stackoverflow.com/questions/429987/compression-formats-with-good-support-for-random-access-within-archives – dan May 14 '13 at 23:35
  • I'm not sure how to read the file line by line. I don't think this is possible because they are .txt.gz files and everything decompresses at once. –  May 14 '13 at 23:40
  • Oh okay, you must be new at this :) The object returned by gzip.open behaves like a regular file object, so repeatedly calling f.readline() (or iterating over it line by line) should work; see the sketch after these comments. – dan May 14 '13 at 23:47
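
A minimal sketch of that line-by-line scan, assuming the file really is laid out as "value word" per line (the file name and target word below are just placeholders):

import gzip

def find_value_linear(path, word):
    # Scan a gzipped "value word" file line by line; return the value for word, or None.
    with gzip.open(path, 'rt') as f:  # 'rt' is Python 3 text mode; use 'rb' and decode on Python 2
        for line in f:
            parts = line.split()
            if len(parts) == 2 and parts[1] == word:
                return parts[0]
    return None

# e.g. find_value_linear('somefile.txt.gz', 'chevron#n#2') would return '0.071429'
# with the sample data above.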

1 Answer


A binary search would be an effective way to do something like this, but you're still going to have to move the data from text (just a bunch of bytes, after all) into some other data structure like a list. If you have a program with a short lifetime, or no long-term memory restrictions, it will (probably) be even faster to just load the whole thing into a Python dict at startup (or whenever is appropriate):

# This may not work exactly right for your file format, but you get the idea.
lookup = {}
for line in f:
    line = line.strip()
    if line:
        value, key = line.split()  # each line is "value word"
        lookup[key] = value

Then you access it using Python's builtin dictionaries, which are nice and fast:

def get_value(word):
    return lookup.get(word)

EDIT

If your only option is to read in the whole file for each word, and you're searching for many words, then the time you save by implementing a clever search algorithm is probably going to be somewhat marginal compared to the time you spend opening and reading files over and over. What you really want is a database, which could actually handle this sort of thing quickly. That said, given these parameters, I'd probably do something like this if I had to use the filesystem:

import bisect

# Build a searchable, sorted list of (word, value) tuples for every line in the file.
# I'm assuming your file is sorted by word; if not, sort this list (SLOW!!).
words = [(w[1], w[0]) for w in (line.strip().split() for line in f) if len(w) == 2]

# Binary search for the word and return its associated value.
def get_value(word):
    idx = bisect.bisect_left(words, (word,))  # tuples compare element-wise; (word,) sorts before (word, value)
    if idx != len(words) and words[idx][0] == word:
        return words[idx][1]
    raise ValueError('word not found')
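
For example, with the sample data from the question loaded this way, lookups would behave like this (values come back as strings, since nothing converts them to float; the second word is one I made up that is not in the sample):

get_value('chevrotain#n#1')   # -> '0.071429'
get_value('chevron#n#9')      # raises ValueError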

Finally, I notice you're using gzipped files, which is sensible if storage space is an issue, but it's going to slow your process down even more. Once again I have to suggest a database. In any case, I don't know if you're even having trouble here, but just in case, reading gzipped files is not really any "harder" than reading normal files. Just take a look at the gzip module. Basically, gzip files work just like regular files, so you can still write for line in file and so on.
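
To make the database suggestion concrete, here is a minimal sketch using Python's built-in sqlite3 module; the database file name, table name, and column names are my own placeholders, not anything from your setup:

import gzip
import sqlite3

# One-time load: copy (word, value) pairs into an indexed table.
conn = sqlite3.connect('scores.db')
conn.execute('CREATE TABLE IF NOT EXISTS scores (word TEXT PRIMARY KEY, value REAL)')
with conn:  # commits the inserts as one transaction
    conn.executemany(
        'INSERT OR REPLACE INTO scores (word, value) VALUES (?, ?)',
        ((w[1], float(w[0])) for w in
         (line.split() for line in gzip.open('somefile.txt.gz', 'rt')) if len(w) == 2))

# After that, each lookup is a single indexed query.
def get_value(word):
    row = conn.execute('SELECT value FROM scores WHERE word = ?', (word,)).fetchone()
    return row[0] if row else None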

Henry Keiter
  • I have hundreds of thousands of these large text files. (Each one links a word to every other word and gives a score of relatedness.) It is not possible to load them all into Python dicts. I am looking for the fastest way to find the correct line containing the word so I can read its score value. I would have thought a binary search would be better than going through line by line, so I am looking for a method for that. But I'm not sure if bisect can be used, because the value precedes the sorted word on each line. –  May 14 '13 at 23:20
  • It sounds like you have a large amount of data. Would you consider loading it into a database engine such as MySQL? – Miles May 14 '13 at 23:30
  • @user759885 So there is 80GB of data, not just 800kB? If you want a good answer you will need to provide more information about what you are trying to do. – Bull May 14 '13 at 23:36
  • @user759885 I've edited my answer to include my thoughts on your situation and an example implementation, but Miles basically beat me to the punch: you're going to be spending a ton of time opening and reading files this way; what you really want is a database to query. – Henry Keiter May 14 '13 at 23:40
  • 100 GB of data in total across many text files. Each file is named after a word. I already have that word and that file. I want to then find the relatedness score of that word with a second word, which is stored in the file I have decompressed. So the problem is to search that one file for the second word. Each file is 800 kB. –  May 14 '13 at 23:44
  • I will not be opening many files each time. I just have many files available because it is a large dataset. –  May 14 '13 at 23:46
  • @user759885 You don't have to open many files each time for it to be slow. Reading a single 800k file into memory is slow, especially if you have to unzip it first. – Henry Keiter May 14 '13 at 23:47
  • I don't have the hard disk space to index and decompress into SQL; it is already 100 GB+ compressed. But it is quite fast to find the file I want because it is just in a folder by letter and then named by word. I then decompress the one I want and then need to find 4 or 5 words in that file and get their values (see the sketch after these comments). The alternative might be 100 thousand tables in MySQL. –  May 14 '13 at 23:51
  • @user759885 Have you read my edits? I know what you're trying to do and I've already explained how to do it. I'm just warning you that it's going to be slow. 800k is a lot of data to physically read from the disk whenever you need to change what file you're looking at, and decompressing takes time as well. – Henry Keiter May 14 '13 at 23:56
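
Putting the pieces together for the workflow described above (decompress one gzipped file, then look up 4 or 5 words in it), a rough sketch might look like the following; the file path and word list are placeholders, and it assumes each line really is "value word" sorted by word:

import bisect
import gzip

def load_pairs(path):
    # Read the gzipped "value word" file once into a sorted list of (word, value) tuples.
    with gzip.open(path, 'rt') as f:  # 'rt' is Python 3 text mode
        pairs = [(w[1], w[0]) for w in (line.split() for line in f) if len(w) == 2]
    pairs.sort()  # cheap if the file is already sorted by word
    return pairs

def get_value(pairs, word):
    # Binary search over the sorted (word, value) list.
    idx = bisect.bisect_left(pairs, (word,))
    if idx != len(pairs) and pairs[idx][0] == word:
        return pairs[idx][1]
    return None

# Hypothetical usage: one file, a handful of lookups.
pairs = load_pairs('c/chevre.txt.gz')
for w in ['chevre#n#1', 'chevron#n#2', 'chewa#n#1']:
    print(w, get_value(pairs, w))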