4

I have a file with a large number of lines (0.5-1.5 million), each of which is a filename (about 50-100 characters long). What I need is a fast search through these lines for a given query. Now my code looks like this:

import re

def similarity(haystack, needle):
    words = re.findall(r'\w+', haystack.lower()) # replacing by split with separators reduces time by about 4 seconds

    for word in words:
        if word == needle:
            return 10

    for word in words:
        if word.startswith(needle):
            return 10 ** (len(needle) / len(word))

    if needle in haystack:
        return 1

    return 0

def search(text):
    text = text.lower()
    scored = [(similarity(line, text), line) for line in lines]  # lines is the global list of filenames
    return [x[1] for x in sorted(scored, reverse=True)[:15]]

It takes about 15 seconds to run on an example file on my PC (almost all of the time is spent in the similarity() function), and I want it to run almost immediately, within a couple of seconds. How can this be done?

I think that indexing may help, but I have no idea about its possible structure. And, if possible, I want the search to be "more fuzzy" - e.g. with N-grams or something like that. But the main concern now is speed.
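
For illustration, one possible index structure is a plain word-to-lines mapping, roughly sketched below (build_index is just an illustrative name, and lines is assumed to already hold the filenames in memory):

import re
from collections import defaultdict

def build_index(lines):
    """Map each lowercased word to the set of line numbers that contain it."""
    index = defaultdict(set)
    for i, line in enumerate(lines):
        for word in re.findall(r'\w+', line.lower()):
            index[word].add(i)
    return index

# Built once, an exact single-word query becomes a dictionary lookup:
# index = build_index(lines)
# hits = [lines[i] for i in index.get(needle.lower(), ())]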

UPD:

The same lines are searched through multiple times.

needle is always a single word.

"More fuzzy" means that it should find lines even if needle is a bit mistyped.

aplavin
  • I'd suggest not reinventing the wheel and instead using a dedicated full-text search engine, like [Sphinx](http://sphinxsearch.com/). – georg Feb 28 '12 at 08:54

2 Answers

5
  1. This line does nothing:

    10 ** (len(t) / len(word))

  2. You need better variable names; as of now it's unclear what "s" and "t" are. Single-letter variable names are acceptable only in maths and as loop variables. Is s what you are searching for, or is t? The function as it is used now doesn't make much sense to me.

  3. Since you only act on the first match of anything you search for, splitting is pointless in some cases, so you could probably move the split to be done last; but that depends on what you are actually searching for, which is unclear (see 2).

Update: To really get the best performance out of this, you'll need to profile and test, and then profile and test again. But I'd propose this as a first start (a small profiling sketch follows the code):

def similarity(haystack, needle):
    haystack = haystack.lower()

    # cheap containment check first; most lines are rejected here without any splitting
    if needle not in haystack:
        return 0

    words = haystack.split()

    if needle in words:
        return 10

    for word in words:
        if word.startswith(needle):
            return 10 ** (len(needle) / len(word))

    return 1
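
To see where the time actually goes, a small wiring and profiling sketch (lines is assumed to be the list of filenames, and the query string is only a placeholder):

import cProfile

def search(text):
    text = text.lower()
    scored = [(similarity(line, text), line) for line in lines]
    return [x[1] for x in sorted(scored, reverse=True)[:15]]

cProfile.run('search("somequery")', sort='cumulative')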
Lennart Regebro
  • 1. Of course, there is `return` before this. 2. Ok, changed the names to more meaningful ones. 3. A line is unlikely to contain more than one occurrence of `needle`. – aplavin Feb 28 '12 at 08:42
  • It's a rather obvious optimization, but it really helped =) Thanks, now it executes in about 2-3 seconds. By the way, isn't there a simple way to make the search 'more fuzzy'? – aplavin Feb 28 '12 at 09:15
  • @chersanya: Simple, no. More fuzziness than this would require looking for parts of the search string etc. That would be best done with stemming dictionaries, etc, and then you are into full-text search engine mode. In fact, there *is* a simple way to do that: Use a full-text search engine. ;-) Writing one, however, is not simple. – Lennart Regebro Feb 28 '12 at 09:31
  • And what engine would you advise using? Most of them, as I understand, are for searching for a file which contains some pattern, not for searching for a line in a single file. – aplavin Feb 28 '12 at 12:00
0

Since you are searching the same file for strings repeatedly, a persistent dictionary can speed up your search.

Taking your logic into consideration, you can use something like this.

import shelve

PERSISTENT_DICT_FILENAME = "my_persistent_dict"

def create_a_persistent_dict(haystack_filename):
    pd = shelve.open(PERSISTENT_DICT_FILENAME)
    f = open(haystack_filename)
    for filename in f:
        filename = filename.strip().lower()  # drop the trailing newline before indexing
        filename_len = len(filename)
        for i in range(1, filename_len + 1):  # include the full name so an exact match scores 10 ** 1 == 10
            partial_filename = filename[:i]
            calculation = 10 ** ((len(partial_filename) * 1.0) / filename_len)
            # keep the highest score seen for this prefix across all filenames
            if partial_filename in pd:
                if calculation > pd[partial_filename]:
                    pd[partial_filename] = calculation
            else:
                pd[partial_filename] = calculation
    f.close()
    pd.close()

def search_string(needle):
    needle = needle.lower()
    pd = shelve.open(PERSISTENT_DICT_FILENAME)
    if needle in pd:
        return_val = pd[needle]
    else:
        return_val = 0
    pd.close()
    return return_val

if __name__ == "__main__":
    #create_a_persistent_dict("a_large_file.txt")
    needle = raw_input("Enter the string to search: ")
    print search_string(needle)

Explanation:

create_a_persistent_dict(haystack_filename)

Will create a persistent dictionary by reading the large file once. Each key is a prefix of a line found in the file (for example, if a line in the file is "World.txt", the keys would be "w", "wo", "wor", "worl" ... etc.), and the value is the calculation (10 ** ...) for each key.
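
For example, for the 9-character name "world.txt", the prefix "wor" is stored with the value 10 ** (3 * 1.0 / 9) ≈ 2.15, and the full name with 10 ** 1 = 10, which mirrors the exact-match score of 10 in the question.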

This is a one-time expensive operation, but the idea is to speed up the searches that follow.

search_string(needle)

The function will look the string up in the persistent dictionary and give you the calculation based on your logic. It will be faster than iterating over every line each time.

pytroy
  • I've tried building an inverted index, not for every substring but only for separate words. It took about 80 MB (uncompressed). I'm worried about the size of the index you suggested... – aplavin Feb 28 '12 at 14:27