I have a file with a large number of lines (0.5-1.5 million), each of which is a filename (about 50-100 characters long). What I need is a fast search through these lines for a given query. Now my code looks like this:
import re

def similarity(haystack, needle):
    words = re.findall(r'\w+', haystack.lower())  # replacing this with a split on separators reduces the time by about 4 seconds
    for word in words:
        if word == needle:
            return 10
    for word in words:
        if word.startswith(needle):
            return 10 ** (len(needle) / len(word))
    if needle in haystack:
        return 1
    return 0

def search(text):
    text = text.lower()
    scored = [(similarity(x, text), x) for x in lines]  # `lines` is the module-level list of filenames
    return [x[1] for x in sorted(scored, reverse=True)[:15]]
It takes about 15 seconds on an example file on my PC (almost all of that time is spent in the similarity() function), and I want it to run almost immediately, within a couple of seconds. How can this be done?
I think that indexing may help, but I have no clear idea of what its structure could look like. And, if possible, I want the search to be "more fuzzy" - e.g. with N-grams or something like that. But the main concern right now is speed.
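For example, would something like the following be a reasonable direction? This is only a rough sketch (build_index and find_exact are made-up names, and it assumes the whole file fits in memory): a map from each word to the numbers of the lines that contain it, built once and reused for every query.

import re
from collections import defaultdict

def build_index(lines):
    # Inverted index: word -> set of line numbers containing that word.
    index = defaultdict(set)
    for i, line in enumerate(lines):
        for word in re.findall(r'\w+', line.lower()):
            index[word].add(i)
    return index

def find_exact(index, lines, needle):
    # An exact word match becomes a single dict lookup instead of a scan over all lines.
    return [lines[i] for i in index.get(needle.lower(), ())]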
UPD:
The same lines are searched through multiple times.
needle is always a single word.
"More fuzzy" means that it should find lines even if needle is a bit mistyped.
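To illustrate the kind of fuzziness I mean, something like character trigram overlap might work (again only a sketch; the padding scheme and the 0.4 threshold are arbitrary choices of mine): a word with one mistyped letter still shares most of its trigrams with the correctly spelled word.

def trigrams(word):
    # Pad the word so its first and last characters also contribute trigrams.
    padded = '  ' + word.lower() + ' '
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def fuzzy_match(a, b, threshold=0.4):
    # Jaccard similarity of the trigram sets; a small typo keeps most trigrams intact.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) >= threshold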