I have written some code in Python 3.6 to autocomplete a user-entered (partial) word. My goal is to make the autocomplete part run as quickly and efficiently as possible, as a way of developing my programming skills and better understanding Python. I've restricted myself to standard libraries only, because I want to explore what can be achieved under that constraint.
The first part of the code is a function load_wordfile that loads a .txt file and creates a list of words, lowercasing and cleaning them along the way. I've tried to speed this up with a list comprehension and string.punctuation, though it isn't my main focus.
import string
from collections import Counter

def load_wordfile(filename):
    with open(filename, encoding='utf8') as f:
        # List comprehension and string.punctuation for efficiency and speed improvements
        return [word.strip(string.punctuation) for line in f for word in line.lower().split()]
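For illustration, here is how I call it (corpus.txt is just a placeholder filename); a line such as "Hello, world!" contributes ['hello', 'world'] to the returned list:

word_list = load_wordfile('corpus.txt')  # placeholder filename
# the line "Hello, world!" yields ['hello', 'world']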
The word list is then passed to a class AutoCompleter during instantiation. Within this class, a method complete actually carries out the autocompletion. I want the results as a list of tuples, where the first value in each tuple is the matched word and the second value is the relative frequency of the word's occurrence (in the .txt file). The returned list of tuples should be in descending order of frequency. I know Counter is good for this, so I've made use of it.
class AutoCompleter:
    def __init__(self, word_list):
        # Counter takes care of word counts and descending order
        self.counts = Counter(word_list)
        # Total word count for relative frequency calculation
        total = float(len(word_list))
        # Change the counts to frequencies
        for k, v in self.counts.items():
            self.counts[k] = v / total
        # Turn the Counter dict into a list in descending order
        self.counts = self.counts.most_common()

    def complete(self, pattern):
        # Clean the user-entered word
        pattern = pattern.lower().strip(string.punctuation)
        # This is the potential bottleneck
        completions = [pair for pair in self.counts if pair[0].startswith(pattern)]
        return completions
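To make the intended output concrete, here is a toy run (the small word list is purely illustrative):

cmp = AutoCompleter(['dog', 'dog', 'door', 'cat'])
print(cmp.complete('do'))
# [('dog', 0.5), ('door', 0.25)]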
I managed to improve greatly on my first attempts in terms of execution speed, which resulted in the code you see above.
However, I believe complete could still be improved. To demonstrate where I am right now, I benchmarked complete on the NLTK corpus of English words, figuring this is a nice "upper bound" on the size of the list that needs to be searched for matches.
from nltk.corpus import words
# words.words() is a list of over 250,000 unique English words
cmp = AutoCompleter(words.words())
pattern = 'do'
%timeit -n 100 cmp.complete(pattern)
The result from timeit was:
41.6 ms ± 898 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
From my research so far, I think I can perhaps use some sort of binary search, or bisect, to speed up the matching of (partial) words against the user's input. Or perhaps I could structure the data inside complete differently: find the indices of matching words, pull the frequencies out of a separate list, and finally output the two together.
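Something like this is what I have in mind as a rough sketch (prefix_range and sorted_words are names I've invented, and the last-character trick assumes ASCII patterns): if the words are also kept in a separate lexicographically sorted list, all matches for a prefix form one contiguous slice, and bisect can locate both endpoints in O(log n) instead of scanning every entry:

import bisect

def prefix_range(sorted_words, pattern):
    # First index of a word >= pattern
    lo = bisect.bisect_left(sorted_words, pattern)
    # Smallest string that sorts after every word starting with pattern;
    # bumping the last character works for ASCII patterns like these
    upper = pattern[:-1] + chr(ord(pattern[-1]) + 1)
    hi = bisect.bisect_left(sorted_words, upper, lo)
    return sorted_words[lo:hi]

The matched words could then be mapped back to their frequencies through a dict, with a final sort by frequency on just the matches.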
I also have an idea: if a user types "a", then I should only search words that begin with "a" rather than every word in the passed list. However, I'm getting a bit muddled about how best to implement that. I have looked around Stack Overflow too, which helped with other aspects, but I haven't quite managed to put all the pieces together for a final performance boost.
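To sketch what I mean (buckets and complete_bucketed are names I've made up, and this assumes non-empty words and patterns): the (word, frequency) pairs could be grouped by first letter once at construction time, so each lookup only scans one bucket while keeping the descending frequency order:

import string
from collections import defaultdict

buckets = defaultdict(list)
for word, freq in cmp.counts:  # already in descending frequency order
    buckets[word[0]].append((word, freq))

def complete_bucketed(pattern):
    pattern = pattern.lower().strip(string.punctuation)
    # Scan only the bucket for the first letter; order within it is preserved
    return [pair for pair in buckets[pattern[0]] if pair[0].startswith(pattern)]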
Your help and suggestions are much appreciated. Thank you!