0

I have written some code in Python 3.6 to autocomplete a user-entered (partial) word. My goal is to make the autocomplete part run as quickly/efficiently as possible, as a way of developing my programming skills and better understanding Python. I've restricted myself to using standard libraries only because I want to explore what can be achieved using such a constaint.

The first part of the code is a function def load_wordfile, that loads a .txt and creates a list of words, lowercasing and cleaning the words along the way. I've tried to speed this up with list comprehension and string.punctuation though it isn't my main focus.

import string
from collections import Counter

def load_wordfile(filename):

    with open(filename, encoding='utf8') as f:

        # List comprehension and string.punctuation for efficiency and speed improvements
        return [word.strip(string.punctuation) for line in f for word in line.lower().split()]

The word list is then passed to a class AutoCompleter during instantiation. Within this class a method complete actually carries out the autocompletion. I want the results of autocompletion as a list of tuples, where the first value in the tuple is the matched word, and the second value is the relative frequency of the word occurence (in the .txt). The returned list of tuples should be in descending order of frequency. I know Counter is good for this so I've made use of it.

class AutoCompleter:

    def __init__(self, word_list):

        # Counter takes care of word counts and descending order
        self.counts = Counter(word_list)

        # Total word count for relative frequency calculation
        total = float(len(word_list))

        # Change the counts to frequencies
        for k, v in self.counts.items():
            self.counts[k] = v/total

        # Turn Counter dict into a list in descending order
        self.counts = self.counts.most_common()

    def complete(self, pattern):

        # Clean user entered word
        pattern = pattern.lower().strip(string.punctuation)

        # This is the potential bottleneck
        completions = [pair for pair in self.counts if pair[0].startswith(pattern)]

        return completions

I managed to improve upon my first attempts greatly, in terms of speed of execution, which resulted in the code you see above.

However, I believe complete could still be improved. To demonstrate where I am right now, I benchmarked complete on the NLTK corpus of English words, figuring this is a nice "upper bound" on the size of a list that needs to be searched for matches.

from nltk.corpus import words

# words.words() is a list of over 250,000 unique English words
cmp = AutoCompleter(words.words())
pattern = 'do'

%timeit -n 100 cmp.complete(pattern)

The result from timeit was:

41.6 ms ± 898 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

From extensive research I think I can perhaps use some sort of binary search or bisect to speed up the search for matching (partial) words against the users input. Or perhaps structure the input into complete differently, finding the index of matching words then pulling the frequencies out from a different list, and finally outputting the two together.

I also have an idea, in that is a user types "a" then I should only focus the search on words that begin with "a" rather than every word in the passed list. However now I'm getting a bit muddled up on how to best implement that. I have looked around Stack Overflow too, which helped with other aspects but I haven't quite managed to put all the pieces together yet for a final performance boost.

Your help and suggestions are much appreciated. Thank you!

Bango
  • 155
  • 1
  • 9
  • 1
    I'd recommend using https://codereview.stackexchange.com/ – depperm Sep 28 '17 at 14:49
  • 1
    Your intuition about only searching for words starting with `a` when the user types `a` is correct. The data structure implementing this intuition is called a "trie", and a common use of it is in autocompletion. – jme Sep 28 '17 at 14:51
  • @depperm thank you for pointing that out, I'll bear that in mind in the future. – Bango Sep 28 '17 at 14:53
  • @jme thanks, I see there are external libraries for that, but staying true to my constraint I guess I'd look into implementing a trie manually – Bango Sep 28 '17 at 14:53
  • Of course. Then there are other StackOverflow answers which are more on topic, such as [How to create a TRIE in Python](https://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python). As such I've voted to close this question as being too broad. – jme Sep 28 '17 at 14:55

0 Answers0