0

Given word and a vocabulary I would like to find all entries in the vocabulary with the longest common prefix.

Here are some examples:

> vocabulary = {"english", "englishx", "english-indian", "enya", "spanish"}
> prefix, words = find_by_longest_prefix(vocabulary, "englishy")
english: ['englishx', 'english', 'english-indian']
> prefix, words = find_by_longest_prefix(vocabulary, "entomology")
en: ['enya', 'englishx', 'english', 'english-indian']
> prefix, words = find_by_longest_prefix(vocabulary, "spania
spani: ['spanish']

Imagine now you have to call this method multiple times for different input words but same vocabulary. My naive implementation finds matching words in O(n) time (n is lenght of input word), but takes a lot of memory: O(n*m), where n is the size of vocabulary and m is the length of the longest word in the vocabulary.

from collections import defaultdict


def build_words_by_prefix(vocabulary):
    words_by_prefix = defaultdict(list)
    for word in vocabulary:
        for i in range(1, len(word) + 1):
            prefix = word[:i]
            words_by_prefix[prefix].append(word)
    return words_by_prefix


def find_by_longest_prefix(vocabulary, word):
    words_by_prefix = build_words_by_prefix(vocabulary)
    for i in range(len(word)+1, 1, -1):
        prefix = word[:i]
        words = words_by_prefix.get(prefix, False)
        if words:
            return prefix, words
    return False

I'm looking for a solution that is both memory and time efficient. I've heard about tries for more efficient storing of prefixes but I wonder how can I use them here?

dzieciou
  • 4,049
  • 8
  • 41
  • 85
  • 4
    You could use a tree structure called "trie" to represent the vocabulary. See [How to create a trie in Python](https://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python) – Błotosmętek May 15 '20 at 09:12
  • Many thanks to @Asocia for the discussion. I've added some additional information to the question. – dzieciou May 15 '20 at 10:44
  • 2
    you should probably just put the words in a sorted list and do a binary search. It'll be faster and more compact than walking a trie in python. – Matt Timmermans May 15 '20 at 11:59
  • @MattTimmermans Good point. Looks like this can be implemented using bisect python built-in: https://stackoverflow.com/questions/7380629/perform-a-binary-search-for-a-string-prefix-in-python – dzieciou May 15 '20 at 15:13

1 Answers1

1

Your approach will give relatively fast solutions. Without changing your data structure you could still speed it up (on average) by replacing the reversed loop by a binary search. On average this will result in fewer lookups.

If however you need to reduce the memory consumption, you could make a trie, i.e. where you don't store the prefixes as keys, but single characters. Nested nodes have the letters as keys that come at the next position, ...etc. The node that comes after the last character of a valid word, will have a boolean property set to True. It still could have nested nodes, in case the word has valid extensions too.

I suggest here a Trie implementation that uses more memory than described above, but avoids building a word from single characters at each lookup. The boolean in this trie variant is replaced by the actual word.

class Trie(dict):
    word = ""

    def add(self, *words):
        for word in words:
            node = self
            for ch in word:
                if ch not in node:
                    node[ch] = Trie()
                node = node[ch]
            node.word = word

    def words(self):
        if self.word:
            yield self.word
        for ch in self:
            yield from self[ch].words()

    def longestprefix(self, word):
        node = self
        prefix = word
        for i, ch in enumerate(word):
            if ch not in node:
                prefix = word[:i]
                break
            node = node[ch]
        return (prefix, list(node.words()))

trie = Trie()
trie.add("english", "englishx", "english-indian", "enya", "spanish")
print(trie.longestprefix("englishy"))
print(trie.longestprefix("entomology"))
print(trie.longestprefix("spania"))
trincot
  • 317,000
  • 35
  • 244
  • 286