I have a list of strings, for example:

py
python
co
comp
computer

I simply want to get the string that has the largest possible number of prefixes from the list. The result should be 'computer', because both of its prefixes, 'co' and 'comp', appear in the list (2 prefixes).

I have this code (wordlist is a dictionary mapping each word to a count):

for i in wordlist:
    word = str(i)
    for j in wordlist:
        if word.startswith(j):
            wordlist[i] += 1
result = max(wordlist, key=wordlist.get)

Is there any better, faster way to do that?

jsonode
  • Well, one improvement would be to cut the `word = str(i)` part out and simply use the variable `i`. If `i` is already a string, there is no reason to convert it. Unless I'm missing something very obvious... – Christian Dean Feb 13 '17 at 20:04
  • You might be able to use a radix tree, but I think that would still be O(n^2) – Patrick Haugh Feb 13 '17 at 20:04
  • @PatrickHaugh Why would it be O(n**2)? You could simply iterate over all the leaves and count the terminal nodes between leaf and root node. – Eric Duminil Feb 13 '17 at 20:08
  • @EricDuminil You'd have to build the tree as well. I think that's worst-case n**2 – Patrick Haugh Feb 13 '17 at 20:21
  • @PatrickHaugh It's O(m*n) with n the number of words and m the average word length. In the above case, `m` is probably bounded, so it's O(n) – Eric Duminil Feb 13 '17 at 20:25

3 Answers

The data structure you are looking for is called a trie. The Wikipedia article about this kind of search tree is certainly worth reading. The key property of the trie that comes in handy here is this:

All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string.

The code could look as follows:

words = """py
python
co
comp
computer""".split()

def make_trie(ws):
    """Build trie from word list `ws`."""
    r = {}  # trie root
    for w in ws:
        d = r
        for c in w:
            d = d.setdefault(c, {})  # get c, set to {} if missing
        d['$'] = '$' # end marker
    return r

def num_pref(t, ws):
    """Use trie `t` to find word with max num of prefixes in `ws`."""
    b, m = -1, ''  # max prefixes, corresp. word
    for w in ws:
        d, p = t, 1
        for c in w:
            if '$' in d: p += 1
            d = d[c]  # navigate down one level
        if p > b: b, m = p, w
    return b, m

t = make_trie(words)
print(num_pref(t, words))

make_trie builds the trie; num_pref uses it to determine the word with the maximum number of prefixes. It prints (3, 'computer'): the count starts at 1 for the word itself and adds one for each proper prefix encountered along the way ('co' and 'comp').

Obviously, the two functions could be combined. I kept them separate to make the process of building a trie clearer.
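
For reference, this is the nested-dict trie that make_trie builds from the sample words ('$' entries mark the ends of complete words):

{'p': {'y': {'$': '$',
             't': {'h': {'o': {'n': {'$': '$'}}}}}},
 'c': {'o': {'$': '$',
             'm': {'p': {'$': '$',
                         'u': {'t': {'e': {'r': {'$': '$'}}}}}}}}}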

dnswlt

For a large number of words, you could build a trie.

You could then iterate over all the leaves and count the number of terminal nodes (nodes holding an end-of-word value) between the root and each leaf; a sketch of this follows below.

With n words of bounded length, this should require O(n) steps, compared to your O(n**2) solution.

This package looks good, and here's a related thread.
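
Here's a minimal sketch of that single-pass count, reusing the nested-dict make_trie (and words) from the answer above; the traversal itself is my illustration, not code from the linked package:

def best_by_prefixes(trie):
    """Walk the trie once; at each end marker, the number of markers
    seen so far on the path is that word's prefix count (the word
    itself included, matching num_pref above)."""
    best_count, best_word = 0, ''
    stack = [(trie, '', 0)]  # (node, word spelled so far, markers seen)
    while stack:
        node, word, seen = stack.pop()
        if '$' in node:
            seen += 1
            if seen > best_count:
                best_count, best_word = seen, word
        for c, child in node.items():
            if c != '$':  # skip end markers, descend into children
                stack.append((child, word + c, seen))
    return best_count, best_word

print(best_by_prefixes(make_trie(words)))  # prints (3, 'computer')

Every trie node is visited exactly once, so the scan costs O(total number of characters), i.e. the O(m*n) mentioned in the comments.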

Eric Duminil

The "correct" way is with some sort of trie data structure or similar. However, if your words are already sorted you can get quite a speedup in practical terms with some rather simple code that uses a prefix stack instead of a brute force search. This works since in sorted order, all prefixes precede their prefixed word (making it easy to get a result via a simple linear scan). Think of it as a reasonable compromise between simple code and efficient code:

prefixes = []   # Stack of all applicable prefixes up to this point (normally very small)
max_prefixes = [None]
for w in sorted(wordlist):
    while prefixes and not w.startswith(prefixes[-1]):
        prefixes.pop()
    prefixes.append(w)
    if len(prefixes) >= len(max_prefixes):
        max_prefixes = list(prefixes)
result = max_prefixes[-1]
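
On the example list above, this leaves result set to 'computer', matching the expected answer.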

Running on all dictionary words on my Linux box (479,828 of them), the above code takes only 0.68 seconds (the original code doesn't complete in a reasonable amount of time). On the first 10,000 words, my code takes 0.02s versus 19.5s for the original code.

If you want really efficient code (say, you're dealing with gigabytes of data), you're better off using the proper data structures coded up in a cache-friendly manner in C. But that could take weeks to write properly!

Cameron