fastest way to search python dict with partial keyword

Question

What is the fastest way to determine if a dict contains a key starting with a particular string? Can we do better than linear? How can we achieve an O(1) operation when we only know the start of a key?

Here is the current solution:

for key in dict.keys():
    if key.start_with(str):
        return True
return False

I doubt you can acheive anything better as you cannot infer the hash of the key from a part of the key. Also this leaves room to ambiguities if two keys start with the same prefix. — Hyperboreus, Aug 05 '13 at 19:57
There are data structures that can do this, but they aren't available in the Python standard library. Tries or binary search trees, for example. — , Aug 05 '13 at 20:01
Since the question is about speed, I feel obligated to point out that `for key in dict_:` is much faster than `for key in dict_.keys():`, as the latter constructs a list of keys. — Chris Barker, Aug 05 '13 at 20:10
@ChrisBarker: good point for python 2.7; for immutable operations over keys one can use `dict.viewkeys()` — Jakub M., Aug 05 '13 at 20:18
I wonder if you could subclass `str` to get this behavior natively within dictionary keys... — 2rs2ts, Aug 05 '13 at 22:34
Go with a trie as in arshajii's answer. It's basically 1 dictionary lookup per character in the prefix string. — nmclean, Aug 05 '13 at 22:45

arshajii · Accepted Answer · 2013-08-05T20:05:52.387

46

Without preprocessing the dict, O(n) is the best you can do. It doesn't have to be complicated, though:

any(key.startswith(mystr) for key in mydict)

(Don't use dict and str as variable names, those are already the names of two built-in functions.)

If you can preprocess the dict, consider putting the keys in a prefix tree (aka trie). There is even a Python implementation in the Wikipedia article.

edited Aug 05 '13 at 20:05

answered Aug 05 '13 at 19:57

arshajii

127,459
24
238
287

A trie is O(log N), not O(1). But it's almost certainly what you want here. This is pretty much the paradigm case for the data structure. – abarnert Aug 05 '13 at 20:38
@abarnert No, not unless you make the weird assumption that the greatest string length is logarithmic in the number of strings. Lookup in a trie is linear in the length of the key, and thus independent of the number of strings in the trie. – Aug 05 '13 at 22:29
@delnan: N isn't the number of strings, it's the number of distinct symbols. If you have a small and static number of symbols (e.g., with ASCII strings), you can ignore that. If you have a large number of symbols (e.g., arbitrary Unicode), you can't. Either you end up doing a linear search at each trie level, or a log N once. (Yes, it's _also_ linear in the length of the strings, and I neglected that…) – abarnert Aug 05 '13 at 23:02
@abarnert The alphabet size is virtually always constant. In addition, the representations I'm familiar with and tend to assume (though there are other ways) use arrays for constant time lookup of the child for a symbol. This even works for very large alphabets like Unicode, without wasting too much space, by using UTF-8 code units, rather than code points, as nodes. Alternatively, use a decent hash table for O(1) average time. – Aug 05 '13 at 23:53
If you need to return the key itself (defaulting to `None` on no matches): `next( (key for key in mydict if key.startswith(mystr)), None )`. Note that this returns the first match when iterating the dictionary keys [*in no particular order*](https://stackoverflow.com/a/40007169/477563) and stops on first match. – Mr. Llama Apr 22 '18 at 19:29
The Wikipedia article no longer has a Python implementation, but you can readily find some if you search for them. – John Y Nov 26 '18 at 22:28

score 0 · Answer 2 · edited May 23 '17 at 12:34

You could put all the prefixes of the inserted keys to the dict, so for key foo you would insert f, fo and foo. You would have O(1) lookup, but you would spend time on preprocessing (O(k), where k is a key length), and waste lots of memory:

def insert_with_prefixes(key, value, dict_):
  prefixes = (key[:i+1] for i in xrange(len(key)))
  dict_.update((prefix, value) for prefix in prefixes)

For everyday use I would go (and I go) with the method in arshajii's answer. And of course, have in mind possible many collisions for short prefixes (here: "h"):

>>> a = {}
>>> insert_with_prefixes('hello', 'world', a)
>>> insert_with_prefixes('homo', 'sapiens', a)
>>> a
{'h': 'sapiens', 'hom': 'sapiens', 'homo': 'sapiens', 'ho': 'sapiens', 
 'hel': 'world', 'hell': 'world', 'hello': 'world', 'he': 'world'}

fastest way to search python dict with partial keyword

2 Answers2

Linked

Related