6

I have a dictionary whose keys come in sets that share the same prefix, like this:

d = { "key1":"valA", "key123":"valB", "key1XY":"valC",
      "key2":"valD", "key2-22":"valE" }

Given a query string, I need to look up all the values associated with keys that start with that prefix, e.g. for query="key1" I need to get ["valA", "valB", "valC"]

My implementation below works but is too slow for a large number of queries since the dictionary d has about 30,000 keys and most of the keys are more than 20 characters long:

result = [d[s] for s in d.keys() if s.startswith(query)]

Is there a faster/more efficient way to implement this?

enrico.bacis
  • 30,497
  • 10
  • 86
  • 115
I Z
  • 5,719
  • 19
  • 53
  • 100
  • I don't think so - keys are hashed, which means that keys of `key1` and `key12` may have totally different representations in the lookup table. – TigerhawkT3 Aug 05 '15 at 19:35
  • @PadraicCunningham Unless it's `Python3` then `keys` actually returns an iterator if I'm not wrong.. just like `Python2`'s `ikeys` – DeepSpace Aug 05 '15 at 19:43
  • If you want efficiency then go for a database engine (Mysql for example) these are designed to retrieve data as efficiently as possible. – kasper Taeymans Aug 05 '15 at 19:46
  • let us know what approach is faster in your case -- you can [post your own answer with the comparison](http://stackoverflow.com/help/self-answer) – jfs Aug 06 '15 at 04:15

3 Answers3

9

You can avoid producing the intermediate list generated by dict.keys() (in python 2.x):

result = [d[key] for key in d if key.startswith(query)]

But you most likely want to use a trie instead of a dictionary, so you can find all the values associated with a key with a common prefix (a trie is similar to a tree based on prefixes).

Here you can find some different implementation of tries.

A trie for keys "A", "to", "tea", "ted", "ten", "i", "in", and "inn".

A trie for keys "A", "to", "tea", "ted", "ten", "i", "in", and "inn". (source wikipedia)


Let's compare the timings for the different solutions:

# create a dictionary with 30k entries
d = {str(x):str(x) for x in xrange(1, 30001)}
query = '108'

# dict with keys()
%timeit [d[s] for s in d.keys() if s.startswith(query)]

    100 loops, best of 3: 8.87 ms per loop

# dict without keys()
%timeit [d[s] for s in d if s.startswith(query)]

    100 loops, best of 3: 7.83 ms per loop

# 11.72% improvement

# PyTrie (https://pypi.python.org/pypi/PyTrie/0.2)
import pytrie
pt = pytrie.Trie(d)

%timeit [pt[s] for s in pt.iterkeys(query)]

    1000 loops, best of 3: 320 µs per loop

# 96.36% improvement

# datrie (https://pypi.python.org/pypi/datrie/0.7)
import datrie
dt = datrie.Trie('0123456789')
for key, val in d.iteritems():
    dt[unicode(key)] = val

%timeit [dt[s] for s in dt.keys(unicode(query))]

    10000 loops, best of 3: 162 µs per loop

# 98.17% improvement
enrico.bacis
  • 30,497
  • 10
  • 86
  • 115
  • 1
    In the dict approach, I would stick with deferring the value access until the key is confirmed to be valid (i.e., `[d[s] for s in d if s.startswith(...)]`). – chepner Aug 05 '15 at 19:51
  • @chepner: yes, I think it could be more efficient, especially if, as expected, there are not many matches. – enrico.bacis Aug 05 '15 at 19:54
  • Your comparison does not include the trie construction time or how much memory does it require. Could add the comparison with [`SuffixTree.SubstringDict(b'0123456789')`](http://stackoverflow.com/a/31846621/4279)? Here's [comparison for a similar problem](http://stackoverflow.com/a/5479374/4279) – jfs Aug 08 '15 at 23:18
0

The sortedContainers lib has a SortedDict implementation, once you have sorted dict you can bisect_left to find where to start, bisect_right to find the last position then use irange to get the keys in the range:

from sortedcontainers import SortedDict
from operator import itemgetter
from itertools import takewhile


d = { "key1":"valA", "key123":"valB", "key1XY":"valC",
  "key2":"valD", "key2-22":"valE","key3":"foo" }

key = "key2"
d = SortedDict(sorted(d.items(), key=itemgetter(0)))
start = d.bisect_left(key)
print([d[key] for key in takewhile(lambda x: x.startswith("key2"), d.irange(d.iloc[start]]))
['valD', 'valE']

Once you maintain a sorteddict using the sorteddict is a lot more efficint:

In [68]: l = ["key{}".format(randint(1,1000000)) for _ in range(100000)] 
In [69]: l.sort()    
In [70]: d = SortedDict(zip(l,range(100000)))

In [71]: timeit [d[s] for s in d.keys() if s.startswith("key2")]
10 loops, best of 3: 124 ms per loop

In [72]: timeit [d[s] for s in d if s.startswith("key2")]
10 loops, best of 3: 24.6 ms per loop

In [73]: %%timeit
key = "key2"
start = d.bisect_left(key)
l2 =[d[k] for k in takewhile(lambda x: x.startswith("key2"),d.irange(d.iloc[start]))]
   ....: 

100 loops, best of 3: 5.57 ms per loop
Padraic Cunningham
  • 176,452
  • 29
  • 245
  • 321
0

You could use a suffix tree:

#!/usr/bin/env python2
from SuffixTree import SubstringDict # $ pip install https://github.com/JDonner/SuffixTree/archive/master.zip

d = { "key1":"valA", "key123":"valB", "key1XY":"valC",
      "key2":"valD", "key2-22":"valE" }

a = '\n' # anchor
prefixes = SubstringDict()
for key, value in d.items(): # populated the tree *once*
    prefixes[a + key] = value # assume there is no '\n' in key

for query in ["key1", "key2"]: # perform queries
    print query, prefixes[a + query]

Output

key1 ['valC', 'valA', 'valB']
key2 ['valE', 'valD']
jfs
  • 399,953
  • 195
  • 994
  • 1,670
  • The author of this particular suffix tree has declared his project dead, with the comment "there are plenty of other suffix tree implementations in Python". Trouble is, the ones I've tried so far don't expose a convenient dict-like object. – tripleee Apr 16 '18 at 11:21