-2

I have a long dictionary mapping strings to strings, like

{'a':'b' , 'c':'d'}

I want to efficiently search for a key so that I can get the corresponding value from the dictionary, but I would like a better search than just iterating over the entire dictionary. Is there a better way, like storing the data in a set, so that I can search efficiently? I looked at sets, but I could only find ways to store individual strings, not dictionary elements.

user40647
  • 55
  • 11
  • What's with `iterating over` a dictionary? Since when is it not an efficient way? – GLHF Jun 16 '16 at 04:26
  • 1
    You don't "search" for a key, you just access the dictionary by it. It's even in the tutorial. – Ignacio Vazquez-Abrams Jun 16 '16 at 04:27
  • By iterating over I meant running a loop, like `for i in range(0, len(dict))` – user40647 Jun 16 '16 at 04:29
  • Python sets are *built on top of dictionaries* because dictionaries have fast lookup, and can't have duplicate keys - so sets won't help you. `for key in mydict.keys(): print key` will iterate over the keys instead of the keys and values. But what do you mean 'search', what are you doing? It sounds a bit like the wrong datastructure for the job. – TessellatingHeckler Jun 16 '16 at 04:34
  • Time complexity of set and dict is similar. On average it is O(1), but in the worst case O(n). It all succeeds or fails with the quality and time complexity of the hash function used to find slots for "keys". If the hash function is slow, well, with strings that should be the "fast track" and no problem. When the entries produce many collisions (quality of the hash function), then you will approach the O(n) regime more and more. That would be noticeable also upon insertion. Did you measure any problematic insert / retrieval times already? Is the size n 10^3, 10^4, 10^5, ...? – Dilettant Jun 16 '16 at 04:47

3 Answers

1

If you have a dictionary d and want to test for membership of a key k, you can just use k in d or k not in d. For example:

>>> d = {'a':'b' , 'c':'d'}
>>> 'a' in d
True
>>> 'x' in d
False
>>> 'a' not in d
False
>>> 'x' not in d
True
>>>

These checks should be very efficient since dictionaries (and sets) are implemented with hash tables.
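
If you also need the corresponding value rather than just a membership test, indexing the dictionary (or dict.get with a default) goes through the same constant-time hash lookup; a minimal sketch:

>>> d = {'a':'b' , 'c':'d'}
>>> d['a']                  # direct lookup; raises KeyError if the key is absent
'b'
>>> d.get('x', 'missing')   # .get returns a default instead of raising
'missing'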

Tom Karzes
  • 22,815
  • 2
  • 22
  • 41
0

There is already an interesting question on How are Python's Built In Dictionaries Implemented ...

but why not try and measure here (after having read about time complexity in the Python implementation):

#! /usr/bin/env python
from __future__ import division, print_function

import dis  # for disassembling (bottom)
import string  # string.printable as sample character source
import timeit


chars = tuple(string.printable)  # 100 distinct characters as key material
cartesian = tuple(a + b for a in chars for b in chars)  # 10**4 two-char strings
assert 10000 == len(cartesian), print(len(cartesian))
d = dict((a + b, b) for a in cartesian for b in chars)  # 10**6 three-char keys

assert 1000000 == len(d), print(len(d))
assert d['zzz'] == 'z'

setup = """
import string
chars = tuple(string.printable)
d = dict((a + b, b) for a in chars for b in chars)
"""

assert 1000000 / 10000 == 100
setup_100x = """
import string
chars = tuple(string.printable)
cartesian = tuple(a + b for a in chars for b in chars)
d = dict((a + b, b) for a in cartesian for b in chars)
"""

stmt = """
'zzz' in d
"""


t = timeit.timeit(stmt=stmt, setup=setup, timer=timeit.default_timer,
                  number=timeit.default_number)

print("# Timing[secs] for 1x10000:", t)

t_100x = timeit.timeit(stmt=stmt, setup=setup_100x, timer=timeit.default_timer,
                       number=timeit.default_number)

print("# Timing[secs] for 100x10000:", t_100x)

disassemble_me = "'zzz' in {'a': 'b'}"
print("# Disassembly(lookup in dict with 1 string entry):")
print("#", disassemble_me)
dis.dis(disassemble_me)

disassemble_me = "'zzz' in {'a': 'b', 'c': 'd'}"
print("# Disassembly(lookup in dict with 2 string entries):")
print("#", disassemble_me)
dis.dis(disassemble_me)

disassemble_me = "'zzz' in {'a': 'b', 'c': 'd', 'e': 'f'}"
print("# Disassembly(lookup in dict with 3 string entries):")
print("#", disassemble_me)
dis.dis(disassemble_me)

On my machine with Python 2.7.11 this gives:

# Timing[secs] for 1x10000: 0.0406861305237
# Timing[secs] for 100x10000: 0.0472030639648
# Disassembly(lookup in dict with 1 string entry):
# 'zzz' in {'a': 'b'}
        0 <39>           
        1 SETUP_FINALLY   31354 (to 31358)
        4 <39>           
        5 SLICE+2        
        6 BUILD_MAP        8302
        9 <123>           24871
       12 <39>           
       13 INPLACE_DIVIDE 
       14 SLICE+2        
       15 <39>           
       16 DELETE_GLOBAL   32039 (32039)
# Disassembly(lookup in dict with 2 string entries):
# 'zzz' in {'a': 'b', 'c': 'd'}
        0 <39>           
        1 SETUP_FINALLY   31354 (to 31358)
        4 <39>           
        5 SLICE+2        
        6 BUILD_MAP        8302
        9 <123>           24871
       12 <39>           
       13 INPLACE_DIVIDE 
       14 SLICE+2        
       15 <39>           
       16 DELETE_GLOBAL   11303 (11303)
       19 SLICE+2        
       20 <39>           
       21 DUP_TOPX        14887
       24 SLICE+2        
       25 <39>           
       26 LOAD_CONST      32039 (32039)
# Disassembly(lookup in dict with 3 string entries):
# 'zzz' in {'a': 'b', 'c': 'd', 'e': 'f'}
        0 <39>           
        1 SETUP_FINALLY   31354 (to 31358)
        4 <39>           
        5 SLICE+2        
        6 BUILD_MAP        8302
        9 <123>           24871
       12 <39>           
       13 INPLACE_DIVIDE 
       14 SLICE+2        
       15 <39>           
       16 DELETE_GLOBAL   11303 (11303)
       19 SLICE+2        
       20 <39>           
       21 DUP_TOPX        14887
       24 SLICE+2        
       25 <39>           
       26 LOAD_CONST      11303 (11303)
       29 SLICE+2        
       30 <39>           
       31 LOAD_NAME       14887 (14887)
       34 SLICE+2        
       35 <39>           
       36 BUILD_TUPLE     32039

So 1,000,000 lookups of 'zzz' (timeit.default_number == 1000000) against the 10^4-entry dict take approx. 40 milliseconds in total, and below 50 milliseconds against the dict with 100 times as many entries, i.e. 10^6 entries; that is roughly 40 to 50 nanoseconds per lookup. Note that 'zzz' is a miss in the small dict (whose keys are only two characters long) and a hit in the large one.

# Timing[secs] for 1x10000: 0.0406861305237
# Timing[secs] for 100x10000: 0.0472030639648

Measuring means repeatability :-), thus running it again:

# Timing[secs] for 1x10000: 0.0441079139709
# Timing[secs] for 100x10000: 0.0460820198059

it settles (together with other runs not shown here), for a dict with these key types and length relations (the keys of the larger dict are also longer!), on the observation that no linear worst case is realized here. It is more like 10% longer run times for the dict with 100 times as many entries and 50% longer keys.

Looks good. I suggest always measuring and disassembling when in doubt. HTH.
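
For a quick ad-hoc check without writing a script, the timeit module can also be driven from the command line; a minimal sketch (the dict built in -s is just stand-in data, not the OP's):

python -m timeit -s "d = dict.fromkeys(map(str, range(10**6)), 'x')" "'999999' in d"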

PS: OP might still want to provide some more code context in future questions, as a data structure is best selected knowing how it will be used ;-)

PPS: Raymond Hettinger et al. have often optimized the CPython implementation with great attention to detail, so always expect specialized, unrolled implementations for small problem sizes; that is why the disassembly of a toy-sized variant of a problem might differ substantially from the one realized for huge tasks. That is also why I often prefer timeit (and profile measurements) over disassembly, but we should get used to reading byte code, to get ideas when measured performance fails to meet our expectations.

Otherwise: Enjoy the disassembly of the lookup :-)
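
If you would rather see the byte code of the lookup expression itself (instead of the raw string bytes), one option is to compile the expression first and then disassemble the resulting code object; a minimal sketch, where the exact opcode names differ between Python versions:

import dis

code = compile("'zzz' in d", "<string>", "eval")  # compile the membership test
dis.dis(code)
# expect roughly: LOAD_CONST 'zzz', LOAD_NAME d, and one 'in' comparison opcode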

Update: ... and if you change the timed statement so that the small dict receives a hit ('zz') and the large one does not (the reverse of the above), you may also encounter these timings:

# Timing[secs] for 1x10000: 0.0533709526062
# Timing[secs] for 100x10000: 0.0458760261536

where the test 'zz' in d takes 53 milliseconds against the small dict and 46 milliseconds against the large one (totals over 1,000,000 trials).

Community
  • 1
  • 1
Dilettant
  • 3,267
  • 3
  • 29
  • 29
0

You could use the str.translate function to replace a string's characters using a translation table (built here from the dict) that maps each key to its corresponding value.
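
For example, a minimal Python 3 sketch, assuming every key in the dict is a single character (str.maketrans requires single-character keys when given a dict):

mapping = {'a': 'b', 'c': 'd'}
table = str.maketrans(mapping)      # build a translation table from the dict
print('cabbage'.translate(table))   # -> 'dbbbbge'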

Gábor Fekete
  • 1,343
  • 8
  • 16