
I'm parsing some XML (with some Python 3.4 code) and want to retrieve both the text from a node and its id attribute. Example:

<li id="12345"> Some text here </li>

My current code is structured around the text only (I'm adding the id now, but didn't need it before). I'm looping through a list of texts/sentences and then proceed to do some stuff with each one. So I thought of making a dictionary with the text/sentence as key and the id attribute as value.

However, this doesn't feel very efficient. The text can be a whole paragraph, making the key very long, whereas the id is always of fairly limited length (though still of type str, e.g. some alpha characters followed by some digits). Making the ids the keys and the text the values would require some rewriting of my code. None of this is very problematic, but it got me wondering: how inefficient would it be to have the text (potentially a whole paragraph) as key, compared to an id like "ulp_887362487687678" as key?

I could just make two reverse dictionaries (one with the id as key, the other with the text as key) and compare construction, lookup and so on. And I've also found some topics on key length limits (Do Dictionaries have a key length limit?). But I'm mainly wondering what your thoughts are on this. Is having such long str keys in your dict something that you definitely want to avoid, or is it not a big deal? If you could share some pros/cons, that would be great!
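
For concreteness, here is a minimal sketch of the two variants, assuming the XML is parsed with xml.etree.ElementTree and <li> elements like the example above (the variable names and the sample data are just illustrative):

import xml.etree.ElementTree as ET

xml_data = '<ul><li id="ulp_887362487687678"> Some text here </li></ul>'
root = ET.fromstring(xml_data)

text_to_id = {}   # whole paragraph as key, short id as value
id_to_text = {}   # short id as key, whole paragraph as value

for li in root.iter('li'):
    text = (li.text or '').strip()
    node_id = li.get('id')
    text_to_id[text] = node_id
    id_to_text[node_id] = text

print(text_to_id)   # {'Some text here': 'ulp_887362487687678'}
print(id_to_text)   # {'ulp_887362487687678': 'Some text here'}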

Igor

3 Answers


No, Python string length hardly has an impact on dictionary performance. The only influence the string length could have is on the hash() function used to map the key to a hash table slot.

String length has very little impact on the performance of hash():

>>> import random
>>> from timeit import timeit
>>> from string import ascii_letters
>>> generate_text = lambda n: ''.join([random.choice(ascii_letters) for _ in range(n)])
>>> for i in range(8):
...     length = 10 + 10 ** i
...     testword = generate_text(length)
...     timing = timeit('hash(t)', 'from __main__ import testword as t')
...     print('Length: {}, timing: {}'.format(length, timing))
... 
Length: 11, timing: 0.061537027359
Length: 20, timing: 0.0796310901642
Length: 110, timing: 0.0631730556488
Length: 1010, timing: 0.0606122016907
Length: 10010, timing: 0.0613977909088
Length: 100010, timing: 0.0607581138611
Length: 1000010, timing: 0.0672461986542
Length: 10000010, timing: 0.080118894577

I stopped at generating a string of 10 million characters, because I couldn't be bothered waiting for my laptop to generate a 100 million character string.

The timings are pretty much constant, because the value is actually cached on the string object once computed.
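
To see the caching directly, here is a minimal sketch (an illustration, not part of the measurement above): time the first and the second hash() call on one and the same large string object; only the first call actually walks the characters:

import time

s = 'a' * 10**8          # one large string object (about 100 million characters)

t0 = time.perf_counter()
hash(s)                  # first call: computes the hash and caches it on the object
t1 = time.perf_counter()
hash(s)                  # second call: returns the cached value
t2 = time.perf_counter()

print('first  hash():', t1 - t0)   # noticeable, proportional to the length
print('second hash():', t2 - t1)   # essentially instant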

Martijn Pieters

The performance of hash() is indeed O(n) for strings, but the result is cached in the string object: repeated calls will use the cached value. This is possible because strings are immutable. Martijn's code uses the repeating feature of timeit (one million runs per measurement by default), so you cannot see this effect: the hash is actually computed only on the first run, and the remaining 999,999 runs simply return the cached value.

It still is O(n) underneath:

from timeit import timeit

for i in range(10):
    length = 10 ** i
    # notice number=1 !!!
    timing = timeit('hash(t)', 't = "a" * {}'.format(length), number=1)
    print('Length: {:10d}, timing: {:.20f}'.format(length, timing))

Length:          1, timing: 0.00000437500057159923
Length:         10, timing: 0.00000287900184048340
Length:        100, timing: 0.00000342299972544424
Length:       1000, timing: 0.00000459299917565659
Length:      10000, timing: 0.00002153400055249222
Length:     100000, timing: 0.00006719700104440562
Length:    1000000, timing: 0.00066680999952950515
Length:   10000000, timing: 0.00673243699930026196
Length:  100000000, timing: 0.04393487600100343116
Length: 1000000000, timing: 0.39340837700001429766

The differences are due to timing errors, branch prediction, and the like.


Python string length can have a very significant impact on dictionary performance, but you have to get into really big strings.

The issue is apparently that once you've found the right hash bucket, you still have to do a string compare to see if you've got a match. Doing a string compare on two huge strings is expensive unless they are the very same string object in memory (in which case Python is smart enough to declare equality trivially).

>>> import timeit
>>> for i in range(7):
...   dkey = "X" * (10**i)
...   skey = "X" * (10**i)    # Different object; no fast path
...   d = {dkey: 1}
...   tmp = d[skey]           # Hash is now cached on skey object
...   timing = timeit.timeit('d[skey] == 1', globals=globals(), number=1000)
...   print(len(dkey), " timing is ", timing*1000, " microseconds")
... 
1  timing is  0.119031872600317  microseconds
10  timing is  0.1442211214452982  microseconds
100  timing is  0.1361379399895668  microseconds
1000  timing is  0.16252091154456139  microseconds
10000  timing is  0.5145659670233727  microseconds
100000  timing is  5.568335996940732  microseconds
1000000  timing is  63.68113495409489  microseconds
>>> 

Up to strings of around 1,000 characters, there's little overhead due to string size. Once you get past 10,000 characters, dictionary lookup appears to be practically O(n) with respect to string length.
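
For contrast, here is a minimal sketch of the fast path mentioned above (an added illustration with made-up names, not a measurement from this answer): if the lookup key is the very same object that was stored, the identity check short-circuits the character compare, so even a huge key stays fast:

import timeit

big = "X" * 10**6
copy_of_big = "X" * 10**6        # equal contents, but a distinct object
d = {big: 1}
hash(copy_of_big)                # pre-cache the hash so only the compare is timed

same = timeit.timeit('d[big]', globals=globals(), number=1000)
equal = timeit.timeit('d[copy_of_big]', globals=globals(), number=1000)

print('same object :', same)     # identity short-circuit, no character compare
print('equal object:', equal)    # full character-by-character compare each time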

Jack Brennen