
I'm new to posting here, so I hope I provide enough detail. I'm trying to find out if key selection affects the efficiency of dictionaries in Python. Some comparisons I'm thinking of are:

  • numbers vs strings (e.g. would my_dict[20] be faster than my_dict['twenty'])
  • length of strings (e.g. would my_dict['a'] be faster than my_dict['abcdefg'])
  • mixing key types within a dictionary, for instance using numbers, strings, and/or tuples (e.g. would my_dict = {0: 'zero', 2: 'two'} perform faster than {0: 'zero', 'two': 2})

I haven't been able to find anything on this topic via a Google search, so I thought someone here might know.

Seanny123
ThatNewGuy
  • For lookups, integer hash values are used, so the key type has very little impact on performance. – Klaus D. Aug 09 '18 at 19:29
  • Hashing is very fast for pretty much every type. The differences are going to be very small. I wouldn't worry about it. – Patrick Haugh Aug 09 '18 at 19:32
  • See this very similar [SO question](https://stackoverflow.com/questions/11162201/is-it-always-faster-to-use-string-as-key-in-a-dict) (although not sure yours is exactly a duplicate). – benvc Aug 09 '18 at 19:50

2 Answers


First of all, I'd recommend reading How are Python's Built In Dictionaries Implemented.
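To get a feel for what a single lookup actually pays for, here's a minimal sketch of my own (using a made-up CountingKey wrapper, not anything from the linked material) that counts how many times a lookup calls __hash__ and __eq__ on the key:

class CountingKey:
    # Hypothetical wrapper that counts the __hash__/__eq__ calls dict operations make.
    hash_calls = 0
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        CountingKey.hash_calls += 1
        return hash(self.value)

    def __eq__(self, other):
        CountingKey.eq_calls += 1
        return isinstance(other, CountingKey) and self.value == other.value


d = {CountingKey('twenty'): 20}
d[CountingKey('twenty')]  # one lookup with an equal (but distinct) key object
print(CountingKey.hash_calls, CountingKey.eq_calls)  # typically 2 and 1: hash on insert, hash + one eq on lookup

In other words, the key type only matters through how fast its hash and equality checks are, and for small ints and short strings both are extremely cheap.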

Now, let's run a little experiment to test the theory (at least partially):

import timeit
import string
import random
import time


def random_str(N):
    # Build a random string of length N from uppercase letters and digits.
    return ''.join(
        random.choice(string.ascii_uppercase + string.digits) for _ in range(N)
    )


def experiment1(dct, keys):
    # Time one full pass of lookups: fetch every key in `keys` from `dct` once.
    s = time.time()
    {dct[k] for k in keys}
    return time.time() - s


if __name__ == "__main__":
    N = 10000  # number of keys per dictionary
    K = 200    # length of each random string key
    S = 50000  # number of timing samples
    dct1 = {random_str(K): None for k in range(N)}  # string-keyed dict
    dct2 = {i: None for i in range(N)}              # int-keyed dict
    keys1 = list(dct1.keys())
    keys2 = list(dct2.keys())

    samples1 = []
    samples2 = []
    for i in range(S):
        samples1.append(experiment1(dct1, keys1))
        samples2.append(experiment1(dct2, keys2))

    print(sum(samples1), sum(samples2))

    # print(
    #     timeit.timeit('{dct1[k] for k in keys1}', setup='from __main__ import dct1, keys1')
    # )

    # print(
    #     timeit.timeit('{dct2[k] for k in keys2}', setup='from __main__ import dct2, keys2')
    # )

The results I got with different sample sizes on my box were:

  • N=10000, K=200, S=100 => 0.08300423622131348 0.07200479507446289
  • N=10000, K=200, S=1000 => 0.885051965713501 0.7120392322540283
  • N=10000, K=200, S=10000 => 8.88549256324768 7.005417346954346
  • N=10000, K=200, S=50000 => 43.57453536987305 34.82594871520996

So as you can see, whether you use big random strings or integers to look up values in a dictionary, the performance remains almost the same. The only "real" difference you might want to consider is the memory consumption of the two dictionaries. That may be relevant when dumping/loading huge dictionaries to/from disk; in that case it may be worth having a compact form for your data structures so you can shave off a few seconds when caching/reading them.
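If memory is what you care about, one very rough way to compare the two kinds of dictionaries is sys.getsizeof on the container plus its keys (only a crude estimate; it ignores values, interning and object sharing):

import sys


def rough_dict_size(dct):
    # Rough estimate: size of the dict container plus the size of every key object.
    return sys.getsizeof(dct) + sum(sys.getsizeof(k) for k in dct)


dct_str = {'k' * 200 + str(i): None for i in range(10000)}  # ~200-char string keys
dct_int = {i: None for i in range(10000)}                   # small int keys

print(rough_dict_size(dct_str))  # noticeably larger
print(rough_dict_size(dct_int))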

PS: If anyone is able to explain why I was getting really huge times when using timeit (the commented-out parts), please let me know... even with small experiment constants I'd get really high values... that's why I left them commented out. Add a comment if you know the reason ;D
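(One likely explanation, offered here only as a guess: timeit.timeit repeats the statement 1,000,000 times by default, so even a fast lookup pass over 10,000 keys adds up to a huge total. Passing an explicit number= keeps the scale comparable to the manual timings, e.g. appended to the script above:)

import timeit

# timeit runs the statement `number` times (default 1,000,000),
# which would make the totals look enormous; 1,000 repeats is plenty here.
t1 = timeit.timeit('{dct1[k] for k in keys1}',
                   setup='from __main__ import dct1, keys1', number=1000)
t2 = timeit.timeit('{dct2[k] for k in keys2}',
                   setup='from __main__ import dct2, keys2', number=1000)
print(t1, t2)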

BPL

I don't know the answer either, but it's easy enough to test.

from timeit import default_timer as timer
import string

#Create a few dictionaries, all the values are None
nums = range(1,27)
dict_numbers = dict.fromkeys(nums)

letters = string.ascii_lowercase
dict_singleLetter = dict.fromkeys(letters)

long_names = []
for letter in letters:
    long_names.append(''.join([letter, letters]))
dict_longNames = dict.fromkeys(long_names)

#Function to time thousands of runs to average out discrepancies
def testSpeed(dictionary, keys):
    x = None
    start = timer()
    for _ in range(1,100000):
        for i in keys:
            x = dictionary[i]
    end = timer()

    return str(end - start)

#Time the different dictionaries
print("Number took " + testSpeed(dict_numbers, nums) + " seconds")
print("Single letters took " + testSpeed(dict_singleLetter, letters) + " seconds")
print("Long keys took " + testSpeed(dict_longNames, long_names) + " seconds")

All of these dictionaries are the same length and contain the same value for each key. When I ran this, the dictionary with the long keys was actually always the fastest, albeit by only maybe 5%, which could possibly be accounted for by other minor differences I'm unaware of. Numbers and single letters were pretty close in speed, but numbers were generally barely faster than single letters. Hopefully this answers your question. This code should be easy enough to expand to test mixed cases (a sketch of one possible extension is below), but I'm out of time at the moment.
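For instance, a mixed-key dictionary could be timed with a short extension like this (a sketch only, reusing the testSpeed helper and key lists defined above; I haven't benchmarked it carefully):

#Sketch: mix int, single-letter, and long string keys in one dictionary
#26 keys total, same size as the dictionaries above
mixed_keys = list(nums)[:9] + list(letters)[:9] + long_names[:8]
dict_mixed = dict.fromkeys(mixed_keys)

print("Mixed keys took " + testSpeed(dict_mixed, mixed_keys) + " seconds")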

Gunnar