7

How exactly is dict implemented such that lookup becomes linear time when there are collisions? I would assume that it is implemented as a hash table backed by a list. I would presume that a better implementation would be O(log(n)) for the various operations, using a tree to back the table instead. Is there some magic happening behind the scenes to keep the constant-time lookups alive for as long as possible?

My source for this, by the way, is this:

http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=python+complexity

ivan_pozdeev
alexgolec
    Worst-case complexity is not the only factor worth optimizing for. – Anon. Feb 01 '11 at 01:11
    "I would presume that a better implementation would be O(log(n)) for various operations," Why? Have you seen any benchmarks on this? My understanding is "random" probing is actually fastest on average and leads to O(n) as a worst case. What are you assuming and what measurements have you seen? – S.Lott Feb 01 '11 at 01:13
  • I think Python dicts use 32-bit keys, which means that you need 2**31 or near 620000000000000 keys before you'd expect *a single collision* (excluding objects whose implementation of `__hash__` is really bad, but I'd rather see that as a bug). So collisions really have no practical implications and time spent on optimizing for them is time wasted. – Jochen Ritzel Feb 01 '11 at 01:37
  • @Jochen, I think you have unrealistic expectations of a hashing function. One which gives you a collision before exhausting the buckets is _not_ really bad, it's actually quite common. See how many people you can get through before you have a birthday clash, it will most certainly _not_ be 365. You can _have_ perfect hash functions but only if you understand the data in advance. Given a general purpose hashing function, you can create a collision with just two entries if you know the algorithm. – paxdiablo Feb 01 '11 at 01:48
    @Jochen: Of course they have collisions; a hash table does not typically have 2^32 buckets. (Also, 2^31 is just 2147483648, not 620000000000000--and you're forgetting about the birthday problem entirely.) – Glenn Maynard Feb 01 '11 at 01:51

5 Answers

12

Dict is O(1) for most operations, except for operations that touch all elements, such as iteration and copy (in which case, it's obviously O(n)).

See: http://wiki.python.org/moin/TimeComplexity

It has O(n) worst case, because you can always contrive a pathological example where all the keys have the same hash value.
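
To make that concrete, here is a minimal sketch (the `BadKey` class and the sizes are mine, not from the answer) of a key type whose instances all hash to the same value, so every lookup degenerates into a linear scan of one overloaded bucket:

```python
import timeit

class BadKey:
    """Hypothetical key type: every instance hashes to the same value."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 42                      # constant hash => all keys collide
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.value == other.value

for n in (100, 1000):
    good = {i: i for i in range(n)}            # normal int keys
    bad = {BadKey(i): i for i in range(n)}     # all keys share one bucket chain
    t_good = timeit.timeit(lambda: good[n - 1], number=1000)
    t_bad = timeit.timeit(lambda: bad[BadKey(n - 1)], number=1000)
    print(f"n={n}: int keys {t_good:.4f}s, BadKey keys {t_bad:.4f}s")
```

The second timing grows roughly linearly with n, while the first stays essentially flat.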

payne
    Good answer. It's important to keep in mind that [Big-O](http://en.wikipedia.org/wiki/Big_O_notation) is an upper-bound limit -- even if the [amortized performance](http://en.wikipedia.org/wiki/Amortized_analysis) is significantly better. Unfortunately, the amortized performance is often *taken as* the complexity. –  Feb 01 '11 at 01:53
2

The point of choosing one implementation over another isn't necessarily the upper bound, but rather the expected amortized performance. While a given algorithm can have degenerate cases, it is usually "better in practice" than an approach with a provably lower upper bound. In some cases, however, structures must be designed to guard against pathologically bad input.

Also, some languages/libraries -- not sure about Python -- actually change the underlying implementation, such as when the number of items exceeds a low n. This affects the amortized performance (in certain cases), but not necessarily the big O.
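
Python's dict doesn't expose anything like this, but purely to illustrate the idea, here is a hypothetical mapping (the `SmallMap` name and the threshold of 8 are invented for the example) that scans a short list of pairs while it is small and promotes itself to a real dict once it grows:

```python
class SmallMap:
    """Hypothetical mapping: linear scan over a few pairs, a dict beyond a threshold."""

    THRESHOLD = 8

    def __init__(self):
        self._pairs = []      # small case: list of (key, value) tuples
        self._table = None    # large case: a real dict

    def __setitem__(self, key, value):
        if self._table is not None:
            self._table[key] = value
            return
        for i, (k, _) in enumerate(self._pairs):
            if k == key:
                self._pairs[i] = (key, value)
                return
        self._pairs.append((key, value))
        if len(self._pairs) > self.THRESHOLD:
            self._table = dict(self._pairs)   # promote to a hash table
            self._pairs = []

    def __getitem__(self, key):
        if self._table is not None:
            return self._table[key]
        for k, v in self._pairs:
            if k == key:
                return v
        raise KeyError(key)
```

The amortized cost changes at the crossover point, but the worst-case big-O of the large-n behavior is still that of the backing hash table.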

And in conclusion: "It depends".

Happy coding.

1

Consider even the best hash function in the galaxy. There's still a chance that you may walk up one day with a list of values whose hash values all happen to be the same. If you put those in a dict, the system has no choice but to perform linear searches.
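
This isn't purely hypothetical even with Python's built-in hash(): a table only has a handful of buckets, so distinct keys routinely land in the same slot. In the sketch below, the `& 7` bucket calculation is a simplification of the real probing (CPython perturbs the probe sequence), and the hash(-1) == hash(-2) equality is a CPython implementation detail:

```python
# With only 8 slots, a key's bucket is (roughly) hash(key) & 7, so by the
# pigeonhole principle these 10 keys are guaranteed to produce a collision.
keys = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
buckets = {}
for k in keys:
    buckets.setdefault(hash(k) & 7, []).append(k)
print({slot: ks for slot, ks in buckets.items() if len(ks) > 1})

# CPython even has full hash collisions between small ints: -1 is reserved
# as an error code in the C-level hash protocol, so hash(-1) == hash(-2).
print(hash(-1) == hash(-2))   # True on CPython
```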

Using a balanced tree would keep the worst-case time down to O(log n), but the maintenance costs are pretty high. Usually, hash tables perform pretty well.

Pointy
1

> I would presume that a better implementation would be O(log(n)) for various operations, using a tree to back the table instead.

Trees and hash tables have very different requirements and performance characteristics.

  • Trees require an ordered type.
  • Trees require performing order comparisons to find the object. For some objects, like strings, this prevents some significant optimizations: you always need to perform a string comparison, which is nontrivially expensive. This makes the constant factor of O(log n) fairly high.
  • Hash tables require a hashable type, and that you can test for equality, but they don't require an ordered type.
  • Testing for equality can be optimized significantly. If two strings are interned, you can test whether they're equal in O(1) by comparing their pointers, rather than in O(n) by comparing the whole string. This is a massive optimization: in every foo.bar lookup that gets translated to foo.__dict__["bar"], "bar" is an interned string. (See the sketch after this list.)
  • Hash tables are O(n) in the worst case, but examine what leads to that worst case: a very bad hash table implementation (eg. you only have one bucket), or a broken hash function that always returns the same value. When you have a proper hash function and a proper bucketing algorithm, lookups are very cheap--very often approaching constant time.
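
To see the interning fast path in action, the sketch below builds two equal strings at run time (so the compiler doesn't fold them into one object) and compares them before and after sys.intern; sys.intern is the Python 3 spelling (Python 2 used the built-in intern), and the strings themselves are arbitrary:

```python
import sys

# Two equal strings constructed at run time: distinct objects, so == has to
# walk the characters (an O(n) comparison).
a = "-".join(["bar"] * 1000)
b = "-".join(["bar"] * 1000)
print(a == b, a is b)        # True False

# Interning maps equal strings onto one shared object, so "same key?" can be
# answered with a pointer comparison; identifier-like literals such as the
# "bar" in foo.bar are interned by the compiler automatically.
ia = sys.intern(a)
ib = sys.intern(b)
print(ia == ib, ia is ib)    # True True
```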

Trees do have significant advantages:

  • They tend to have lower memory requirements, since they don't have to preallocate buckets. The smallest tree might be 12 bytes (a node pointer and two child pointers), whereas a hash table tends to be 128 bytes or more--sys.getsizeof({}) on my system is 136.
  • They allow ordered traversal; it's extremely useful to be able to iterate over [a,b) in an ordered set, which hash tables don't allow.

I do consider it a flaw that Python doesn't have a standard binary tree container, but for the performance characteristics needed by the Python core, like __dict__ lookups, a hash table does make more sense.
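
Lacking a standard tree container, the ordered [a, b) traversal from the second bullet above can still be sketched with the standard bisect module over a sorted list standing in for a tree; the irange helper name is my own:

```python
import bisect

# A sorted list standing in for an ordered set; a balanced tree would give
# the same interface with O(log n) insertion instead of O(n).
items = sorted({3, 5, 8, 17, 23, 31, 42})

def irange(sorted_items, a, b):
    """Return every element x with a <= x < b, in order."""
    lo = bisect.bisect_left(sorted_items, a)
    hi = bisect.bisect_left(sorted_items, b)
    return sorted_items[lo:hi]

print(irange(items, 5, 31))   # [5, 8, 17, 23]
```

A hash table can answer "is x present?" quickly, but it has no cheap way to answer this kind of range query.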

Glenn Maynard
1

Reliable sources of information about the hash functions and collision-resolution strategy that are actually used include the comments in the source file dictobject.c and the whole of the file dictnotes.txt.

John Machin