
I was reading about the time complexity of set operations in CPython and learned that the in operator for sets has an average time complexity of O(1) and a worst-case time complexity of O(n). I also learned that the worst case wouldn't occur in CPython unless the set's hash table's load factor is too high.

This made me wonder: when would such a case occur in the CPython implementation? Is there simple demo code which shows a set with a clearly observable O(n) time complexity for the in operator?

ruohola

3 Answers


Load factor is a red herring. In CPython, sets (and dicts) automatically resize to keep the load factor under 2/3. There's nothing you can do in Python code to stop that.

O(N) behavior can occur when a great many elements have exactly the same hash code. Then they map to the same hash bucket, and set lookup degenerates to a slow form of linear search.

The easiest way to contrive such bad elements is to create a class with a horrible hash function. Like, e.g., and untested:

class C:
    def __init__(self, val):
        self.val = val
    def __eq__(a, b):
        # Instances compare equal whenever their payloads are equal.
        return a.val == b.val
    def __hash__(self):
        # Deliberately terrible: every instance hashes to the same value,
        # so they all land in the same hash bucket.
        return 3

Then hash(C(i)) == 3 regardless of the value of i.
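For instance, a rough timing sketch along these lines (the sizes and the probe value are arbitrary) should show membership tests slowing down roughly linearly with the set size:

import time

class C:  # same class as above
    def __init__(self, val):
        self.val = val
    def __eq__(a, b):
        return a.val == b.val
    def __hash__(self):
        return 3  # every instance collides

for n in (500, 1000, 2000):
    s = {C(i) for i in range(n)}   # building this is itself quadratic
    probe = C(-1)                  # absent, so the whole chain is walked
    t0 = time.perf_counter()
    for _ in range(100):
        probe in s
    print(n, time.perf_counter() - t0)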

To do the same with builtin types requires deep knowledge of their CPython implementation details. For example, here's a way to create an arbitrarily large number of distinct ints with the same hash code:

>>> import sys
>>> M = sys.hash_info.modulus
>>> set(hash(1 + i*M) for i in range(10000))
{1}

which shows that the ten thousand distinct ints created all have hash code 1.
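A rough timing sketch along the same lines (again with arbitrary sizes): probing a set built from such colliding ints is dramatically slower than probing a set of ordinary ints of the same size:

import sys
import timeit

M = sys.hash_info.modulus
n = 5000

colliding = {1 + i * M for i in range(n)}   # all hash to 1
ordinary = set(range(n))                    # hashes spread across the table

# Probe with absent values so the full collision chain must be walked.
print(timeit.timeit(lambda: (1 + n * M) in colliding, number=1000))
print(timeit.timeit(lambda: -1 in ordinary, number=1000))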

Tim Peters

You can view the set source here, which can help: https://github.com/python/cpython/blob/723f71abf7ab0a7be394f9f7b2daa9ecdf6fb1eb/Objects/setobject.c#L429-L441

It's difficult to devise a specific example, but luckily the theory is fairly simple :) The set stores keys using a hash of the value; as long as that hash is unique enough, you'll end up with the expected O(1) performance.

If for some weird reason all of your items have different data but the same hash, they collide and the set has to check every one of them separately.

To illustrate, you can see the set as a dict like this:

import collections


your_set = collections.defaultdict(list)


def add(value):
    # Each bucket is keyed by the hash; colliding values share a bucket.
    your_set[hash(value)].append(value)


def contains(value):
    # This is where your O(n) can occur: if all values share the same
    # hash(), every one of them has to be compared individually.
    values = your_set.get(hash(value), [])
    for v in values:
        if v == value:
            return True
    return False
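
To see the model in action, something along these lines (arbitrary class and sizes) piles every value into a single bucket:

class BadHash:
    def __init__(self, val):
        self.val = val
    def __eq__(self, other):
        return self.val == other.val
    def __hash__(self):
        return 42  # every instance collides


for i in range(1000):
    add(BadHash(i))

print(len(your_set))            # 1 -- a single bucket holds all 1000 values
print(contains(BadHash(999)))   # True, but only after scanning that bucket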
Wolph

This is sometimes called the 'amortization' of a set or dictionary. It shows up now and then as an interview question. As @TimPeters says, resizing happens automagically at 2/3 capacity, so you'll only hit O(n) if you force the hash collisions yourself.

In computer science, amortized analysis is a method for analyzing a given algorithm's complexity, or how much of a resource, especially time or memory, it takes to execute. The motivation for amortized analysis is that looking at the worst-case run time per operation, rather than per algorithm, can be too pessimistic.

/* GROWTH_RATE. Growth rate upon hitting maximum load.
 * Currently set to used*3.
 * This means that dicts double in size when growing without deletions,
 * but have more head room when the number of deletions is on a par with the
 * number of insertions.  See also bpo-17563 and bpo-33205.
 *
 * GROWTH_RATE was set to used*4 up to version 3.2.
 * GROWTH_RATE was set to used*2 in version 3.3.0
 * GROWTH_RATE was set to used*2 + capacity/2 in 3.4.0-3.6.0.
 */
#define GROWTH_RATE(d) ((d)->ma_used*3)

More to the efficiency point: why 2/3? The Wikipedia article on hash tables has a nice graph of average insertion time versus load factor: https://upload.wikimedia.org/wikipedia/commons/1/1c/Hash_table_average_insertion_time.png (the linear-probing curve corresponds to the O(1)-to-O(n) behaviour we care about here; chaining is a different collision-handling approach). See https://en.wikipedia.org/wiki/Hash_table for the complete discussion.
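As a back-of-the-envelope illustration, using the textbook linear-probing estimate rather than CPython's actual perturbed probing, the expected number of probes for an unsuccessful lookup grows sharply once the load factor passes about 2/3:

# Knuth's classic estimate for linear probing: an unsuccessful lookup
# needs roughly (1 + 1/(1 - a)**2) / 2 probes at load factor a.
for a in (0.5, 2 / 3, 0.9, 0.95):
    probes = (1 + 1 / (1 - a) ** 2) / 2
    print(f"load factor {a:.2f}: ~{probes:.1f} probes")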

Say you have a set or dictionary which is stable, and is at 2/3 - 1 of its underlying capacity. Do you really want sluggish performance forever? You may wish to force resizing it upwards.
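
One fragile way to do that in current CPython relies on an implementation detail: set tables grow on insertion but are not shrunk when elements are discarded (discarded slots just leave 'dummy' markers behind), so inserting throwaway elements and then removing them leaves a bigger table:

import sys

s = set(range(100000))
print(sys.getsizeof(s))          # size of the "natural" backing table

dummies = [object() for _ in range(200000)]
s.update(dummies)                # forces the table to grow
s.difference_update(dummies)     # drop the dummies; the larger table remains
print(sys.getsizeof(s))          # noticeably bigger than before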

"if the keys are always known in advance, you can store them in a set and build your dictionaries from the set using dict.fromkeys()." plus some other useful if dated observations. Improving performance of very large dictionary in Python

For a good read on dictresize() (dict was in Python before set): https://github.com/python/cpython/blob/master/Objects/dictobject.c#L415

ShpielMeister
    This isn't to do with amortization; the time complexity of a single `in` operation on a Python set is O(1) in the average case. You don't have to average over a whole algorithm for it to be O(1) per operation. – kaya3 Jan 25 '20 at 05:46
  • OP " and worst case time complexity of O(n)." As the hash table fills in, every 'in' request on a set or dict lookup tends to O(n). – ShpielMeister Jan 25 '20 at 05:54
  • Yes, but that isn't to do with amortization; that's the time it takes for a single `in` operation in the worst case. It's not averaged over a large number of `in` operations. – kaya3 Jan 25 '20 at 06:01
  • The O(n) is subject to resizing, of course. `By instrumenting a pure Python model for dictionaries (such as this one), it is possible to count the weighted-average number of probes for an alternative insertion order. For example, inserting dict.fromkeys([11100, 22200, 44400, 33300]) averages 1.75 probes per lookup. That beats the 2.25 average probes per lookup for dict.fromkeys([33300, 22200, 11100, 44400]).` .Raymond Hettinger from "improving performance" link is in my answer above – ShpielMeister Jan 25 '20 at 06:04
  • The `in` operation doesn't modify the state of the set; it cannot trigger a resizing of the set's underlying array. Your quote from Raymond Hettinger doesn't say anything about amortization, and he is not talking about an amortized average; just an average. – kaya3 Jan 25 '20 at 06:10
  • Of course, not. the point is, if a set is near the 2/3 capacity of the underlying hash table, The 'in' function will be relatively sluggish, and it may be worth resizing the table upwards by forcing the resize. For a huge set which will be heavily utilized, it may be worth the effort, since it could save significant time. The use case is the determining factor. The cost of the resize should also be taken into account. – ShpielMeister Jan 25 '20 at 06:19