
If I have a set `my_set` and I want to check whether a variable `x` is contained in the set:

if x in my_set:

Is the time complexity here O(1)?

S.B
D. Bazel

2 Answers


For the average case: yes

For the worst case: no

https://wiki.python.org/moin/TimeComplexity
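A quick, informal way to see the average case in practice (my own sketch, not from the linked wiki; the sizes and iteration counts are arbitrary): lookup time in a set stays essentially flat as the set grows.

```python
import timeit

# Informal check: average-case membership testing should not grow with set size.
for n in (1_000, 100_000, 10_000_000):
    s = set(range(n))
    # time one hit and one miss per iteration
    t = timeit.timeit(lambda: ((n - 1) in s, -1 in s), number=100_000)
    print(f"n={n:>10,}  100k lookups: {t:.3f}s")
```

Compare with a list, where `x in my_list` is a linear scan and the same experiment grows with `n`.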

Ted Bigham
  • Would this time complexity depend on the data type? str vs int, or built in vs custom? – D. Bazel May 03 '22 at 20:30
  • @D.Bazel: The complexity in terms of the length of the `set` does not depend on the data type. That said, the hashing itself is `O(n)` in the length of the data being hashed, and equality testing is also `O(n)` in the amount of data compared, so the performance can vary by data type when the type itself varies tremendously in size. A 4 GB `str` is going to take quite a while to compute the hash for, and if, by coincidence, it shares a hash with another `str` of the same length that begins with the same data (varying only in the last 100 MB or so), then that's going to cost you. – ShadowRanger May 03 '22 at 20:45
  • @ShadowRanger I can't think of a reasonable example where *hashing* takes more than linear time, but for *equality testing*, imagine for example graph objects that do an isomorphism check. That might take more than linear time. – Kelly Bundy May 03 '22 at 20:51
  • @KellyBundy: Point. In 99.9...% of cases, equality testing is `O(n)` in the size of the data being compared (though in practice, the average-case tends to be much better thanks to length checks, short-circuiting on first mismatch, etc.). But there exist unusual cases where it's worse. None of this changes the big-O of `set` in terms of the size of the `set` (the average and expected cost of inserting a small `int` in a `set` remains the same for a `set` of any size, and while inserting a large `int` has higher cost, it's because the `int` is bigger, it doesn't matter how big the `set` is). – ShadowRanger May 03 '22 at 20:58
  • @ShadowRanger Yeah, just saying that especially with "custom" types, it *can* take arbitrarily long. Btw even comparing an ordinary set of ints with its copy can take [quadratic time](https://tio.run/##JY3BCsIwEETv/Yq5mRQtjQWRQg76G@Ih1VSDZhPieiil315Tu4c5PN7sxIGfgZpjTPPcp@DBzlvHcD6GxEg2WsNF0YcEgiNcVF3XW@xzXlGiaQvkO0FjFPuyPKidkpk7LA23NJKhhxUkp795zuapuoU4CPkHnIF3JNYl8Ta@u5t2@ahx3oK@vrNJK7naMTli0W9G0hM4hBdGznPKNm2l@gn@s5Hz/AM) :-) – Kelly Bundy May 03 '22 at 21:08
  • The size of the key is not a factor here. Time complexity is O(1) when NO keys collide. Time complexity is O(n) when ALL keys collide (since the underlying structure per bucket is usually a linked list). – Ted Bigham May 03 '22 at 21:35
  • @TedBigham: It's not a linked list per bucket. Python uses open addressing, just one element per bucket, but jumping around according to an exhaustive search algorithm that guarantees it eventually finds an open one. To my knowledge, that strategy is more common in most languages where everything is stored as pointers to objects; C++ doesn't do it because copying an object can be expensive, and linked lists per bucket can avoid most copies, but Python's just moving a pointer per object; wrapping the pointer in an extra linked list would harm performance. – ShadowRanger May 04 '22 at 03:14
  • Hashtable time complexity is still O(n) if all keys collide, even with open addressing. Linked list "performance" should only be different from a memory perspective. – Ted Bigham May 04 '22 at 18:47
  • @TedBigham: Agreed. I was correcting the erroneous description of the design; big-O-wise, both are equivalent. That said, open addressing (especially with the system Python uses, which always checks pairs of adjacent buckets before jumping around) is more likely to be cache-friendly, since it's a single contiguous table, not a table plus arbitrary numbers of independent allocations. – ShadowRanger May 17 '22 at 15:22
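ShadowRanger's point about hashing cost being linear in the data's size can be checked directly. A rough sketch (sizes are arbitrary; note that CPython caches a `str`'s hash after the first call, so each measurement builds a fresh string in `setup`, which `timeit.repeat` re-runs per repetition):

```python
import timeit

# hash() on a str is O(n) in its length: the first hash of a string
# gets roughly twice as expensive when the string doubles in size.
for size in (1_000_000, 2_000_000, 4_000_000):
    t = min(timeit.repeat("hash(s)",
                          setup=f"s = 'x' * {size}",  # fresh str, hash not cached yet
                          number=1, repeat=5))
    print(f"len={size:>9,}  first hash: {t * 1e3:.2f} ms")
```

None of this changes the big-O in terms of the set's length; it only shows that per-element cost scales with the element's own size.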

It depends on the data. For example, if your set contains string objects (or any other objects whose hashes are well distributed), then the time complexity of membership testing is O(1).

Hashes of string objects are randomized per process for security reasons (to protect dictionary lookups against hash-flooding attacks, since keys are commonly strings). Although the hash algorithm does its best to avoid collisions, they are always possible, and when one occurs Python has to continue along the probe sequence. I mention this only as an overview; it doesn't break my first sentence that the average time complexity is O(1).

On the other hand, if your data all produce the same hash, you won't get O(1).

Here I'm going to force my objects to have identical hashes:

from random import randint


class SameHash:
    def __hash__(self) -> int:
        # return randint(1, 5)  # uncomment to get only occasional collisions
        return 1  # every instance lands on the same probe sequence

    def __eq__(self, other):
        return False  # instances never compare equal, so all five are kept

    def __repr__(self):
        return "SameHash()"


s = {SameHash() for _ in range(5)}
print(s)

# Replace __eq__ so we can watch it being called during a lookup
# (it returns None, which is falsy, so probing continues past every element):
SameHash.__eq__ = lambda self, other: print("__eq__ gets called")

SameHash() in s

output:

{SameHash(), SameHash(), SameHash(), SameHash(), SameHash()}
__eq__ gets called
__eq__ gets called
__eq__ gets called
__eq__ gets called
__eq__ gets called

Now uncomment the `return randint(1, 5)` line and run the script several times to experiment.
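To see the degradation in actual timings rather than in `__eq__` calls, a variant of the class above can be timed as the set grows (my own sketch; the class name, sizes, and iteration counts are arbitrary choices). A miss has to probe past every colliding element, so lookups scale linearly:

```python
import timeit

class AllCollide:
    """Every instance hashes to the same value, forcing long probe chains."""
    def __hash__(self):
        return 1
    def __eq__(self, other):
        return self is other  # distinct instances never compare equal

for n in (100, 200, 400):
    s = {AllCollide() for _ in range(n)}
    probe = AllCollide()  # a miss: it gets compared against the whole chain
    t = timeit.timeit(lambda: probe in s, number=2_000)
    print(f"n={n:>4}  2,000 misses: {t:.4f}s")  # grows roughly linearly with n
```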

S.B
  • The hashes don't need to be the same to produce collisions, they only need to point to the same bucket. – Ted Bigham May 04 '22 at 18:50
  • 1
    @TedBigham Yes that's true. Two distinct objects which produce different hashes *can* point to the same bucket. But with hash collisions in which two distinct objects produce the *same hash*, they always point to the same bucket. I wanted to demonstrate it so I chose the easiest way by producing the same hash. – S.B May 04 '22 at 19:01
  • I was just noting that the first statement is not true. All keys could have different hashes, and still not be O(1). – Ted Bigham May 04 '22 at 19:13
  • @TedBigham Strictly speaking, yes, there may be a little bit of probe-sequence traversal, but such cases are rare, and I don't think it's wrong to claim the time complexity is O(1). We can consider it O(1), can't we? – S.B May 04 '22 at 19:19
  • Best case it's O(1), worst case it's O(n). That also matches the quoted performance on the python wiki. – Ted Bigham May 04 '22 at 19:22
  • @TedBigham I challenge you to provide example data where it takes linear time with different hashes :-). I've tried that a while ago and failed to do it and would be quite interested. (I think what I actually tried was to find values so that putting them all into a set takes quadratic time, so not just linear time for *one* value but for *all*, which is probably even harder). – Kelly Bundy May 05 '22 at 00:11
  • The simple example is if there is only one bucket. Then all hashes collide. – Ted Bigham May 05 '22 at 14:47
  • @TedBigham This is about Python. There is no set with only one bucket (at least not in CPython, don't know about others). – Kelly Bundy May 05 '22 at 15:09
  • This is about time complexity in python, not about whether or not I can quote the exact keys that make exact hashes that cause collisions. Collisions happen whether you want to believe it or not, and when all the keys collide, time complexity is O(n). – Ted Bigham May 05 '22 at 18:22