228

I've seen people say that set objects in Python have O(1) membership-checking. How are they implemented internally to allow this? What sort of data structure do they use? What other implications does that implementation have?

Every answer here was really enlightening, but I can only accept one, so I'll go with the closest answer to my original question. Thanks all for the info!

Cody Guldner
  • 2,888
  • 1
  • 25
  • 36
Daenyth
  • 35,856
  • 13
  • 85
  • 124
  • 1
    This Medium [article](https://abalcerek.medium.com/how-to-implement-set-with-python-172e73cb2487) shows what a pure Python implementation looks like. – abalcerek Oct 30 '20 at 09:08

6 Answers

214

According to this thread:

Indeed, CPython's sets are implemented as something like dictionaries with dummy values (the keys being the members of the set), with some optimization(s) that exploit this lack of values

So basically a set uses a hash table as its underlying data structure. This explains the O(1) membership checking, since looking up an item in a hash table is an O(1) operation on average.
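As a rough, machine-dependent illustration of that average-case behavior, a small timing sketch (the sizes and numbers below are just example values) shows that a membership check takes roughly the same time whether the set holds a thousand elements or a million:

```python
# Rough illustration only: absolute timings vary by machine and Python version;
# the point is that lookup time stays roughly flat as the set grows.
import timeit

for size in (10**3, 10**5, 10**6):
    setup = f"s = set(range({size}))"
    hit = timeit.timeit("123 in s", setup=setup, number=1_000_000)   # present element
    miss = timeit.timeit("-1 in s", setup=setup, number=1_000_000)   # absent element
    print(f"size={size:>9,}  hit={hit:.3f}s  miss={miss:.3f}s")
```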

If you are so inclined you can even browse the CPython source code for set which, according to Achim Domma, was originally mostly a cut-and-paste from the dict implementation.

Note: Nowadays, set's and dict's implementations have diverged significantly, so the precise behaviors (e.g. arbitrary order vs. insertion order) and performance in various use cases differ; they're still implemented in terms of hash tables, so average-case lookup and insertion remain O(1), but set is no longer just "dict, but with dummy/omitted keys".

ShadowRanger
  • 143,180
  • 12
  • 188
  • 271
Justin Ethier
  • 131,333
  • 52
  • 229
  • 284
  • 27
    IIRC, the original `set` implementation actually *was* `dict` with dummy values, and it got optimized later. – dan04 Oct 16 '10 at 16:52
  • 2
    Isn't big O the worst case scenario? If you can find an instance where the time is O(n) then it is O(n)... I don't understand anything right now from all those tutorials. – Claudiu Creanga Sep 21 '16 at 21:13
  • 5
    No, average case is O(1) but worst case is O(N) for hash table lookup. – Justin Ethier Sep 22 '16 at 12:10
  • 10
    @ClaudiuCreanga this is an old comment, but just to clarify: big-O notation tells you upper bounds on the growth rate of things, but you can upper bound the growth of average case performance and you can separately upper bound the growth of worst case performance. – Kirk Boyer Jul 30 '18 at 20:20
  • @dan04 so, what is it now? see https://stackoverflow.com/questions/64533262/why-the-size-of-set-can-be-bigger-than-dict-in-python/64533444#64533444 – Gulzar Oct 26 '20 at 08:11
  • @Gulzar: See `setobject.c` for implementation details. It's the same hash table structure as `dict`, but omits the code for dealing with associated values, as `set` only has keys. – dan04 Oct 26 '20 at 15:22
  • @dan04 That is not correct, as is evident from the fact that [dict preserves insertion order](https://stackoverflow.com/a/39980744/913098) while set doesn't: `set([1, 2, 3])` is the same as `set([3, 2, 1])`. – Gulzar Sep 01 '21 at 12:37
  • @Gulzar: That change was only made in Python 3.6. Sets were originally introduced in 2.3. – dan04 Sep 01 '21 at 15:24
  • 2
    I edited to clarify that `set` and `dict` no longer use much in the way of common code. Just to elaborate, `dict` is now insertion-ordered and optimizes to minimize overallocation (assumes small `dict`s that don't grow much/at all post-construction), support repeated lookup from a fixed set of keys, and assumes few `dict`s ever remove keys, all of which improve performance when it's used to implement attributes (for all user-defined classes and most user-defined class instances); `set` overallocates more and handles item removal more efficiently, optimizing for deduplication and live updates. – ShadowRanger Dec 27 '21 at 19:32
  • Why are elements in a `set` always in increasing order? To achieve that, one generally uses a BST as the underlying implementation. How could we get an order with a hash table? In C++ for example, `unordered_set` is implemented using a hash table, and `set` is implemented using a BST. – ThunderPhoenix Jan 11 '22 at 08:30
  • 5
    @ThunderPhoenix: They're not always in increasing order, but for some types (e.g. `int`) the hash codes are predictable and you'll see increasing order in many simple test cases. In addition, some common tooling (e.g. IPython) sorts `set`s for display rather than displaying the raw iteration order. Python's `set` is similar to C++'s `unordered_set`, not C++'s `set`. If you want a reliable demonstration of this, run `print(set(range(-5, 5)))`. Then for fun, run `print({-1, *range(-5, 5)})` and note the order of `-1` and `-2` changes (on CPython, they have the same hash due to API constraints). – ShadowRanger Jan 13 '22 at 20:05
  • @ShadowRanger any hints as to what the "optimizations for deduplication and live updates" look like, or how item removal is made more efficient? Is there a particular reason why the sparse index + dense table optimization for dicts wouldn't also apply to sets? – Karl Knechtel Sep 23 '22 at 06:47
  • 1
    @KarlKnechtel: Example: A compromise they made to make `dict`s *always* insertion-ordered meant that they could not reclaim space in the dense table when keys are deleted, so repeatedly adding and deleting keys makes it continually expand (until it reaches a certain percentage of empty slots at which point it recompacts the whole `dict`). `set`s use dummy entries for cleared buckets (to ensure open addressing chaining isn't broken) but they can put real entries in said buckets when one arrives; balanced addition and removal from a `set` doesn't require rehashing the whole thing IIRC. – ShadowRanger Sep 23 '22 at 11:14
  • 1
    @KarlKnechtel: Also, the `dict` optimization to use sparse index + dense table in the first place offers the most *because* most `dict`s are small (the sparse index uses `uint8`s as long as it can get away with it, then `uint16`s, etc.); the memory savings reduce as the number of entries increases. Since `dict`s are used under-the-hood to store instance attributes (usually), globals, class contents, etc., there are a *lot* of small `dict`s in Python, while `set`s are assumed to grow larger regularly (deduping large inputs), and therefore any memory savings would be seen less often. – ShadowRanger Sep 23 '22 at 11:20
94

When people say sets have O(1) membership-checking, they are talking about the average case. In the worst case (when all hashed values collide) membership-checking is O(n). See the Python wiki on time complexity.

The Wikipedia article says the best case time complexity for a hash table that does not resize is O(1 + k/n). This result does not directly apply to Python sets since Python sets use a hash table that resizes.

A little further on the Wikipedia article says that for the average case, and assuming a simple uniform hashing function, the time complexity is O(1/(1-k/n)), where k/n can be bounded by a constant c<1.

Big-O refers only to asymptotic behavior as n → ∞. Since k/n can be bounded by a constant c < 1, independent of n,

O(1/(1-k/n)) is no bigger than O(1/(1-c)), which is equivalent to O(constant) = O(1).

So assuming uniform simple hashing, on average, membership-checking for Python sets is O(1).
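To make the bound concrete, here is a small illustrative calculation; because CPython resizes its hash tables before they fill up, the load factor stays below some constant, and the expected number of probes 1/(1-c) is then a constant regardless of how many elements are stored. The load factors below are example values, not the exact thresholds CPython uses:

```python
# Expected probes per lookup ~ 1/(1 - c) for load factor c < 1.
# The values of c are illustrative; resizing keeps c bounded by a constant.
for c in (0.25, 0.5, 0.6, 0.75):
    print(f"load factor {c:.2f} -> about {1 / (1 - c):.1f} probes per lookup")
```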

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
15

I think it's a common mistake: set lookup (or hash table lookup, for that matter) is not O(1).
From Wikipedia:

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.

Related: Is a Java hashmap really O(1)?
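A quick sketch of the degenerate case being described: if every element hashes to the same value, all entries collide and membership tests degrade toward linear scans. The class below is deliberately pathological and only meant to illustrate the point:

```python
# Deliberately pathological key type: every instance hashes to the same value,
# so all entries collide and membership tests degrade toward O(n).
import timeit

class BadHash:
    def __init__(self, x):
        self.x = x
    def __hash__(self):
        return 42                      # constant hash -> everything collides
    def __eq__(self, other):
        return isinstance(other, BadHash) and self.x == other.x

for n in (100, 500, 2_000):
    s = {BadHash(i) for i in range(n)}
    probe = BadHash(-1)                # not in the set -> the whole chain is scanned
    t = timeit.timeit(lambda: probe in s, number=1_000)
    print(f"n={n:>5}: {t:.4f}s for 1000 misses")
```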

Community
  • 1
  • 1
Shay Erlichmen
  • 31,691
  • 7
  • 68
  • 87
  • 4
    But they do take constant time to look up items: python -m timeit -s "s = set(range(10))" "5 in s" 10000000 loops, best of 3: 0.0642 usec per loop <--> python -m timeit -s "s = set(range(10000000))" "5 in s" 10000000 loops, best of 3: 0.0634 usec per loop ... and that's the largest set that doesn't throw MemoryErrors – Jochen Ritzel Oct 16 '10 at 15:06
  • 2
    @THC4k All you proved is that looking up X is done in constant time, but it doesn't mean that looking up X+Y will take the same amount of time, which is what O(1) is all about. – Shay Erlichmen Oct 16 '10 at 15:15
  • I thought it meant that the time to do a lookup was independent of the number of stored values. – intuited Oct 16 '10 at 15:49
  • 3
    @intuited: It does, but the test run above doesn't prove that you can look up "5" in the same time you can look up "485398", or some other number that might be in a horrible collision space. It's not about looking up the same element in a differently-sized hash in the same time (in fact, that's not required at all), but rather it's about whether you can access each entry in the same amount of time in the current table - something which is basically impossible for hash tables to accomplish since there will generally always be collisions. – Nick Bastin Oct 16 '10 at 16:08
  • 3
    In other words, the time to do a lookup depends on the number of stored values, because that increases the likelihood of collisions. – intuited Oct 16 '10 at 23:54
  • 4
    @intuited: no, that's incorrect. When the number of stored values increase, Python will automatically increase the size of the hashtable, and the rate of collision stays roughly constant. Assuming an evenly distributed O(1) hash algorithm, then hashtable lookup is *amortized* O(1). You might want to watch the video presentation "The Mighty Dictionary" http://python.mirocommunity.org/video/1591/pycon-2010-the-mighty-dictiona – Lie Ryan Oct 17 '10 at 01:09
  • @Lie Ryan: Yes, it depends on whether we're talking about the context of the original question (which asks about the CPython implementation) or the context of this answer, which is about "the simplest model" where "the hash does not resize". Though it's arguable that after increasing the size of the hashtable it's no longer the same hashtable, so I might get off on a technicality anyway in that case :) – intuited Oct 17 '10 at 02:13
15

We all have easy access to the source, where the comment preceding set_lookkey() says:

/* set object implementation
 Written and maintained by Raymond D. Hettinger <python@rcn.com>
 Derived from Lib/sets.py and Objects/dictobject.c.
 The basic lookup function used by all operations.
 This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
 The initial probe index is computed as hash mod the table size.
 Subsequent probe indices are computed as explained in Objects/dictobject.c.
 To improve cache locality, each probe inspects a series of consecutive
 nearby entries before moving on to probes elsewhere in memory.  This leaves
 us with a hybrid of linear probing and open addressing.  The linear probing
 reduces the cost of hash collisions because consecutive memory accesses
 tend to be much cheaper than scattered probes.  After LINEAR_PROBES steps,
 we then use open addressing with the upper bits from the hash value.  This
 helps break-up long chains of collisions.
 All arithmetic on hash should ignore overflow.
 Unlike the dictionary implementation, the lookkey function can return
 NULL if the rich comparison returns an error.
*/


...
#ifndef LINEAR_PROBES
#define LINEAR_PROBES 9
#endif

/* This must be >= 1 */
#define PERTURB_SHIFT 5

static setentry *
set_lookkey(PySetObject *so, PyObject *key, Py_hash_t hash)  
{
...
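To make the probing strategy described in that comment a bit more concrete, here is a simplified Python sketch of a perturb-based probe sequence. This is not CPython's actual code: the real set_lookkey also scans LINEAR_PROBES consecutive slots at each step and handles equality checks, dummy entries, and resizing.

```python
# Simplified sketch of open-addressing probing in the spirit of the comment
# above; NOT the real CPython code, just the shape of the probe loop.
PERTURB_SHIFT = 5

def probe_sequence(hash_value, mask, max_probes=10):
    """Yield the first few table indices that would be inspected for a key."""
    perturb = hash_value
    i = hash_value & mask              # initial probe: hash mod table size
    for _ in range(max_probes):
        yield i
        # Mix in higher bits of the hash so long collision chains break up.
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask

# Example: the slots probed for hash 123456 in a 32-slot table.
print(list(probe_sequence(123456, mask=31)))
```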
Community
  • 1
  • 1
gimel
  • 83,368
  • 10
  • 76
  • 104
  • 2
    This answer would benefit from C [syntax highlighting](https://meta.stackexchange.com/questions/184108/what-is-syntax-highlighting-and-how-does-it-work). Python syntax highlighting of the comment looks really bad. – user202729 Jul 28 '18 at 08:47
  • Regarding the comment "This leaves us with a hybrid of linear probing and open addressing", isn't linear probing a kind of collision resolution in open addressing, as described in https://en.wikipedia.org/wiki/Open_addressing? Therefore, linear probing is a subtype of open addressing and the comment makes no sense. – Alan Evangelista Jan 16 '20 at 00:54
5

To emphasize the difference between sets and dicts a little more, here is an excerpt from the setobject.c comments, which clarifies the main difference between sets and dicts.

Use cases for sets differ considerably from dictionaries where looked-up keys are more likely to be present. In contrast, sets are primarily about membership testing where the presence of an element is not known in advance. Accordingly, the set implementation needs to optimize for both the found and not-found case.

source on github

user1767754
  • 23,311
  • 18
  • 141
  • 164
5

Sets in Python use a hash table internally. Let us first talk about the hash table. Suppose there are some elements that you want to store in a hash table, and the hash table has 31 places where you can put them. Let the elements be: 2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31.

When you want to use a hash table, you first determine the indices at which these elements will be stored. The modulo operation is a popular way of determining these indices, so let us say we take one element at a time, multiply it by 100, and apply modulo 31. It is important that each such operation on an element results in a unique number, since an entry in a hash table can store only one element unless chaining is allowed. In this way, each element is stored at the location given by the index obtained through the modulo operation.

Now, if you want to search for an element in a set that stores its elements using this hash table, you obtain the element in O(1) time, because the index of the element is computed with the modulo operation in constant time. To expound on the modulo operation, let me also write some code:

piles = [2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31]

def hash_function(x):
    return int(x*100 % 31)

[hash_function(pile) for pile in piles]

Output: [4, 17, 8, 0, 16, 11, 10, 20, 21, 18]
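Continuing the toy example as a rough sketch (and ignoring collisions, which a real hash table must handle by probing or chaining), the indices above can be used to place each value into a 31-slot table and to find it again with a single index computation, which is where the constant-time lookup comes from:

```python
# Toy 31-slot table built from the indices computed above.
# A real hash table must also handle collisions and resizing.
piles = [2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31]

def hash_function(x):
    return int(x * 100 % 31)

table = [None] * 31
for pile in piles:
    table[hash_function(pile)] = pile     # store each value at its slot

def contains(value):
    # One index computation + one comparison: constant time on average.
    return table[hash_function(value)] == value

print(contains(25.58))   # True
print(contains(11.11))   # False
```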