20

In Python you can get the intersection of two sets like this:

>>> s1 = {1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> s2 = {0, 3, 5, 6, 10}
>>> s1 & s2
{3, 5, 6}
>>> s1.intersection(s2)
{3, 5, 6}

Does anybody know the complexity of this intersection (`&`) algorithm?

EDIT: In addition, does anyone know what data structure is behind a Python set?

juliomalegria

3 Answers

30

The data structure behind a set is a hash table, where typical performance is amortized O(1) lookup and insertion.

The intersection algorithm loops exactly min(len(s1), len(s2)) times. It performs one lookup per iteration and, if there is a match, one insertion. In pure Python, it looks like this:

    def intersection(self, other):
        # Iterate over the smaller set and probe the larger one,
        # so the loop runs min(len(self), len(other)) times.
        if len(self) <= len(other):
            little, big = self, other
        else:
            little, big = other, self
        result = set()
        for elem in little:
            if elem in big:       # amortized O(1) hash-table lookup
                result.add(elem)
        return result
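As a quick sanity check, the sketch can be dropped into a `set` subclass (`MySet` below is a made-up name, used only for this demonstration) and compared against the built-in operator:

```python
class MySet(set):
    """Hypothetical subclass used only to exercise the pure-Python sketch."""
    def intersection(self, other):
        # Iterate over the smaller set, probe the larger one.
        if len(self) <= len(other):
            little, big = self, other
        else:
            little, big = other, self
        result = set()
        for elem in little:
            if elem in big:
                result.add(elem)
        return result

s1 = MySet({1, 2, 3, 4, 5, 6, 7, 8, 9})
s2 = MySet({0, 3, 5, 6, 10})
print(sorted(s1.intersection(s2)))  # [3, 5, 6]
print(sorted(s1 & s2))              # [3, 5, 6] -- matches the built-in operator
```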
Raymond Hettinger
  • According to the wiki I linked above, the worst case for `elem in big` in your code is O(n) (although the average is of course O(1)). That's the basis for the intersection worst case of O(len(s)*len(t)). Any idea why? – Kurt McKee Nov 12 '11 at 04:33
  • 14
    The "worst case" assumes data that is inappropriate for use in the hash table used by *dict* and *set*. The data would have to be something such that every datum had exactly the same hash value -- this would force the hash table to do something akin to a linear search to do the `__contains__` check. IOW, I wouldn't worry about this at all. Set intersection is blindingly fast -- it even reuses the internally stored hash values so it won't need to make any calls to *hash()*. – Raymond Hettinger Nov 12 '11 at 04:47
  • Link to 3.x code: [here](https://github.com/python/cpython/blob/3.9/Objects/setobject.c#L1181) it is for 3.9. – user650654 Feb 02 '21 at 19:32
  • The intersection algorithm does not "always run" in the stated complexity, as acknowledged in the comment by @RaymondHettinger above. In practice it may not matter much, but note that this answer is incorrect in theory. – apeman Sep 01 '22 at 13:33
  • @lusil That is profoundly misleading. It is akin to saying that gamblers could consistently win every spin of a roulette wheel. It is true that hash tables are a probabilistic algorithm; however, the odds are very heavily stacked in favor of the user. Given a large table and a randomized hash function, the chances of the intersection algorithm "losing" are infinitesimally low and likely to never occur. – Raymond Hettinger Sep 01 '22 at 14:53
  • I like the current wording much better! – apeman Sep 01 '22 at 18:14
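Raymond's point about pathological data can be reproduced directly. `BadHash` below is a made-up class whose instances all hash to the same value, forcing every probe into the linear-search behavior he describes (absolute timings are machine-dependent; only the ratio matters):

```python
import time

class BadHash:
    """All instances hash to the same bucket: the hash-table worst case."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 42                     # constant hash -> every probe collides
    def __eq__(self, other):
        return isinstance(other, BadHash) and self.value == other.value

n = 1000
good1, good2 = set(range(n)), set(range(n // 2, n + n // 2))
bad1 = {BadHash(i) for i in range(n)}
bad2 = {BadHash(i) for i in range(n // 2, n + n // 2)}

t0 = time.perf_counter(); good1 & good2; t_good = time.perf_counter() - t0
t0 = time.perf_counter(); bad1 & bad2; t_bad = time.perf_counter() - t0
print(t_bad > t_good)  # colliding hashes make the same-size job far slower
```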
20

The answer appears to be a search engine query away. You can also use this direct link to the Time Complexity page at python.org. Quick summary:

Average:     O(min(len(s), len(t)))
Worst case:  O(len(s) * len(t))

EDIT: As Raymond points out below, the "worst case" scenario isn't likely to occur. I included it originally to be thorough, and I'm leaving it to provide context for the discussion below, but I think Raymond's right.
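The average case is easy to observe: because CPython always iterates over the smaller operand, swapping the operand order barely changes the timing, and the cost tracks the smaller set rather than the larger one (a rough sketch; absolute numbers are machine-dependent):

```python
import timeit

small = set(range(100))
big = set(range(1_000_000))

# The intersection can only contain elements of the smaller set.
assert (small & big) == small

# Both orders iterate the 100-element set and probe the big one,
# so both cost roughly O(len(small)) per call.
t1 = timeit.timeit(lambda: small & big, number=10_000)
t2 = timeit.timeit(lambda: big & small, number=10_000)
print(f"small & big: {t1:.4f}s, big & small: {t2:.4f}s")  # comparable
```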

Kurt McKee
  • 2
    that's a nasty worst case, isn't it? – juliomalegria Nov 12 '11 at 04:19
  • I was surprised by it as well! Maybe it's an issue of having different data types mixed in the two sets being intersected? – Kurt McKee Nov 12 '11 at 04:22
  • 1
    It doesn't look like it uses a sort first (which *requires the objects have an ordering*), but rather just does a "hash probe": perhaps for a better `C` and Average (and *no ordering requirement*). The maximum required complexity, AFAIK, is about `O(n log n) + O(n)`, with a sort. However, Big-O is an upper-bounds and there are practical considerations so... –  Nov 12 '11 at 04:26
  • 1
    I think the major issue here is that the set is an unordered collection. In C++, you can make an intersection (with two sorted lists) in 2*(L1+L2)-1. That's a damn good complexity! http://cplusplus.com/reference/algorithm/set_intersection/ – juliomalegria Nov 12 '11 at 04:27
  • Nice! I really appreciate the insight; hash collisions make bunches of sense. +1's all around! – Kurt McKee Nov 12 '11 at 04:42
  • 4
    This answer is somewhat misleading with respect to the "worst case" time. Don't let it lead you away from a perfectly fine algorithm. – Raymond Hettinger Nov 12 '11 at 05:09
  • I agree with you; fixed. Again, I appreciate the discussion! – Kurt McKee Nov 12 '11 at 05:23
  • Please don't say "google it" in these answers. These answers often become the results of googling something. This answer was the first search result for me. – user124384 Sep 15 '18 at 19:13
  • Likelihood of something happening is irrelevant when discussing the "worst case". The worst case is the worst case as long as there is a nonzero chance of it happening. However, if Python is using a hash table, that O(n^2) complexity should not happen ever. Therefore, it's not even the worst case at all. – user124384 Sep 15 '18 at 19:17
  • 1
    @user124384 It's funny how the first search result is this thread from that hyperlink – THIS USER NEEDS HELP Aug 01 '19 at 05:35
2

Set intersection of two sets of sizes m and n can be achieved in O(max{m,n} * log(min{m,n})) in the following way. Assume m << n:

1. Represent the two sets as lists/arrays (anything sortable)
2. Sort the **smaller** list/array (cost: m*log m)
3. Repeat until every element of the bigger list has been checked:
    3.1 Sort the next **m** items of the bigger list (cost: m*log m)
    3.2 With a single pass, compare the smaller list against the m items you just sorted and keep the elements that appear in both (cost: m)
4. Return the new set

The loop in step 3 runs for n/m iterations, and each iteration takes O(m*log m), so the total time complexity is O(n*log m) for m << n.

I think that's the best comparison-based (non-hashing) bound that exists.
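A minimal sketch of the steps above (the function name `sorted_chunk_intersection` is my own): sort the smaller input once, then sort the larger input m items at a time and merge each chunk against the smaller sorted list.

```python
def sorted_chunk_intersection(a, b):
    # Steps 1-2: sort the smaller input (cost: m*log m)
    small, big = (a, b) if len(a) <= len(b) else (b, a)
    small = sorted(small)
    m = max(len(small), 1)
    big = list(big)
    result = set()
    # Step 3: process the bigger input m items at a time
    for start in range(0, len(big), m):
        chunk = sorted(big[start:start + m])   # cost: m*log m per chunk
        # Step 3.2: single merge-style pass over two sorted lists
        i = j = 0
        while i < len(small) and j < len(chunk):
            if small[i] == chunk[j]:
                result.add(small[i])
                i += 1
                j += 1
            elif small[i] < chunk[j]:
                i += 1
            else:
                j += 1
    return result

print(sorted(sorted_chunk_intersection({1, 2, 3, 4, 5, 6, 7, 8, 9},
                                       {0, 3, 5, 6, 10})))  # [3, 5, 6]
```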

Elad Yehezkel