Average complexity of multiple-sets intersection operation and its under the hood implementation

Question

There is no mention of average-case complexity for the multiple-sets intersection on Python wiki:

https://wiki.python.org/moin/TimeComplexity

Only the worst-case complexity is given:

(n-1)*O(l) where l is max(len(s1),..,len(sn))

What is the average complexity of multiple-sets intersection operation? How is this operation implemented under the hood?

set.intersection(s1,s2,s2,s4 ...sn)

Is the multiple-sets intersection operation implemented in a different way than the two-sets intersection operation because their worst case complexities are different according to python-wiki:

2-sets intersection: O(len(s) * len(t)) Multiple-sets intersection: (n-1)*O(l) where l is max(len(s1),..,len(sn))

So the complexity of two sets using multiple-sets formula should be:

--> (2-1)*O(l) where l is max(len(s1), len(s2)`
--> O(max(len(s1), len(s2))

I think it is pretty different than the complexity notation of two set intersection operation.

On a side note, is there a better way than set intersection for membership check between different sets?

NOTE: I am looking for an explanation rather than just the complexity O() notation :)

Graipher · Accepted Answer · 2018-03-14T11:12:04.517

2

As already answered in a similar question, the implementation of the intersection of two sets is analogous to:

def intersect(a, b):
    if len(a) > len(b):
        a, b = b, a

    c = set()
    for x in a:
        if x in b:
            c.add(x)
    return c

For multiple sets it is implemented as a chain of pairwise intersections roughly equivalent to:

def intersect_multi(a, *others):
    result = a.copy()
    for other in others:
        newresult = result.intersect(other)
        if not newresult:
            return set()
    result = newresult

The average complexity is probably not given for this because it depends on whether or not this returns before going through all others, due to the intersection being empty. It can therefore be anywhere between O(k), with k being the length of the first set in others and the worst case.

The worst case complexity for this is then (N-1) * max(O(set_intersection)). O(set_intersection) is usually O(min(k, l)) as you noted, but O(max(k, l)) if the second is not a set. I guess this is included here, so it is basically determined by the longest set.

The worst case for O(set_intersection) stated in the wiki is very unlikely to occur, as noted on this post by Raymond Hettinger. Apparently it only occurs in case where you have a hash collisions every time, so if x in b becomes O(n) (its worst-case complexity).

It seems like this worst-case is not included in the worst-case complexity of multiple set intersections (maybe not because of how highly unlikely it is to have a hash collision for all members of all sets?).

edited Mar 14 '18 at 11:12

answered Mar 14 '18 at 10:45

Graipher

6,891
27
47

Doesn't answer the core question. It's easy to infer the answer if you know the time complexity of each operation, but this doesn't answer what's the time complexity of set intersection in Python. – Alex Huszagh Mar 14 '18 at 10:51
@AlexanderHuszagh It does answer the question "How is this operation implemented under the hood?". If you have any more insight on the other question asked, please go ahead and post it as answer. – Graipher Mar 14 '18 at 10:52
There's more than one question and the core question is: `What is the average complexity of multiple-sets intersection operation?` You allude to it directly in your answer, and you can infer it, but you do not answer that question. – Alex Huszagh Mar 14 '18 at 10:53
@AlexanderHuszagh Added some discussion of the time complexities involved. – Graipher Mar 14 '18 at 11:14
@Graipher the worst-case is included in python-wiki, the average one isn't. – utengr Mar 14 '18 at 11:29
1

@utengr I did comment on why the average one is not included for `intersect_multi`. What I meant with that last sentence is that it looks to me like the worst-case complexity of the normal `intersect` is not included in the worst-case scenario of `intersect_multi`. – Graipher Mar 14 '18 at 11:44

Mazdak · Answer 2 · 2018-03-15T08:21:55.853

The underlying C implementation of this method within CPython's source code that is responsible for multiple set intersections is called set_intersection_multi. Here's the code:

set_intersection_multi(PySetObject *so, PyObject *args)
{
    Py_ssize_t i;
    PyObject *result = (PyObject *)so;

    if (PyTuple_GET_SIZE(args) == 0)
        return set_copy(so);

    Py_INCREF(so);
    for (i=0 ; i<PyTuple_GET_SIZE(args) ; i++) {
        PyObject *other = PyTuple_GET_ITEM(args, i);
        PyObject *newresult = set_intersection((PySetObject *)result, other);
        if (newresult == NULL) {
            Py_DECREF(result);
            return NULL;
        }
        Py_DECREF(result);
        result = newresult;
    }
    return result;
}

As you can see it's looping over the arguments passed to the caller (python objects) and tries to compute the intersection of the intended set with all the other passed objects.

What is mentioned in Python's Wiki regard the worst case is completely sensible here. Due to the fact that complexity of the intersection between two sets s and t is O(len(s) * len(t)), the worst case in creating the intersection of multiple sets (s1&s2&..&sn) happens when all the sets are valid and contain items and loop executes N - 1 times^*.

This means that it's performing n-1 single intersections between all the sets which in computing the Big O notation we should only consider the maximum length. Thus, its (n-1)*O(l) where l is max(len(s1),..,len(sn)).

Also, if you want to get a better understanding of how on earth the complexity of intersection between two sets or a set and another iterable(s) (because you can do something like set(x).intersection(list(y))) is O(len(s) * len(t)), I'd strongly suggest you to take a close look at source code of set_intersection function.

_{The first argument is copied to the PyObject *result before the loop.}

Average complexity of multiple-sets intersection operation and its under the hood implementation

2 Answers2