
Let's say you have two sets: set1 is very large (a couple million values) and set2 is relatively small (a couple hundred thousand values). If I wanted to get the intersection of these two sets using the .intersection() method, would there be a runtime difference based on the order of the inputs?

For example, would one of these run faster than the other?

set1.intersection(set2)
set2.intersection(set1)
DJ Wolfson
  • Supposing the order *does* matter, wouldn't you think that the runtime code that implements the intersection operation would look at the sizes of the two sets and do the right thing? So I would be very surprised to find a benchmark revealing a difference. – Booboo Jul 10 '20 at 17:26
  • If you have the data you could perform a test. If not, you could manufacture some fake test data. – C. Pappy Jul 10 '20 at 17:28
  • If you assume that a set lookup is O(1), then it seems you want to take the set with the fewer number of items and test if they are in the set with the larger number of items. Then the running time is O(n), where n is the size of the smaller set. – Booboo Jul 10 '20 at 17:34
  • See [Computing set intersection in linear time?](https://stackoverflow.com/questions/4642172/computing-set-intersection-in-linear-time) for an interesting discussion of the complexity of set intersection algorithms. – C. Pappy Jul 10 '20 at 21:42

1 Answer


No, the input order does not matter. In CPython (the reference Python implementation), set intersection is handled by the set_intersection function in Objects/setobject.c. When the other argument is also a set, the function swaps the two sets if necessary so that the smaller one is iterated over while the larger one is used for (average constant-time) membership lookups, as Booboo described:

        if (PySet_GET_SIZE(other) > PySet_GET_SIZE(so)) {
            tmp = (PyObject *)so;
            so = (PySetObject *)other;
            other = tmp;
        }

        while (set_next((PySetObject *)other, &pos, &entry)) {
            key = entry->key;
            hash = entry->hash;
            rv = set_contains_entry(so, key, hash);
            if (rv < 0) {
                Py_DECREF(result);
                return NULL;
            }
            if (rv) {
                if (set_add_entry(result, key, hash)) {
                    Py_DECREF(result);
                    return NULL;
                }
            }
        }
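
In rough Python terms, the C code above does something like the following sketch (the name intersection_like is illustrative only; CPython also reuses each entry's stored hash rather than rehashing the key):

def intersection_like(so, other):
    # Mirror the size check in the C code: make 'other' the smaller set.
    if len(other) > len(so):
        so, other = other, so
    result = set()
    for key in other:      # iterate the smaller set
        if key in so:      # average O(1) probe of the larger set
            result.add(key)
    return result

Whichever set the method is called on, the same (smaller) set ends up being iterated, so both call orders do the same amount of work.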

Thus, when set1 and set2 are both sets, set1.intersection(set2) and set2.intersection(set1) will have the same performance. A small empirical test with timeit agrees:

import random
import string
import timeit

# Build a set of 1,000,000 unique random 6-character strings.
big_set = set()
while len(big_set) < 1000000:
    big_set.add(''.join(random.choices(string.ascii_letters, k=6)))

# Build a much smaller set of 10,000 unique random 6-character strings.
small_set = set()
while len(small_set) < 10000:
    small_set.add(''.join(random.choices(string.ascii_letters, k=6)))

# Best of 31 repeats, each timing 500 intersection calls.
print("Timing...")
print(f"big_set.intersection(small_set): {min(timeit.Timer(lambda: big_set.intersection(small_set)).repeat(31, 500))}")
print(f"small_set.intersection(big_set): {min(timeit.Timer(lambda: small_set.intersection(big_set)).repeat(31, 500))}")
xavc