11

The question explains it, but what is the time complexity of the set difference operation in Python?

EX:

A = set([...])
B = set([...])

print(A.difference(B)) # What is the time complexity of the difference function? 

My intuition tells me O(n) because we can iterate through set A and for each element, see if it's contained in set B in constant time (with a hash function).

Am I right?

(Here is the answer that I came across: https://wiki.python.org/moin/TimeComplexity)

Jean-François Fabre
  • 137,073
  • 23
  • 153
  • 219
Q.H.
  • 1,406
  • 2
  • 17
  • 33
  • @Jean-FrançoisFabre isn't a set in Python a hashtable with no values?https://stackoverflow.com/questions/3949310/how-is-set-implemented – Q.H. Dec 31 '17 at 17:07
  • yes, `O(1)` most of the time, but https://stackoverflow.com/questions/7351459/time-complexity-of-python-set-operations can be up to `O(n)` if too many collisions. also the hashing of the elements can be an issue like if they're strings. – Jean-François Fabre Dec 31 '17 at 17:08
  • I have not checked the implementation so I won't post this as an answer, but there is no doubt that the time complexity of `A.difference(B)` is `O(cardinality of set A)`, with the caveat of possible hash collisions on certain pathological datasets. – Reblochon Masque Dec 31 '17 at 17:09
  • @ReblochonMasque It should rather be `O(cardinality of iterable B)` because `A` is the one guaranteed to be a set so it should used for the contains-check while `B` is iterated. – user2390182 Dec 31 '17 at 17:12
  • questions about time complexity in Python in general are not easily answerable, python has many implementations, some might have O(1) lookup in a set, some might have set implemented as a sorted list meaning O(log(n)) (and they can still be faster than the O(1) implementations in practice) – Derte Trdelnik Dec 31 '17 at 17:12

2 Answers2

16

looks that you're right, difference is performed with O(n) complexity in the best cases

But keep in mind that in worst cases (maximizing collisions with hashes) it can raise to O(n**2) (since lookup worst case is O(n): How is set() implemented?, but it seems that you can generally rely on O(1))

As an aside, speed depends on the type of object in the set. Integers hash well (roughly as themselves, with probably some modulo), whereas strings need more CPU.

Jean-François Fabre
  • 137,073
  • 23
  • 153
  • 219
3

https://wiki.python.org/moin/TimeComplexity suggests that its O(cardinality of set A) in the example you described.

My understanding that its O(len(A)) and not O(len(B)) because, you only need to check if each element in setA is present in setB. Each lookup in setB is O(1), hence you will be doing len(A) * O(1) lookups on setB. Since O(1) is constant, then its O(len(A))

Eg:

A = {1,2,3,4,5}
B = {3,5,7,8,9,10,11,12,13}
A-B = {1,2,4}

When A-B is called, iterate through every element in A (only 5 elements), and check for membership in B. If not found in B, then it will be present in the result.

Note: Of course all this is amortised complexity. In practice, each lookup in setB could be more than O(1).