
Say, I wish to compute the difference of two lists C = A - B:

A = [1,2,3,4,5,6,7,8,9] 
B = [1,3,5,8,9]
C = [2,4,6,7]          #Result

A and B are both sorted with unique integers (not sure if there is a way to tell Python about this property of the list). I need to preserve the order of the elements. AFAIK there are two possible ways of doing it

Method 1: Convert B into a set and use list comprehension to generate C:

s = set(B)
C = [x for x in A if x not in s]

Method 2: Directly use list comprehension:

C = [x for x in A if x not in B]

Why is #1 more efficient than #2? Isn't there an overhead to convert to a set? What am I missing here?

Some performance benchmarks are given in this answer.

UPDATE: I'm aware that a set's average O(1) lookup time beats that of a list's O(n) but if the original list A contains about a million or so integers, wouldn't the set creation actually take longer?
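A quick way to see the difference on your own machine is a `timeit` micro-benchmark. The sizes below are illustrative assumptions (not the million-element case from the update), but both lists hold unique sorted integers as in the original problem:

```python
import random
import timeit

# Illustrative sizes (assumptions, not from the question).
A = list(range(10_000))
B = sorted(random.sample(A, 5_000))

def method1():            # convert B to a set first
    s = set(B)
    return [x for x in A if x not in s]

def method2():            # membership test against the raw list
    return [x for x in A if x not in B]

assert method1() == method2()   # same result either way

t1 = timeit.timeit(method1, number=3)
t2 = timeit.timeit(method2, number=3)
print(f"set: {t1:.4f}s  list: {t2:.4f}s")   # the list version is far slower
```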

PhD
  • Don't know Python, but I would hope sets are much better at checking membership. (Lists, being inherently ordered, would require a linear search. Sets typically don't.) – cHao Aug 13 '14 at 19:51
  • I was hoping that a list search wouldn't be `O(n)` and perhaps `O(log n)` if sorted. But maybe that's not the case – PhD Aug 13 '14 at 19:55
  • If that were the case, a list would have to keep track of whether it is sorted. That would slow things down in the rather-more-common cases where you don't care. (And even where you do care, it's typically because you're using the wrong data structure for the job.) – cHao Aug 13 '14 at 20:01
  • And what about the complexity/O-performance for conversion of a list to a set (ie, enforcing unique values)? I'm assuming O(n) since each item in the list has to be traversed/seen. – mecampbellsoup Dec 28 '19 at 18:38

3 Answers


There is overhead in converting a list to a set, but a set is substantially faster than a list for those `in` tests.

You can instantly see whether item x is in set y because there's a hash table being used underneath. No matter how large your set is, the lookup time is the same (basically instantaneous) - this is known in Big-O notation as O(1). For a list, you have to individually check every element to see if item x is in list z. As your list grows, the check takes longer - this is O(n), meaning the time taken grows in direct proportion to the length of the list.

That increased speed can offset the set creation overhead, which is how your set check ends up being faster.

EDIT: to answer that other question, Python has no way of determining that your list is sorted - not if you're using a standard list object, anyway. So it can't achieve O(log n) performance with a list comprehension. If you wanted to write your own binary search method which assumes the list is sorted, you can certainly do so, but O(1) beats O(log n) any day.
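To make that concrete, here is a sketch of such a binary-search membership test using the standard `bisect` module (the helper name `in_sorted` is mine, not part of any library). It assumes the list is already sorted ascending, and costs O(log n) per lookup instead of the set's O(1):

```python
from bisect import bisect_left

def in_sorted(sorted_list, x):
    """O(log n) membership test; assumes sorted_list is sorted ascending."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x

B = [1, 3, 5, 8, 9]
C = [x for x in [1, 2, 3, 4, 5, 6, 7, 8, 9] if not in_sorted(B, x)]
print(C)  # → [2, 4, 6, 7]
```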

EDIT 2:

I'm aware that a set's average O(1) lookup time beats that of a list's O(n) but if the original list A contains about a million or so integers, wouldn't the set creation actually take longer?

No, not at all. Creating a set out of a list is an O(n) operation, since inserting an item into a set is O(1) and you do it n times. If you have a list with a million integers in it, the set-based approach involves two O(n) steps (building the set, then scanning A once), while repeatedly scanning the list amounts to n O(n) steps. In practice, the set-based approach is going to be about 250,000 times faster for a list with a million integers, and the speed difference will grow larger and larger the more items you have in your list.

TheSoundDefense
  • And what about the complexity/O-performance for _conversion_ of a list to a set (ie, enforcing unique values)? I'm assuming O(n) since each item in the list has to be traversed/seen. – mecampbellsoup Dec 28 '19 at 18:37
  • 1
    @mecampbellsoup it would be O(n), but that's a one-time cost. If you perform n or more lookups on that set, then the amortized cost ends up being O(1) again. (If you're only performing one lookup, then it wouldn't be worth it.) – TheSoundDefense Jan 03 '20 at 18:44

According to the Python documentation on time complexity

  • List membership x in s is on average a linear-time operation, O(n).
  • Set membership x in s is on average a constant-time operation, O(1).

Building a set is a worst-case linear-time operation, because every element of the list must be scanned to build the hash table, so O(n), where n is the number of elements in the collection.

The key observation is that, in Method 1, building the set, s = set(B), is a one-off operation; after that we perform n set-membership tests, x not in s, so in total O(n) + n * O(1), or O(n) time complexity.

Whereas in Method 2, the list-membership test x not in B is carried out for each element of A, so in total n * O(n) = O(n^2) time complexity.
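A toy instrumentation makes the comparison count concrete (the helper `linear_scan` is mine, emulating the linear scan a list `in` test performs). On the question's own lists, Method 2 already makes 35 element comparisons, versus roughly 14 constant-time hash operations for Method 1 (5 insertions plus 9 lookups):

```python
A = [1, 2, 3, 4, 5, 6, 7, 8, 9]
B = [1, 3, 5, 8, 9]

def linear_scan(lst, x, counter):
    """Emulate list membership, counting each element comparison."""
    for item in lst:
        counter[0] += 1
        if item == x:
            return True
    return False

count = [0]
C = [x for x in A if not linear_scan(B, x, count)]
print(C, count[0])  # → [2, 4, 6, 7] 35
```

As n grows, that comparison count grows quadratically, while the set-based total grows only linearly.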


Average time complexity for lookup (x in S) in a set is O(1) while the same for a list is O(n).

You can check the details at https://wiki.python.org/moin/TimeComplexity

Pankaj Sharma
  • Aah! I see. So there isn't a special case for sorted lists? A binary search would keep that down to `O(log n)` – PhD Aug 13 '14 at 19:56
  • @PhD how would Python know the list is sorted? A standard list object has no mechanism for determining that. – TheSoundDefense Aug 13 '14 at 19:58
  • @TheSoundDefense - Yes. I was hoping if there was some way to tell Python it is sorted. Sorry. I wasn't clear on that – PhD Aug 13 '14 at 20:04
  • @PhD The [bisect](https://docs.python.org/2/library/bisect.html) module might be useful for you here. – dano Aug 14 '14 at 16:45