Why use a set for list comparisons?

Question

I just read another users question while looking for a way to compute the differences in two lists.

My question is why would I do

def diff(a,b):
    b = set(b)
    return [aa for aa in a if aa not in b]

rather than doing

def diff(a,b):
    tmp = []
    for i in a:
        if(i not in b):
            tmp.append(i)
return tmp

edit: just noticed the second diff function actually returned the similarities. It should be correct now.

score 8 · Accepted Answer · answered May 09 '12 at 17:20

Just from an algorithmic perspective, it takes O(n) to construct the set and O(n) to do the list comprehension (since testing if an element is contained in a set is O(1)). However in the second example, it would take O(n^2) to loop through both lists. So regardless of the programming language, the first approach is superior.

Also, list comprehensions in python are inherently faster than a for loop. This reduces the constant factor even more (and significantly so too). The reason why can be summarized in this post which I quote here:

The fact that list comprehensions can only be comprised of expressions, not statements, is a considerable factor, as much less work is required behind the scenes for each iteration. Another factor is that the underlying iteration mechanism for list comprehensions is much closer to a C loop than execution of a for loop.

Good explanation of why the latter is O(n^2). Also, here's some `timeit` timings I did on the list comprehension vs for loop approach (list comprehensions are about twice as fast in my example): https://gist.github.com/2647005 — Ben Hoyt, May 09 '12 at 17:34
wow... that is some upsetting evidence. I'm implementing the first one for sure. — Jake, May 09 '12 at 17:42

score 5 · Answer 2 · answered May 09 '12 at 17:19

The main difference between the two options is that the one that uses set is asymptotically much more efficient.

Under reasonably favourable conditions, looking up an item in a set can be done in O(1) time; looking up an item in a list requires O(n) time.

The second, less significant, difference is that one version uses a list comprehension while the other uses a for loop. List comprehensions tend to produce more compact code. They also tend to be more efficient (although if performance is a concern, the only way to get an accurate picture is by running benchmarks).

Levon · Answer 3 · 2012-05-09T17:36:41.697

2

List comprehension is generally regarded as more efficient than using regular list operations when creating a list. If memory is a concern, you may want to use a generator instead.

This provides a bit of information re performances of for-loops, map and list comprehension. @benhoyt also provided a useful link comparing loops vs list comprehension.

However, please note that if performance is a specific concern, it might be worth your while to time/benchmark your various options to select the best option for your specific requirements.

edited May 09 '12 at 17:36

answered May 09 '12 at 17:17

Levon

138,105
33
200
191

How much faster is a list comprehension in this instance? – Gabe May 09 '12 at 17:20
2

A list comprehension might be *slightly* faster than building a list with append, but not much. The real gain of sets over lists here is that `elem in container` is O(len(container)) if `container` is a list, but O(1) if `container` is a set. – Ben Hoyt May 09 '12 at 17:23
@Gabe I added a link to my post that talks about for-loops vs map vs list comprehensions that might be helpful. – Levon May 09 '12 at 17:30

score 2 · Answer 4 · answered May 09 '12 at 17:19

2

Using the set is usually faster because it only has to iterate b once, while your example has to iterate b once for each element in a.

answered May 09 '12 at 17:19

Gabe

84,912
12
139
238

score 2 · Answer 5 · answered May 10 '12 at 11:21

I've done some tests:

test_lists.py

a = range(1, 1000)
b = range(2, 1002)

tmp = []
for i in a:
    if(i not in b):
        tmp.append(i)

test_set_list_comprehensions.py

a = range(1, 1000)
b = range(2, 1002)

b = set(b)
[aa for aa in a if aa not in b]

test_set.py

a = range(1, 1000)
b = range(2, 1002)
list(set(a).difference(set(b)))

And that's what timeit says:

~$ python -m timeit 'import test_lists'
1000000 loops, best of 3: 0.671 usec per loop
~$ python -m timeit 'import test_set_list_comprehension'
1000000 loops, best of 3: 0.766 usec per loop
~$ python -m timeit 'import test_set'
1000000 loops, best of 3: 0.656 usec per loop

So the best one seems to be:

test_set.py

a = range(1, 1000)
b = range(2, 1002)
list(set(a).difference(set(b)))

Why use a set for list comparisons?

5 Answers5