
I have a few different lists that I want to turn into a set and use to find the difference from another set. Let's call them A, B, and C.

Which is the more efficient way to do this: `set(A + B + C)` or `set(A).union(set(B)).union(set(C))`?

Would certain properties of A, B, and C, such as the number of duplicates or their lengths, affect this decision?

And what about having an arbitrary number of lists?
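
For concreteness, here is a minimal sketch of the two candidates (with made-up example data; `other` stands in for the other set I want to diff against):

A = [1, 2, 3]
B = [3, 4]
C = [4, 5]
other = {2, 5, 9}

u1 = set(A + B + C)      # concatenate the lists, then build one set
u2 = set(A).union(B, C)  # union() accepts multiple iterables, lists included
print(u1 == u2)          # True
print(u1 - other)        # {1, 3, 4} -- the difference I ultimately want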

hert
  • I don't think you can use + for adding sets – primero Sep 09 '15 at 14:51
  • There are various ideas over at http://stackoverflow.com/questions/2151517/pythonic-way-to-create-union-of-all-values-contained-in-multiple-lists – Alex Riley Sep 09 '15 at 14:52
  • Ah, good catch; I will update my question. I incorrectly assumed `+` would perform a set union behind the scenes. – hert Sep 09 '15 at 14:55
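
To illustrate primero's point: `+` is not defined for sets, while `|` and `.union()` are. A minimal demonstration:

s, t = {1, 2}, {2, 3}
# s + t            # TypeError: unsupported operand type(s) for +: 'set' and 'set'
print(s | t)       # {1, 2, 3}
print(s.union(t))  # {1, 2, 3} -- equivalent method form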

2 Answers


For small lists, `set(A + B + C)` works fine. For larger lists, the following is more efficient because it does not create a temporary list:

myset = set(A)   # start from the first list
myset.update(B)  # update() adds elements in place -- no concatenated temporary
myset.update(C)

A different approach uses `itertools.chain`, which is also efficient because it does not create a temporary list:

import itertools
myset = set(itertools.chain(A, B, C))
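
If the number of input lists is itself arbitrary (the question's last point), the same idea generalizes with `chain.from_iterable`; a minimal sketch, where `lists` is an assumed name for whatever sequence holds them:

import itertools

lists = [A, B, C]  # could be any number of lists
myset = set(itertools.chain.from_iterable(lists))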
Hai Vu
  • Actually, it turns out that `itertools.chain` is not much faster than `A+B+C` in this case, even for 10000-element lists (timing data: `A+B+C` 2.33ms, `update` 2.09ms, `chain` 2.23ms) – nneonneo Sep 09 '15 at 15:08
  • Results are weird; please see http://stackoverflow.com/questions/32483539/why-is-creating-a-set-from-a-concatenated-list-faster-than-using-update - `A+B+C` is actually faster for large lists... – nneonneo Sep 09 '15 at 15:24

Here are some timing experiments (run in IPython, using the `%timeit` magic):

import numpy as np
import itertools

for r in [10, 100, 1000, 10000]:
    # r controls the range of values, i.e. how many duplicates the lists contain
    A = list(np.random.randint(r, size=1000000))
    B = list(np.random.randint(r, size=1000000))

    %timeit set(A).update(B)
    %timeit set(A+B)
    %timeit set(itertools.chain(A, B))
    print('---')

Here are the results for size = 1000 (each group of three lines is one value of r, timed in the order update, A+B, chain):

10000 loops, best of 3: 87.2 µs per loop
10000 loops, best of 3: 87.3 µs per loop
10000 loops, best of 3: 90.7 µs per loop
---
10000 loops, best of 3: 88.2 µs per loop
10000 loops, best of 3: 86.8 µs per loop
10000 loops, best of 3: 89.4 µs per loop
---
10000 loops, best of 3: 80.9 µs per loop
10000 loops, best of 3: 84.5 µs per loop
10000 loops, best of 3: 87 µs per loop
---
10000 loops, best of 3: 97.4 µs per loop
10000 loops, best of 3: 102 µs per loop
10000 loops, best of 3: 107 µs per loop

Here are the results for size = 1000000:

10 loops, best of 3: 89 ms per loop
10 loops, best of 3: 106 ms per loop
10 loops, best of 3: 98.4 ms per loop
---
10 loops, best of 3: 89.1 ms per loop
10 loops, best of 3: 110 ms per loop
10 loops, best of 3: 94.2 ms per loop
---
10 loops, best of 3: 94.9 ms per loop
10 loops, best of 3: 109 ms per loop
10 loops, best of 3: 105 ms per loop
---
10 loops, best of 3: 115 ms per loop
10 loops, best of 3: 143 ms per loop
10 loops, best of 3: 138 ms per loop

So `update()` seems to be slightly faster than the other two methods. However, I don't think the time difference is significant.
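
For readers without IPython, an equivalent measurement can be made with the standard timeit module (its `globals` parameter requires Python 3.5+); a minimal sketch for the size = 1000 case:

import itertools
import timeit

import numpy as np

for r in [10, 100, 1000, 10000]:
    A = list(np.random.randint(r, size=1000))
    B = list(np.random.randint(r, size=1000))
    env = {'A': A, 'B': B, 'itertools': itertools}
    for stmt in ('set(A).update(B)',
                 'set(A + B)',
                 'set(itertools.chain(A, B))'):
        # best of 3 repeats, 1000 executions each, reported per execution
        best = min(timeit.repeat(stmt, globals=env, number=1000, repeat=3))
        print('r=%-6d %-28s %6.1f us per loop' % (r, stmt, best / 1000 * 1e6))
    print('---')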

Sait