
I have a few different lists that I want to turn into a set and use to find the difference from another set. Let's call them A, B, and C.

Which is the more efficient way to do this: `set(A + B + C)` or `set(A).union(set(B)).union(set(C))`?

Would certain properties of A, B, and C, such as the number of duplicates or their lengths, affect this decision?

And what about having an arbitrary number of lists?
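
For concreteness, here is a minimal sketch of the two candidates (with made-up example data; `other` stands in for the other set I want to diff against):

A = [1, 2, 3]
B = [3, 4]
C = [4, 5]
other = {2, 5, 9}

u1 = set(A + B + C)      # concatenate the lists, then build one set
u2 = set(A).union(B, C)  # union() accepts multiple iterables, lists included
print(u1 == u2)          # True
print(u1 - other)        # {1, 3, 4} -- the difference I ultimately want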

hert
  • I don't think you can use + for adding sets – primero Sep 09 '15 at 14:51
  • There are various ideas over at http://stackoverflow.com/questions/2151517/pythonic-way-to-create-union-of-all-values-contained-in-multiple-lists – Alex Riley Sep 09 '15 at 14:52
  • Ah, good catch; I will update my question. I incorrectly assumed `+` would perform a set union behind the scenes. – hert Sep 09 '15 at 14:55
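
To illustrate primero's point: `+` is not defined for sets, while `|` and `.union()` are. A minimal demonstration:

s, t = {1, 2}, {2, 3}
# s + t            # TypeError: unsupported operand type(s) for +: 'set' and 'set'
print(s | t)       # {1, 2, 3}
print(s.union(t))  # {1, 2, 3} -- equivalent method form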

2 Answers


For small lists, `set(A + B + C)` works fine. For larger lists, the following is more efficient because it does not create a temporary list:

myset = set(A)   # start from the first list
myset.update(B)  # update() adds elements in place -- no concatenated temporary
myset.update(C)

A different approach uses `itertools.chain`, which is also efficient because it does not create a temporary list:

import itertools
myset = set(itertools.chain(A, B, C))
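
If the number of input lists is itself arbitrary (the question's last point), the same idea generalizes with `chain.from_iterable`; a minimal sketch, where `lists` is an assumed name for whatever sequence holds them:

import itertools

lists = [A, B, C]  # could be any number of lists
myset = set(itertools.chain.from_iterable(lists))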
Hai Vu
  • Actually, it turns out that `itertools.chain` is not much faster than `A+B+C` in this case, even for 10000-element lists (timing data: `A+B+C` 2.33ms, `update` 2.09ms, `chain` 2.23ms) – nneonneo Sep 09 '15 at 15:08
  • Results are weird; please see http://stackoverflow.com/questions/32483539/why-is-creating-a-set-from-a-concatenated-list-faster-than-using-update - `A+B+C` is actually faster for large lists... – nneonneo Sep 09 '15 at 15:24

Here are some timing experiments (run in IPython, using the `%timeit` magic):

import numpy as np
import itertools

for r in [10, 100, 1000, 10000]:
    # r controls the range of values, i.e. how many duplicates the lists contain
    A = list(np.random.randint(r, size=1000000))
    B = list(np.random.randint(r, size=1000000))

    %timeit set(A).update(B)
    %timeit set(A+B)
    %timeit set(itertools.chain(A, B))
    print('---')

Here are the results for size = 1000 (each group of three lines is one value of r, timed in the order update, A+B, chain):

10000 loops, best of 3: 87.2 µs per loop
10000 loops, best of 3: 87.3 µs per loop
10000 loops, best of 3: 90.7 µs per loop
---
10000 loops, best of 3: 88.2 µs per loop
10000 loops, best of 3: 86.8 µs per loop
10000 loops, best of 3: 89.4 µs per loop
---
10000 loops, best of 3: 80.9 µs per loop
10000 loops, best of 3: 84.5 µs per loop
10000 loops, best of 3: 87 µs per loop
---
10000 loops, best of 3: 97.4 µs per loop
10000 loops, best of 3: 102 µs per loop
10000 loops, best of 3: 107 µs per loop

Here are the results for size = 1000000:

10 loops, best of 3: 89 ms per loop
10 loops, best of 3: 106 ms per loop
10 loops, best of 3: 98.4 ms per loop
---
10 loops, best of 3: 89.1 ms per loop
10 loops, best of 3: 110 ms per loop
10 loops, best of 3: 94.2 ms per loop
---
10 loops, best of 3: 94.9 ms per loop
10 loops, best of 3: 109 ms per loop
10 loops, best of 3: 105 ms per loop
---
10 loops, best of 3: 115 ms per loop
10 loops, best of 3: 143 ms per loop
10 loops, best of 3: 138 ms per loop

So `update()` seems to be slightly faster than the other two methods. However, I don't think the time difference is significant.
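
For readers without IPython, an equivalent measurement can be made with the standard timeit module (its `globals` parameter requires Python 3.5+); a minimal sketch for the size = 1000 case:

import itertools
import timeit

import numpy as np

for r in [10, 100, 1000, 10000]:
    A = list(np.random.randint(r, size=1000))
    B = list(np.random.randint(r, size=1000))
    env = {'A': A, 'B': B, 'itertools': itertools}
    for stmt in ('set(A).update(B)',
                 'set(A + B)',
                 'set(itertools.chain(A, B))'):
        # best of 3 repeats, 1000 executions each, reported per execution
        best = min(timeit.repeat(stmt, globals=env, number=1000, repeat=3))
        print('r=%-6d %-28s %6.1f us per loop' % (r, stmt, best / 1000 * 1e6))
    print('---')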

Sait