There's a C++ comparison to get union of lists from lists of lists: The fastest way to find union of sets
And there's several other python related questions but none suggest the fastest way to unionize the lists:
From the answers, I've gathered that there are at least 2 ways to do it:
>>> from itertools import chain
>>> x = [[1,2,3], [3,4,5], [1,7,8]]
>>> list(set().union(*x))
[1, 2, 3, 4, 5, 7, 8]
>>> list(set(chain(*x)))
[1, 2, 3, 4, 5, 7, 8]
Note that I'm casting the set to list afterwards because I need the order of the list to be fixed for further processing.
After some comparison, it seems like list(set(chain(*x)))
is more stable and takes less time:
from itertools import chain
import time
import random
# Dry run.
x = [[random.choice(range(10000))
for i in range(10)] for j in range(10)]
list(set().union(*x))
list(set(chain(*x)))
y_time = 0
z_time = 0
for _ in range(1000):
x = [[random.choice(range(10000))
for i in range(10)] for j in range(10)]
start = time.time()
y = list(set().union(*x))
y_time += time.time() - start
#print 'list(set().union(*x)):\t', y_time
start = time.time()
z = list(set(chain(*x)))
z_time += time.time() - start
#print 'list(set(chain(*x))):\t', z_time
assert sorted(y) == sorted(z)
#print
print y_time / 1000.
print z_time / 1000.
[out]:
1.39586925507e-05
1.09834671021e-05
Taking out the variable of casting sets to list:
y_time = 0
z_time = 0
for _ in range(1000):
x = [[random.choice(range(10000))
for i in range(10)] for j in range(10)]
start = time.time()
y = set().union(*x)
y_time += time.time() - start
start = time.time()
z = set(chain(*x))
z_time += time.time() - start
assert sorted(y) == sorted(z)
print y_time / 1000.
print z_time / 1000.
[out]:
1.22241973877e-05
1.02684497833e-05
Here's the full output when I try to print the intermediate timings (without list casting): http://pastebin.com/raw/y3i6dXZ8
Why is it that list(set(chain(*x)))
takes less time than list(set().union(*x))
?
Is there another way of achieving the same union of lists? Using numpy
or pandas
or sframe
or something? Is the alternative faster?