I need a fast intersection of ordered 'list-like' variables
set
offers the faster intersection in python, when compared to numpy.array
or looping a list
, see timing code below for reference or the numerous posts on intersection on SO (1, 2, 3,...)
As set.intersection
is a for loop, I'd guess it should be applicable in principle to ordered structures.
Is there an high-speed alternative that I am missing?
edits: adding the frozenset solution, which does appears to be quite fast and maintain the order
Alternatives explored so far
I. First thing that came to mind is collections.OrderedDict
, however it does not seem to fit the bill because using a dict would be a waste of space and the keys of an OrderedDict
don't offer an intersection method, which brings you back to converting them to a set
.
II. Second option is to use a classic dict
, intersect on its keys
and maintain the sorting order in its values
, however this results in many more operations to extract the sorted values and defeats the idea of doing the sorting off-line in a pre-arranged dictionary.
III. A "nuclear alternative" that came to mind is looping in cython as per this google groups post, however I'd favour a pure-python solution if possible for project guidelines (and admittedly like this other user here my C
is a bit rusty)
Timing code for intersection in Python
import numpy as np
import timeit
# === Create some data ========================================================
x_array = np.random.randint(0,1000,100)
y_array = np.random.randint(0,1000,100)
x_set = set(x_array)
y_set = set(y_array)
x_list = list(x_array)
y_list = list(y_array)
x_frozenset = frozenset(x_list)
y_frozenset = frozenset(y_list)
# === Prepare the test Functions ==============================================
def f_set():
return x_set.intersection(y_set)
def f_numpy():
return np.intersect1d(x_array,y_array)
def f_list():
return [i for i in x_list if i in y_list]
def f_frozenset_loop():
return [x for x in x_list if x in y_frozenset]
def f_frozenset_list_filter_1():
list(filter(y_frozenset.__contains__, x_list))
def f_frozenset_list_filter_2():
return [*filter(y_frozenset.__contains__, x_list)]
# === Output ==================================================================
print('set.intersection ', timeit.timeit(f_set, number=1000))
print('frozenset/list filter 2 ', timeit.timeit(f_frozenset_list_filter_2, number=1000))
print('frozenset/list filter 1 ', timeit.timeit(f_frozenset_list_filter_1, number=1000))
print('frozenset/list loop ', timeit.timeit(f_frozenset_loop, number=1000))
print('np.intersect1d ', timeit.timeit(f_numpy, number=1000))
print('list loop (slooow) ', timeit.timeit(f_list, number=1000))
Output looks like this on my machine currently. There's 1 order of magnitude of difference between set and numpy:
set.intersection 0.001270713739261699
frozenset/list loop 0.006882539311596062
frozenset/list filter 2 0.01010412826037355
frozenset/list filter 1 0.019078864781689093
np.intersect1d 0.019451117983342953
list loop (slooow) 0.20725976641605037