Fast intersection in Python: what is an _ordered_ alternative to set?

Question

I need a fast intersection of ordered 'list-like' variables

set offers the faster intersection in python, when compared to numpy.array or looping a list, see timing code below for reference or the numerous posts on intersection on SO (1, 2, 3,...)

As set.intersection is a for loop, I'd guess it should be applicable in principle to ordered structures.

Is there an high-speed alternative that I am missing?

edits: adding the frozenset solution, which does appears to be quite fast and maintain the order

Alternatives explored so far

I. First thing that came to mind is collections.OrderedDict, however it does not seem to fit the bill because using a dict would be a waste of space and the keys of an OrderedDict don't offer an intersection method, which brings you back to converting them to a set.

II. Second option is to use a classic dict, intersect on its keys and maintain the sorting order in its values, however this results in many more operations to extract the sorted values and defeats the idea of doing the sorting off-line in a pre-arranged dictionary.

III. A "nuclear alternative" that came to mind is looping in cython as per this google groups post, however I'd favour a pure-python solution if possible for project guidelines (and admittedly like this other user here my C is a bit rusty)

Timing code for intersection in Python

import numpy as np
import timeit
# === Create some data ========================================================
x_array = np.random.randint(0,1000,100)
y_array = np.random.randint(0,1000,100)
x_set = set(x_array)
y_set = set(y_array)    
x_list = list(x_array)
y_list = list(y_array)
x_frozenset = frozenset(x_list)
y_frozenset = frozenset(y_list)
# === Prepare the test Functions ==============================================
def f_set():
    return x_set.intersection(y_set)
def f_numpy():
    return np.intersect1d(x_array,y_array)
def f_list():
    return [i for i in x_list if i in y_list]      
def f_frozenset_loop():
    return [x for x in x_list if x in y_frozenset]     
def f_frozenset_list_filter_1():
    list(filter(y_frozenset.__contains__, x_list))
def f_frozenset_list_filter_2():    
    return [*filter(y_frozenset.__contains__, x_list)]
# === Output ==================================================================
print('set.intersection          ', timeit.timeit(f_set, number=1000))
print('frozenset/list filter 2   ', timeit.timeit(f_frozenset_list_filter_2, number=1000))
print('frozenset/list filter 1   ', timeit.timeit(f_frozenset_list_filter_1, number=1000))
print('frozenset/list loop       ', timeit.timeit(f_frozenset_loop, number=1000))
print('np.intersect1d            ', timeit.timeit(f_numpy, number=1000))
print('list loop (slooow)        ', timeit.timeit(f_list, number=1000))

Output looks like this on my machine currently. There's 1 order of magnitude of difference between set and numpy:

set.intersection           0.001270713739261699
frozenset/list loop        0.006882539311596062
frozenset/list filter 2    0.01010412826037355
frozenset/list filter 1    0.019078864781689093
np.intersect1d             0.019451117983342953
list loop (slooow)         0.20725976641605037

try: `[i for i in x_list if i in set(y_list)]` this modifies your last function slightly but should make a big difference because set member testing if O(l) — Chris_Rands, Mar 17 '17 at 09:32
How do you choose an order if the orders in the source lists don't match? Eg if `a, b = [5, 2, 0, 3, 4, 1,], [6, 3, 7, 1, 8, 2]` what does `intersect(a, b)` return? — PM 2Ring, Mar 17 '17 at 09:35
Thanks Chris_Rands, I'm addint it to the tester. PM 2Ring, the first list used in the intersectin rules the order — Pythonic, Mar 17 '17 at 09:36
@Chris_Rands your optimisation will work better if you pull the `set(y_list)` out of the comprehension and do it first so it is only computed once. Creating a new `set` every time the condition is tested will be worse than searching a list. — Duncan, Mar 17 '17 at 09:38
How about `list(filter(frozenset(b).__contains__, a))`, or `[*filter(frozenset(b).__contains__, a)]` in Python 3.6 ? — PM 2Ring, Mar 17 '17 at 09:52
Looks in line, see results. Would stick with `[x for x in x_list if x in y_frozenset]`. Thanks! — Pythonic, Mar 17 '17 at 09:57
On a related note, I posted some speed test code & results at [How to find list intersection?](http://stackoverflow.com/questions/3697432/how-to-find-list-intersection), but of course that question wasn't concerned with preserving order. — PM 2Ring, Mar 17 '17 at 10:00
Ummm...what about list(collections.OrderedDict.fromkeys(itertools.chain(a, b)))? — Jon Clements, Mar 17 '17 at 10:02

Fast intersection in Python: what is an _ordered_ alternative to set?

0 Answers0