0

I need a fast intersection of ordered 'list-like' variables

set offers the faster intersection in python, when compared to numpy.array or looping a list, see timing code below for reference or the numerous posts on intersection on SO (1, 2, 3,...)

As set.intersection is a for loop, I'd guess it should be applicable in principle to ordered structures.

Is there an high-speed alternative that I am missing?

edits: adding the frozenset solution, which does appears to be quite fast and maintain the order


Alternatives explored so far

I. First thing that came to mind is collections.OrderedDict, however it does not seem to fit the bill because using a dict would be a waste of space and the keys of an OrderedDict don't offer an intersection method, which brings you back to converting them to a set.

II. Second option is to use a classic dict, intersect on its keys and maintain the sorting order in its values, however this results in many more operations to extract the sorted values and defeats the idea of doing the sorting off-line in a pre-arranged dictionary.

III. A "nuclear alternative" that came to mind is looping in cython as per this google groups post, however I'd favour a pure-python solution if possible for project guidelines (and admittedly like this other user here my C is a bit rusty)


Timing code for intersection in Python

import numpy as np
import timeit
# === Create some data ========================================================
x_array = np.random.randint(0,1000,100)
y_array = np.random.randint(0,1000,100)
x_set = set(x_array)
y_set = set(y_array)    
x_list = list(x_array)
y_list = list(y_array)
x_frozenset = frozenset(x_list)
y_frozenset = frozenset(y_list)
# === Prepare the test Functions ==============================================
def f_set():
    return x_set.intersection(y_set)
def f_numpy():
    return np.intersect1d(x_array,y_array)
def f_list():
    return [i for i in x_list if i in y_list]      
def f_frozenset_loop():
    return [x for x in x_list if x in y_frozenset]     
def f_frozenset_list_filter_1():
    list(filter(y_frozenset.__contains__, x_list))
def f_frozenset_list_filter_2():    
    return [*filter(y_frozenset.__contains__, x_list)]
# === Output ==================================================================
print('set.intersection          ', timeit.timeit(f_set, number=1000))
print('frozenset/list filter 2   ', timeit.timeit(f_frozenset_list_filter_2, number=1000))
print('frozenset/list filter 1   ', timeit.timeit(f_frozenset_list_filter_1, number=1000))
print('frozenset/list loop       ', timeit.timeit(f_frozenset_loop, number=1000))
print('np.intersect1d            ', timeit.timeit(f_numpy, number=1000))
print('list loop (slooow)        ', timeit.timeit(f_list, number=1000))

Output looks like this on my machine currently. There's 1 order of magnitude of difference between set and numpy:

set.intersection           0.001270713739261699
frozenset/list loop        0.006882539311596062
frozenset/list filter 2    0.01010412826037355
frozenset/list filter 1    0.019078864781689093
np.intersect1d             0.019451117983342953
list loop (slooow)         0.20725976641605037
Community
  • 1
  • 1
Pythonic
  • 2,091
  • 3
  • 21
  • 34
  • try: `[i for i in x_list if i in set(y_list)]` this modifies your last function slightly but should make a big difference because set member testing if O(l) – Chris_Rands Mar 17 '17 at 09:32
  • How do you choose an order if the orders in the source lists don't match? Eg if `a, b = [5, 2, 0, 3, 4, 1,], [6, 3, 7, 1, 8, 2]` what does `intersect(a, b)` return? – PM 2Ring Mar 17 '17 at 09:35
  • Thanks Chris_Rands, I'm addint it to the tester. PM 2Ring, the first list used in the intersectin rules the order – Pythonic Mar 17 '17 at 09:36
  • Why not intersect 2 frozensets? – Pythonic Mar 17 '17 at 09:36
  • `frozenset`s are not ordered (they're just immutable) – Chris_Rands Mar 17 '17 at 09:37
  • 4
    @Chris_Rands your optimisation will work better if you pull the `set(y_list)` out of the comprehension and do it first so it is only computed once. Creating a new `set` every time the condition is tested will be worse than searching a list. – Duncan Mar 17 '17 at 09:38
  • @Duncan Good point, agree completely – Chris_Rands Mar 17 '17 at 09:39
  • How about `list(filter(frozenset(b).__contains__, a))`, or `[*filter(frozenset(b).__contains__, a)]` in Python 3.6 ? – PM 2Ring Mar 17 '17 at 09:52
  • Let me add it, see what it looks like – Pythonic Mar 17 '17 at 09:53
  • Looks in line, see results. Would stick with `[x for x in x_list if x in y_frozenset]`. Thanks! – Pythonic Mar 17 '17 at 09:57
  • On a related note, I posted some speed test code & results at [How to find list intersection?](http://stackoverflow.com/questions/3697432/how-to-find-list-intersection), but of course that question wasn't concerned with preserving order. – PM 2Ring Mar 17 '17 at 10:00
  • 2
    Ummm...what about list(collections.OrderedDict.fromkeys(itertools.chain(a, b)))? – Jon Clements Mar 17 '17 at 10:02

0 Answers0