You can do it in O(log(n) + m) time, where n is the number of tuples and m is the number of matching tuples, if you can afford to keep two sorted copies of the tuples. Sorting itself will cost O(n log(n)), i.e. it will be asymptotically slower than your naive approach, but if you have to do a certain number of queries (more than log(n), which is almost certainly quite small) it will pay off.
The idea is that you can use bisection to find the candidates that have the correct first value and the correct second value and then intersect these sets.
Note, however, that you want a slightly unusual kind of comparison: you care about all strings starting with the given argument. This simply means that when searching for the right-most occurrence you should pad the key with '9' characters (the largest digit, since the values are numeric strings).
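A tiny illustration of that padding trick (my own toy example, not part of the solution below), using the plain stdlib bisect functions on a sorted list of digit strings:
from bisect import bisect_left, bisect_right

keys = ['120', '1234', '1240', '125', '1259', '126']  # already sorted
prefix = '125'
lo = bisect_left(keys, prefix)                  # index of the first key >= '125'
hi = bisect_right(keys, prefix.ljust(5, '9'))   # index of the first key > '12599'
print(keys[lo:hi])                              # ['125', '1259']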
Complete working (though not heavily tested) code:
from random import randint
from operator import itemgetter

first = itemgetter(0)
second = itemgetter(1)

sa = [(str(randint(0, 1000000)), str(randint(0, 1000000))) for _ in range(300000)]
f_sorted = sorted(sa, key=first)
s_sorted = sa            # sort the second copy in place to save memory
s_sorted.sort(key=second)
max_length = max(len(s) for _, s in sa)

# See: bisect module from stdlib

def bisect_right(seq, element, key):
    lo = 0
    hi = len(seq)
    # Pad with '9's so that every string starting with `element` compares smaller.
    element = element.ljust(max_length, '9')
    while lo < hi:
        mid = (lo + hi) // 2
        if element < key(seq[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo

def bisect_left(seq, element, key):
    lo = 0
    hi = len(seq)
    while lo < hi:
        mid = (lo + hi) // 2
        if key(seq[mid]) < element:
            lo = mid + 1
        else:
            hi = mid
    return lo

def lookup_set(x_appr, y_appr):
    # bisect_right already returns the index one past the last match,
    # so the slices must not add 1.
    x_left = bisect_left(f_sorted, x_appr, key=first)
    x_right = bisect_right(f_sorted, x_appr, key=first)
    x_candidates = f_sorted[x_left:x_right]

    y_left = bisect_left(s_sorted, y_appr, key=second)
    y_right = bisect_right(s_sorted, y_appr, key=second)
    y_candidates = s_sorted[y_left:y_right]

    return set(x_candidates).intersection(y_candidates)
And here is a comparison with your initial solution:
In [2]: def lookup_set2(x_appr, y_appr):
...: return [n for n in sa if n[0].startswith(x_appr) and n[1].startswith(y_appr)]
In [3]: lookup_set('123', '124')
Out[3]: set([])
In [4]: lookup_set2('123', '124')
Out[4]: []
In [5]: lookup_set('123', '125')
Out[5]: set([])
In [6]: lookup_set2('123', '125')
Out[6]: []
In [7]: lookup_set('12', '125')
Out[7]: set([('12478', '125908'), ('124625', '125184'), ('125494', '125940')])
In [8]: lookup_set2('12', '125')
Out[8]: [('124625', '125184'), ('12478', '125908'), ('125494', '125940')]
In [9]: %timeit lookup_set('12', '125')
1000 loops, best of 3: 589 us per loop
In [10]: %timeit lookup_set2('12', '125')
10 loops, best of 3: 145 ms per loop
In [11]: %timeit lookup_set('123', '125')
10000 loops, best of 3: 102 us per loop
In [12]: %timeit lookup_set2('123', '125')
10 loops, best of 3: 144 ms per loop
As you can see, this solution is about 240-1400 times faster (in these examples) than your naive approach.
If you have a big set of matches:
In [19]: %timeit lookup_set('1', '2')
10 loops, best of 3: 27.1 ms per loop
In [20]: %timeit lookup_set2('1', '2')
10 loops, best of 3: 152 ms per loop
In [21]: len(lookup_set('1', '2'))
Out[21]: 3587
In [23]: %timeit lookup_set('', '2')
10 loops, best of 3: 182 ms per loop
In [24]: %timeit lookup_set2('', '2')
1 loops, best of 3: 212 ms per loop
In [25]: len(lookup_set2('', '2'))
Out[25]: 33053
As you can see, this solution is faster even when the number of matches is about 10% of the total size. However, if you try to match all of the data:
In [26]: %timeit lookup_set('', '')
1 loops, best of 3: 360 ms per loop
In [27]: %timeit lookup_set2('', '')
1 loops, best of 3: 221 ms per loop
It becomes slower (though not by much), although this is quite a peculiar case, and I doubt you'll frequently match almost all of the elements.
Note that the time taken to sort the data is quite small:
In [13]: from random import randint
...: from operator import itemgetter
...:
...: first = itemgetter(0)
...: second = itemgetter(1)
...:
...: sa2 = [(str(randint(0, 1000000)), str(randint(0, 1000000))) for _ in range(300000)]
In [14]: %%timeit
...: f_sorted = sorted(sa2, key=first)
...: s_sorted = sorted(sa2, key=second)
...: max_length = max(len(s) for _,s in sa2)
...:
1 loops, best of 3: 881 ms per loop
As you can see, it takes less than one second to build the two sorted copies. The code above would actually be slightly faster, since it sorts the second copy in place (although Timsort may still require O(n) auxiliary memory).
This means that if you have to do more than about 6-8 queries this solution will be faster (roughly 881 ms of sorting divided by the ~145 ms saved per query).
Note: Python's standard library provides a bisect module. However, it doesn't allow a key parameter (even though I remember reading that Guido wanted it, so it may be added in the future; it eventually was, in Python 3.10). Hence, if you want to use it directly, you'll have to use the "decorate-sort-undecorate" idiom.
Instead of:
f_sorted = sorted(sa, key=first)
You should do:
f_sorted = sorted((f, (f, s)) for f, s in sa)
I.e. you explicitly insert the key as the first element of the tuple (using fresh loop variables so the first/second itemgetters aren't shadowed). Afterwards you can pass a one-element tuple such as ('123',) to the bisect_* functions and they will find the correct index (padding the key with '9's for the right-hand bound, as before).
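A minimal sketch of how that would look (my own code, not from the solution above; it assumes the same sa and max_length defined earlier):
from bisect import bisect_left, bisect_right

# Decorate: put the first element in front so plain tuple comparison sorts by it.
dec_first = sorted((f, (f, s)) for f, s in sa)

def first_candidates(prefix):
    lo = bisect_left(dec_first, (prefix,))
    # Pad one character past max_length so no real key can compare equal to the bound.
    hi = bisect_right(dec_first, (prefix.ljust(max_length + 1, '9'),))
    return [tup for _, tup in dec_first[lo:hi]]   # undecorate
Decorating a second copy by the second element gives the other candidate list, and the intersection step stays the same.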
I decided to avoid this: I copy-pasted the code from the module's source and slightly modified it to provide a simpler interface for your use case.
Final remark: if you could convert the tuple elements to integers, the comparisons would be faster. However, most of the time would still be spent performing the intersection of the sets, so I don't know exactly how much it would improve performance.