
I have a set containing ~300,000 tuples:

In [26]: sa = set(o.node for o in vrts_l2_5) 
In [27]: len(sa)
Out[27]: 289798
In [31]: random.sample(sa, 1)
Out[31]: [('835644', '4696507')]

Now I want to look up elements based on a common prefix, e.g. the first 4 'digits' (the elements are in fact strings). This is my approach:

def lookup_set(x_appr, y_appr):
    return [n for n in sa if n[0].startswith(x_appr) and n[1].startswith(y_appr)]

In [36]: lookup_set('6652','46529')
Out[36]: [('665274', '4652941'), ('665266', '4652956')]

Is there a more efficient, that is, faster way to do this?

LarsVegas
  • are the numbers you're searching for of set lengths? For example, if you're looking for a number that starts with `6652`, will the target always be like `665274` (6 digits)? – monkut Oct 03 '13 at 09:12
  • Yes, the target length will always be the same. `len(tuple_element[0]) = 6, len(tuple_element[1]) = 7` – LarsVegas Oct 03 '13 at 09:16
  • If you could copy that list, you could sort a copy by the first string, sort the other copy by the second and use a bisection algorithm to extract the candidates, then use `set`s to perform the intersection of the results. This should be `O(log(n) + m)` expected time (where `n` is the number of pairs and `m` the number of matches), while your solution is `O(n)`. – Bakuriu Oct 03 '13 at 11:38
  • @Bakuriu Sounds great but honestly, I have no idea what you are saying. Can you provide some code to illustrate your approach? – LarsVegas Oct 03 '13 at 11:48

7 Answers


You can do it in O(log(n) + m) time, where n is the number of tuples and m is the number of matching tuples, if you can afford to keep two sorted copies of the tuples. Sorting itself will cost O(n log(n)), i.e. it is asymptotically slower than your naive approach, but if you run more than about log(n) queries (almost certainly a small number) it will pay off.

The idea is that you can use bisection to find the candidates that have the correct first value, the candidates that have the correct second value, and then intersect the two sets.

Note, however, that you need an unusual kind of comparison: you want all strings that start with the given argument. Since the elements are digit strings, this simply means that when searching for the right-most occurrence you should pad the key with '9's (the largest digit).
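
To see why the '9' padding works, here is a tiny illustrative check (not part of the answer's code; it assumes digit-only strings of known maximum length):

# '9' is the largest digit, so padding the prefix with '9's yields an upper
# bound for every string that starts with that prefix.
prefix = '6652'
max_length = 7
hi_key = prefix.ljust(max_length, '9')   # '6652999'
assert '665274' <= hi_key                # a matching value sorts before the bound
assert '6653000' > hi_key                # the first non-matching value sorts after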

Complete working (although not heavily tested) code:

from random import randint
from operator import itemgetter

first = itemgetter(0)
second = itemgetter(1)

sa = [(str(randint(0, 1000000)), str(randint(0, 1000000))) for _ in range(300000)]
f_sorted = sorted(sa, key=first)
s_sorted = sa    # reuse the original list; it is sorted in place below
s_sorted.sort(key=second)
max_length = max(max(len(f), len(s)) for f, s in sa)   # longest string in either position

# See: bisect module from stdlib
def bisect_right(seq, element, key):
    lo = 0
    hi = len(seq)
    # Pad with '9' (the largest digit) so that every string starting with
    # `element` compares less than or equal to the padded key.
    element = element.ljust(max_length, '9')
    while lo < hi:
        mid = (lo+hi)//2
        if element < key(seq[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo


def bisect_left(seq, element, key):
    lo = 0
    hi = len(seq)
    while lo < hi:
        mid = (lo+hi)//2
        if key(seq[mid]) < element:
            lo = mid + 1
        else:
            hi = mid
    return lo


def lookup_set(x_appr, y_appr):
    x_left = bisect_left(f_sorted, x_appr, key=first)
    x_right = bisect_right(f_sorted, x_appr, key=first)
    x_candidates = f_sorted[x_left:x_right]   # bisect_right is already exclusive
    y_left = bisect_left(s_sorted, y_appr, key=second)
    y_right = bisect_right(s_sorted, y_appr, key=second)
    y_candidates = s_sorted[y_left:y_right]
    return set(x_candidates).intersection(y_candidates)

And a comparison with your initial solution:

In [2]: def lookup_set2(x_appr, y_appr):
   ...:     return [n for n in sa if n[0].startswith(x_appr) and n[1].startswith(y_appr)]

In [3]: lookup_set('123', '124')
Out[3]: set([])

In [4]: lookup_set2('123', '124')
Out[4]: []

In [5]: lookup_set('123', '125')
Out[5]: set([])

In [6]: lookup_set2('123', '125')
Out[6]: []

In [7]: lookup_set('12', '125')
Out[7]: set([('12478', '125908'), ('124625', '125184'), ('125494', '125940')])

In [8]: lookup_set2('12', '125')
Out[8]: [('124625', '125184'), ('12478', '125908'), ('125494', '125940')]

In [9]: %timeit lookup_set('12', '125')
1000 loops, best of 3: 589 us per loop

In [10]: %timeit lookup_set2('12', '125')
10 loops, best of 3: 145 ms per loop

In [11]: %timeit lookup_set('123', '125')
10000 loops, best of 3: 102 us per loop

In [12]: %timeit lookup_set2('123', '125')
10 loops, best of 3: 144 ms per loop

As you can see, this solution is about 240-1400 times faster (in these examples) than the naive approach.

If you have a big set of matches:

In [19]: %timeit lookup_set('1', '2')
10 loops, best of 3: 27.1 ms per loop

In [20]: %timeit lookup_set2('1', '2')
10 loops, best of 3: 152 ms per loop

In [21]: len(lookup_set('1', '2'))
Out[21]: 3587
In [23]: %timeit lookup_set('', '2')
10 loops, best of 3: 182 ms per loop

In [24]: %timeit lookup_set2('', '2')
1 loops, best of 3: 212 ms per loop

In [25]: len(lookup_set2('', '2'))
Out[25]: 33053

As you can see, this solution is faster even when the number of matches is about 10% of the total size. However, if you try to match all the data:

In [26]: %timeit lookup_set('', '')
1 loops, best of 3: 360 ms per loop

In [27]: %timeit lookup_set2('', '')
1 loops, best of 3: 221 ms per loop

It becomes slower, though not by much; this is quite a peculiar case anyway, and I doubt you'll frequently match almost all the elements.

Note that the time taken to sort the data is quite small:

In [13]: from random import randint
    ...: from operator import itemgetter
    ...: 
    ...: first = itemgetter(0)
    ...: second = itemgetter(1)
    ...: 
    ...: sa2 = [(str(randint(0, 1000000)), str(randint(0, 1000000))) for _ in range(300000)]

In [14]: %%timeit
    ...: f_sorted = sorted(sa2, key=first)
    ...: s_sorted = sorted(sa2, key=second)
    ...: max_length = max(len(s) for _,s in sa2)
    ...: 
1 loops, best of 3: 881 ms per loop

As you can see, it takes less than one second to build the two sorted copies. The code above would actually be slightly faster, since it sorts the second copy in place (although Timsort may still require O(n) memory).

This means that if you have to run more than about 6-8 queries, this solution will be faster.


Note: Python's standard library provides a bisect module. However, it doesn't accept a key parameter (even though I remember reading that Guido wanted it, so it may be added in the future). Hence, if you want to use it directly, you'll have to use the "decorate-sort-undecorate" idiom.

Instead of:

f_sorted = sorted(sa, key=first)

You should do:

f_sorted = sorted((f, (f, s)) for f, s in sa)

I.e. you explicitly insert the key as the first element of each tuple. Afterwards you can pass something like ('123', '') to the bisect_* functions and they will find the correct index; see the sketch below.
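
For illustration, here is a sketch of that idiom with the stdlib bisect, assuming the digit-string data from above (a 1-tuple probe is used instead of ('123', ''), since comparing a string to a tuple raises a TypeError on Python 3):

import bisect

max_length = max(len(f) for f, _ in sa)

# Decorate: put the search key first, so plain tuple comparison orders by it.
f_decorated = sorted((f, (f, s)) for f, s in sa)

def prefix_slice(decorated, prefix):
    lo = bisect.bisect_left(decorated, (prefix,))
    # Pad with '9' (the largest digit) for the upper bound; the extra '0'
    # makes the probe strictly greater than an exact maximum-length match.
    hi = bisect.bisect_left(decorated, (prefix.ljust(max_length, '9') + '0',))
    return [pair for _, pair in decorated[lo:hi]]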

I decided to avoid this: I copy-pasted the code from the module's source and modified it slightly to provide a simpler interface for your use case.


Final remark: if you could convert the tuple elements to integers, the comparisons would be faster. However, most of the time would still be spent performing the intersection of the sets, so I don't know exactly how much it would improve performance.

Bakuriu

Integer manipulation is much faster than string manipulation (and integers are smaller in memory as well).

So if you can compare integers instead, you'll be much faster. I suspect something like this should work for you:

sa = set(tuple(map(int, o.node)) for o in vrts_l2_5)

Then this may work for you:

def lookup_set(samples, x_appr, x_len, y_appr, y_len):
    """
    x_appr == SSS0000 where S are the prefix digits to search for
    x_len  == number of trailing zeros (if x_appr == SSS0000 then x_len == 4)
    """
    # Truncate the x_len/y_len trailing digits before comparing; note that
    # round(x, -x_len) would round to the *nearest* multiple, not truncate.
    x_mod, y_mod = 10 ** x_len, 10 ** y_len
    return ((x, y) for x, y in samples
            if x - x % x_mod == x_appr and y - y % y_mod == y_appr)

Also, it returns a generator, so you're not loading all the results into memory at once.
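
A hedged usage sketch, assuming the question's data (6-digit x values and 7-digit y values, as confirmed in the comments):

# The 4-digit prefix 6652 leaves 2 free digits in a 6-digit x (x_len == 2);
# the 5-digit prefix 46529 leaves 2 free digits in a 7-digit y (y_len == 2).
matches = list(lookup_set(sa, 665200, 2, 4652900, 2))
# for the OP's sample data this yields [(665274, 4652941), (665266, 4652956)]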

Updated after the issues Bakuriu pointed out in the comments

monkut
  • The `(x % x_appr) < x_limit` is quite unreadable and hard to get. You can use `round` and a negative number of places: `round(125121, -3) -> 125000`, and then use `==`. However this solution is still not asymptotically better than the OP's (which may not be relevant, but it could be a sign it can be made much faster). – Bakuriu Oct 03 '13 at 12:56
  • Uhm no wait. It isn't simply unreadable. It's *plain wrong*. For example if `x_appr = SSS000` and `x = 2*x_appr+1` the test would pass, but it shouldn't have passed (almost certainly). – Bakuriu Oct 03 '13 at 13:05
  • Nice! Didn't know round took a negative – monkut Oct 03 '13 at 20:49
  • Thanks for pointing that out, my tests were incomplete. I believe the conditions could be modified to account for that, but the round method is much cleaner. – monkut Oct 03 '13 at 21:01

You could use a trie data structure. It is possible to build one with a tree of dict objects (see How to create a TRIE in Python), but the package marisa-trie implements a memory-efficient version by binding to C++ libraries.
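
For illustration, a minimal dict-of-dicts sketch of the same idea (illustrative only, not the marisa-trie API; stored keys live under a '$' sentinel, which cannot collide with digit characters):

def build_trie(strings):
    # Each node maps a character to a child node.
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node.setdefault('$', []).append(s)
    return root

def prefix_matches(root, prefix):
    # Walk down to the node for `prefix`, then collect every key stored below it.
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        for k, v in n.items():
            if k == '$':
                out.extend(v)
            else:
                stack.append(v)
    return out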

I have not used this library before, but playing around with it, I got this working:

from random import randint
from marisa_trie import RecordTrie

sa = [(str(randint(1000000,9999999)),str(randint(1000000,9999999))) for i in range(100000)]
# make length of string in packed format big enough!
fmt = ">10p10p"
sa_tries = (RecordTrie(fmt, zip((unicode(first) for first, _ in sa), sa)),
            RecordTrie(fmt, zip((unicode(second) for _, second in sa), sa)))

def lookup_set(sa_tries, x_appr, y_appr):
    """Look up each prefix in the appropriate trie and intersect the results."""
    return (set(item[1] for item in sa_tries[0].items(unicode(x_appr))) &
            set(item[1] for item in sa_tries[1].items(unicode(y_appr))))

lookup_set(sa_tries, "2", "4")
space

I went through and implemented the 4 suggested solutions to compare their efficiency, running the tests with different prefix lengths to see how the input affects performance. The trie and sorted-list performance is clearly sensitive to the length of the input, with both getting faster as the prefix gets longer (I think it is actually sensitivity to the size of the output, which shrinks as the prefix grows). However, the sorted solution is definitely the fastest in all situations.

In these timing tests there were 200,000 tuples in sa and 10 runs for each method:

for prefix length 1
  lookup_set_startswith    : min=0.072107 avg=0.073878 max=0.077299
  lookup_set_int           : min=0.030447 avg=0.037739 max=0.045255
  lookup_set_trie          : min=0.111548 avg=0.124679 max=0.147859
  lookup_set_sorted        : min=0.012086 avg=0.013643 max=0.016096
for prefix length 2
  lookup_set_startswith    : min=0.066498 avg=0.069850 max=0.081271
  lookup_set_int           : min=0.027356 avg=0.034562 max=0.039137
  lookup_set_trie          : min=0.006949 avg=0.010091 max=0.032491
  lookup_set_sorted        : min=0.000915 avg=0.000944 max=0.001004
for prefix length 3
  lookup_set_startswith    : min=0.065708 avg=0.068467 max=0.079485
  lookup_set_int           : min=0.023907 avg=0.033344 max=0.043196
  lookup_set_trie          : min=0.000774 avg=0.000854 max=0.000929
  lookup_set_sorted        : min=0.000149 avg=0.000155 max=0.000163
for prefix length 4
  lookup_set_startswith    : min=0.065742 avg=0.068987 max=0.077351
  lookup_set_int           : min=0.026766 avg=0.034558 max=0.052269
  lookup_set_trie          : min=0.000147 avg=0.000167 max=0.000189
  lookup_set_sorted        : min=0.000065 avg=0.000068 max=0.000070

Here's the code:

import random
def random_digits(num_digits):
    return random.randint(10**(num_digits-1), (10**num_digits)-1)

sa = [(str(random_digits(6)),str(random_digits(7))) for _ in range(200000)]

### naive approach
def lookup_set_startswith(x_appr, y_appr):
    return [item for item in sa if item[0].startswith(x_appr) and item[1].startswith(y_appr) ]

### trie approach
from marisa_trie import RecordTrie

# make length of string in packed format big enough!
fmt = ">10p10p"
sa_tries = (RecordTrie(fmt, zip([unicode(first) for first, second in sa], sa)),
            RecordTrie(fmt, zip([unicode(second) for first, second in sa], sa)))

def lookup_set_trie(x_appr, y_appr):
    # look up each prefix in the appropriate trie and intersect the results
    return set(item[1] for item in sa_tries[0].items(unicode(x_appr))) & \
           set(item[1] for item in sa_tries[1].items(unicode(y_appr)))

### int approach
sa_ints = [(int(first), int(second)) for first, second in sa]

sa_lens = tuple(map(len, sa[0]))

def lookup_set_int(x_appr, y_appr):
    x_limit = 10**(sa_lens[0]-len(x_appr))
    y_limit = 10**(sa_lens[1]-len(y_appr))

    x_int = int(x_appr) * x_limit
    y_int = int(y_appr) * y_limit

    return [sa[i] for i, int_item in enumerate(sa_ints) \
        if (x_int <= int_item[0] and int_item[0] < x_int+x_limit) and \
           (y_int <= int_item[1] and int_item[1] < y_int+y_limit) ]

### sorted set approach
from operator import itemgetter

first = itemgetter(0)
second = itemgetter(1)

sa_sorted = (sorted(sa, key=first), sorted(sa, key=second))
max_length = max(max(len(f), len(s)) for f, s in sa)   # longest string in either position

# See: bisect module from stdlib
def bisect_right(seq, element, key):
    lo = 0
    hi = len(seq)
    element = element.ljust(max_length, '9')
    while lo < hi:
        mid = (lo+hi)//2
        if element < key(seq[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo


def bisect_left(seq, element, key):
    lo = 0
    hi = len(seq)
    while lo < hi:
        mid = (lo+hi)//2
        if key(seq[mid]) < element:
            lo = mid + 1
        else:
            hi = mid
    return lo


def lookup_set_sorted(x_appr, y_appr):
    x_left = bisect_left(sa_sorted[0], x_appr, key=first)
    x_right = bisect_right(sa_sorted[0], x_appr, key=first)
    x_candidates = sa_sorted[0][x_left:x_right]
    y_left = bisect_left(sa_sorted[1], y_appr, key=second)
    y_right = bisect_right(sa_sorted[1], y_appr, key=second)
    y_candidates = sa_sorted[1][y_left:y_right]
    return set(x_candidates).intersection(y_candidates)     


####
# test correctness
ntests = 10

candidates = [lambda x, y: set(lookup_set_startswith(x,y)), 
              lambda x, y: set(lookup_set_int(x,y)),
              lookup_set_trie, 
              lookup_set_sorted]
print "checking correctness (or at least consistency)..."
for dlen in range(1,5):
    print "prefix length %d:" % dlen,
    for i in range(ntests):
        print " #%d" % i,
        prefix = map(str, (random_digits(dlen), random_digits(dlen)))
        answers = [c(*prefix) for c in candidates]
        for i, ans in enumerate(answers):
            for j, ans2 in enumerate(answers[i+1:]):
                assert ans == ans2, "answers for %s for #%d and #%d don't match" \
                                    % (prefix, i, j+i+1)
    print


####
# time calls
import timeit
import numpy as np

ntests = 10

candidates = [lookup_set_startswith,
              lookup_set_int,
              lookup_set_trie, 
              lookup_set_sorted]

print "timing..."
for dlen in range(1,5):
    print "for prefix length", dlen

    times = [ [] for c in candidates ]
    for _ in range(ntests):
        prefix = map(str, (random_digits(dlen), random_digits(dlen)))

        for c, c_times in zip(candidates, times):
            tstart = timeit.default_timer()
            trash = c(*prefix)
            c_times.append(timeit.default_timer()-tstart)
    for c, c_times in zip(candidates, times):
        print "  %-25s: min=%f avg=%f max=%f" % (c.func_name, min(c_times), np.mean(c_times), max(c_times))
space
  • Nice work putting it all together. As I commented, the trie solution is equivalent to the sorting solution, but it probably has higher constants since it uses more complex data structures. The other two approaches are asymptotically worse (and we can clearly see that). – Bakuriu Oct 03 '13 at 22:22
  • @Bakuriu. The trie solution and sorting solution aren't equivalent. The trie should be `O(len(input))` vs the sorted solution `O(log(n))` (number of tuples). The trie's performance theoretically should be independent of the dataset size. (A __binary tree__ would be equivalent) But as you suggest the higher constants associated with the more complex data structure outweigh its benefits until `n` is extremely large – space Oct 03 '13 at 23:17

There may be, but not by terribly much. `str.startswith` and `and` both short-circuit (they can return as soon as they find a failure), and indexing tuples is a fast operation. Most of the time spent here will be on object lookups, such as finding the `startswith` method for each string. Probably the most worthwhile option is to run it through PyPy.
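
One micro-variant along those lines that may be worth measuring (a sketch, not benchmarked here) replaces the per-element startswith lookups with slice comparisons:

# Slicing and comparing avoids one bound-method lookup per element; whether
# it actually wins should be verified with timeit.
def lookup_set_sliced(x_appr, y_appr):
    xn, yn = len(x_appr), len(y_appr)
    return [n for n in sa if n[0][:xn] == x_appr and n[1][:yn] == y_appr]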

Yann Vernier

A faster solution would be to create a dictionary, using the first value of each tuple as the key and the second as the value.

  1. Then you would search for keys matching x_appr in the sorted key list of the dictionary (keeping the keys sorted lets you optimize the search with a binary search, for example). This provides a list of keys, named for example k_list.

  2. Then look up the values of the dictionary whose key is in k_list and which match y_appr.

You can also fold the second step (checking that the value matches y_appr) in before appending to k_list, so that k_list contains only the keys of the correct elements of the dictionary; a sketch of this idea follows below.
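
A sketch of this idea, assuming the question's sa of string pairs (a list-valued dict is used so duplicate first values are kept; all names are illustrative):

import bisect
from collections import defaultdict

# Map each first value to all of its second values, and keep a sorted key list.
d = defaultdict(list)
for x, y in sa:
    d[x].append(y)
keys = sorted(d)

def lookup(x_appr, y_appr):
    # Binary search for the block of keys starting with x_appr; '\x7f' sorts
    # after any digit, giving an exclusive upper bound for the prefix range.
    lo = bisect.bisect_left(keys, x_appr)
    hi = bisect.bisect_left(keys, x_appr + '\x7f')
    return [(k, v) for k in keys[lo:hi]
                   for v in d[k] if v.startswith(y_appr)]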

Fabrice Jammes
  • Yeah, I was thinking about that. But the implementation needs to be flexible. That is, if the first lookup with, let's say, 4 digits doesn't return a match, I need to be able to try to match the first 3 digits and so on. – LarsVegas Oct 03 '13 at 08:43

Here I've just compared the `in` operator and the `find` method:

The CSV input file contains a list of URLs.

# -*- coding: utf-8 -*-

### test perfo str in set

import re
import csv
import timeit

cache = set()

#######################################################################

def checkinCache(c):
  global cache
  for s in cache:
    if c in s:
      return True
  return False

#######################################################################

def checkfindCache(c):
  global cache
  for s in cache:
    if s.find(c) != -1:
      return True
  return False

#######################################################################

print "1/3-loading pages..."
with open("liste_all_meta.csv.clean", "rb") as f:
    reader = csv.reader(f, delimiter=",")
    for i,line in enumerate(reader):
      cache.add(re.sub("'","",line[2].strip()))

print "  "+str(len(cache))+" PAGES IN CACHE"

print "2/3-test IN..."
tstart = timeit.default_timer()
for i in range(0, 1000):
  checkinCache("string to find"+str(i))
print timeit.default_timer()-tstart

print "3/3-test FIND..."
tstart = timeit.default_timer()
for i in range(0, 1000):
  checkfindCache("string to find"+str(i))
print timeit.default_timer()-tstart

print "\n\nBYE\n"

results in seconds:

1/3-loading pages...
  482897 PAGES IN CACHE
2/3-test IN...
107.765980005
3/3-test FIND...
167.788629055


BYE

So the `in` operator is faster than the `find` method :)

Have fun

c427