DISCLAIMER: I'm the author of trrex.
For exact matching, one approach is to use a Trie, as mentioned in the comments. trrex is a library that builds a Trie-Regex (a Trie in regular-expression format) that can be used with Python's regular expression engine:
import random
import re

import pandas as pd
import trrex as tx

df = pd.read_csv('jeopardy-small.csv')
with open('words-sample') as infile:
    words = [line.strip() for line in infile]

tuples = [(random.randint(1, 250), sentence) for sentence in df['question']]

def fun_kislyuk(ws, ts):
    # baseline: plain substring check of every word against every sentence
    return {t[0] for t in ts if any(w in t[1] for w in ws)}

def fun_trrex(ws, ts):
    # compile one trie-based regex covering all the words at once
    pattern = re.compile(tx.make(ws, left='', right=''))
    return {i for i, s in ts if pattern.search(s)}

if __name__ == "__main__":
    print(fun_trrex(words, tuples) == fun_kislyuk(words, tuples))
Output
True
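To give an intuition for why this is faster: the pattern trrex builds collapses the common prefixes of the words, so the regex engine does not have to try every word of a long alternation at every position. A minimal sketch (the actual pattern string is an internal detail of trrex; conceptually it looks like fo(?:od|x) rather than food|fox):

import re
import trrex as tx

# Two words sharing the prefix 'fo' are merged into a single branch,
# so the shared characters are examined only once per starting position.
pattern = re.compile(tx.make(['fox', 'food'], left='', right=''))
print(bool(pattern.search('the quick brown fox')))  # True
print(bool(pattern.search('nothing to see here')))  # False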
The timings for the above functions are:
%timeit fun_trrex(words, tuples)
11.3 ms ± 34.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fun_kislyuk(words, tuples)
67.5 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The data is a list of around 2K questions from Jeopardy and 500 randomly chosen words. You can find the resources for reproducing the experiments here.
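If you don't have those files at hand, a synthetic stand-in is enough to reproduce the shape of the experiment (hypothetical data, not the actual Jeopardy questions or word sample):

import random

# hypothetical substitutes for 'jeopardy-small.csv' and 'words-sample'
vocab = [f'word{i}' for i in range(5000)]
questions = [' '.join(random.choices(vocab, k=20)) for _ in range(2000)]  # ~2K sentences
words = random.sample(vocab, 500)                                         # 500 search words
tuples = [(random.randint(1, 250), q) for q in questions]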
UPDATE
If you add the grouping strategy mentioned in the comments, the improvement increases further: grouping the sentences by id means the search can stop for a group as soon as one of its sentences matches. Below is the function:
from collections import defaultdict

def fun_grouping_trrex(ws, ts):
    pattern = re.compile(tx.make(ws, left='', right=''))
    # group sentences by id, so each id is settled by its first matching sentence
    groups = defaultdict(list)
    for i, s in ts:
        groups[i].append(s)
    return {i for i, vs in groups.items() if any(pattern.search(v) for v in vs)}
and the timings:
%timeit fun_trrex(words, tuples)
11.2 ms ± 61.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fun_grouping_trrex(words, tuples)
4.96 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fun_kislyuk(words, tuples)
67.4 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The grouping + trrex approach gives roughly a 10x performance improvement, but take this last result with a grain of salt, because it is highly dependent on the dataset.
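One detail worth noting: passing left='' and right='' overrides trrex's default boundary anchors (word boundaries, \b), so the compiled regex matches substrings, mirroring the w in t[1] check in fun_kislyuk. If you want whole-word matching instead, keep the defaults; a small sketch:

import re
import trrex as tx

# default boundaries (\b on both sides): whole words only
whole = re.compile(tx.make(['cat']))
print(bool(whole.search('a cat sat')))    # True
print(bool(whole.search('concatenate')))  # False

# empty boundaries: substring matching, like the `in` operator
sub = re.compile(tx.make(['cat'], left='', right='')) 
print(bool(sub.search('concatenate')))    # True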