Performance - Fastest way to compare 2 large lists of strings in Python

Question

I have two Python lists: one contains about 13000 disallowed phrases, and the other contains about 10000 sentences.

phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    # ...
]

sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]

I need to check every sentence in the sentences list to see if it contains any phrase from the phrases list. If it does, I want to put ** around the phrase and add the resulting sentence to another list. I also need to do this in the fastest possible way.

This is what I have so far:

import re

newlist = []  # sentences with each matched phrase wrapped in **
for sentence in sentences:
    for phrase in phrases:
        if phrase in sentence.lower():
            iphrase = re.compile(re.escape(phrase), re.IGNORECASE)
            newsentence = iphrase.sub("**"+phrase+"**", sentence)
            newlist.append(newsentence)

So far this approach takes about 60 seconds to complete.

I tried using multiprocessing (each sentence's for loop was mapped separately), but this was even slower. Given that each process was running at about 6% CPU usage, it appears the overhead makes mapping such a small task to multiple cores not worth it. I thought about splitting the sentences list into smaller chunks and mapping those to separate processes, but haven't quite figured out how to implement this.

I've also considered using a binary search algorithm but haven't been able to figure out how to use this with strings.

So essentially, what would be the fastest possible way to perform this check?

What if you broke your sentence list up into x parts (where x is the number of cores) and sent each part to its own multiprocessing worker? — SteveJ, May 11 '18 at 04:33
There is no point recompiling every phrase regex 10,000 times. I suggest you compile them in advance and put them into a separate list. — Selcuk, May 11 '18 at 04:45
@Selcuk I failed to mention that not all messages have phrases in them, so I didn't see the need to compile 13000 regex phrases when it is likely only 20 would be used. — Chimney Swift, May 12 '18 at 00:27
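
The chunking idea raised in the question and in the first comment could be sketched roughly as follows. This is only an illustration, not something posted in the thread: the helper names (split_into_chunks, mark_chunk, mark_parallel) and the default of 4 workers are assumptions, and the per-sentence check is the same nested loop as in the question.

```python
import re
from multiprocessing import Pool


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, roughly equal slices."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def mark_chunk(args):
    """Worker (hypothetical helper): run the question's check on one chunk."""
    chunk, phrases = args
    marked = []
    for sentence in chunk:
        lowered = sentence.lower()  # lowercase once per sentence
        for phrase in phrases:
            if phrase in lowered:
                pattern = re.compile(re.escape(phrase), re.IGNORECASE)
                # A function replacement avoids backslash handling in the phrase.
                marked.append(pattern.sub(lambda m: "**" + phrase + "**", sentence))
    return marked


def mark_parallel(sentences, phrases, n_workers=4):
    """Send one chunk per worker, so each process gets a large unit of work."""
    chunks = split_into_chunks(sentences, n_workers)
    with Pool(n_workers) as pool:
        parts = pool.map(mark_chunk, [(chunk, phrases) for chunk in chunks])
    return [sentence for part in parts for sentence in part]
```

On platforms that start workers by spawning a fresh interpreter, mark_parallel would need to be called from under an `if __name__ == "__main__":` guard. Whether this beats a single process depends on how the cost of copying the phrase list to each worker compares with the matching work itself.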

score 3 · Accepted Answer · answered May 11 '18 at 05:51

Build your regex once, sorting the phrases longest first so that the **s wrap the longest matching phrase rather than a shorter one. Then perform the substitution and filter out the sentences in which no substitution was made, e.g.:

phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    'can be really really',
    'characters',
    'some sentences'
    # ...
]

sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]

import re

# Build the regex string required
rx = '({})'.format('|'.join(re.escape(el) for el in sorted(phrases, key=len, reverse=True)))
# Generator to yield replaced sentences
it = (re.sub(rx, r'**\1**', sentence) for sentence in sentences)
# Build list of paired new sentences and old to filter out where not the same
results = [new_sentence for old_sentence, new_sentence in zip(sentences, it) if old_sentence != new_sentence]

This gives you the following results:

['**some sentences** are longer',
 '**some sentences** **can be really really** ... really long, about 1000 **characters**.']
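
A possible variant of the same idea, offered as a sketch rather than as the answer's own code: compile the combined pattern once, add re.IGNORECASE to mirror the case-insensitive check in the question, and use the substitution count from subn to decide which sentences to keep. The function name highlight_phrases is hypothetical.

```python
import re


def highlight_phrases(sentences, phrases):
    """Return only the sentences that contain a phrase, with matches wrapped in **."""
    if not phrases:
        return []  # an empty alternation would match the empty string everywhere
    # Longest first: regex alternation takes the first alternative that matches,
    # so "some sentences" must come before "some" to win.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in ordered), re.IGNORECASE)

    results = []
    for sentence in sentences:
        # subn returns the new string and the number of substitutions made.
        new_sentence, count = pattern.subn(lambda m: "**" + m.group(0) + "**", sentence)
        if count:
            results.append(new_sentence)
    return results
```

Unlike the loop in the question, which inserts the phrase as it is spelled in the phrases list, this keeps the casing found in the sentence.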

Thank you! This worked _much_ better, processed 8736 sentences in 3.13 seconds with a phrase list of 12764. — Chimney Swift, May 12 '18 at 01:47

score 0 · Answer 2 · answered May 11 '18 at 04:53

What about set comprehension?

found = {'**' + p + '**' for s in sentences for p in phrases if p in s}

You could also try shrinking the phrases list as you go, if you don't mind altering it:

found = []
remaining = phrases[:]  # shallow copy, so the list being iterated is never mutated
for s in sentences:
    for phrase in phrases:
        if phrase in s:
            remaining.remove(phrase)
            found.append('**' + phrase + '**')
    phrases = remaining[:]  # the next sentence only checks phrases not yet seen

Basically, each sentence processed can shrink the phrases container. For every sentence, we iterate through the latest container and look for phrases that occur in it.

Each phrase found is removed from the copied list. Once all the current phrases have been checked against the sentence, we replace the container with the reduced subset (the phrases that haven't been seen yet). We do this because we only need to see each phrase once, so checking it again is unnecessary, even though it may also occur in another sentence.
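
The reduction described above could also be wrapped in a function that leaves the caller's list untouched and stops early once every phrase has been seen. This is a minimal sketch under those assumptions; the name find_seen_phrases is hypothetical. Like the code in this answer, it collects the marked phrases themselves, not the marked sentences the question asks for.

```python
def find_seen_phrases(sentences, phrases):
    """Return each phrase (wrapped in **) that occurs in at least one sentence."""
    remaining = list(phrases)  # work on a copy; the caller's list is not modified
    found = []
    for s in sentences:
        still_unseen = []
        for phrase in remaining:
            if phrase in s:
                found.append("**" + phrase + "**")
            else:
                still_unseen.append(phrase)
        remaining = still_unseen  # later sentences only check unseen phrases
        if not remaining:
            break  # every phrase has been seen; nothing left to check
    return found
```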
