I have two Python lists: one contains about 13,000 disallowed phrases, and the other contains about 10,000 sentences.
phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    # ...
]
sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]
I need to check every sentence in the sentences list to see whether it contains any phrase from the phrases list. If it does, I want to put ** around the phrase and add the marked-up sentence to another list. I also need to do this as fast as possible.
This is what I have so far:
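import re
newlist = []  # collects the marked-up sentences
for sentence in sentences:
    for phrase in phrases:
        # Case-insensitive check (assumes the phrases are stored in lowercase)
        if phrase in sentence.lower():
            # Wrap every occurrence of the phrase in ** regardless of case
            iphrase = re.compile(re.escape(phrase), re.IGNORECASE)
            newsentence = iphrase.sub("**" + phrase + "**", sentence)
            newlist.append(newsentence)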
So far this approach takes about 60 seconds to complete.
I tried using multiprocessing (each sentence's for loop was mapped separately), but this was even slower. Each process was running at about 6% CPU usage, so it appears the overhead makes mapping such a small task to multiple cores not worth it. I thought about splitting the sentences list into smaller chunks and mapping those to separate processes, but haven't quite figured out how to implement this.
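To make the chunking idea concrete, here is a rough sketch of what I have in mind. It is untested at full scale, and the helper names (chunked, mark_chunk, mark_all) as well as the defaults for processes and chunk_size are made up for illustration. Each worker runs the same nested loop as above on one chunk, so the per-task overhead is paid once per chunk instead of once per sentence.

```python
import re
from multiprocessing import Pool


def chunked(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def mark_chunk(args):
    """Run the nested phrase loop over one chunk of sentences."""
    chunk, phrases = args
    marked = []
    for sentence in chunk:
        lowered = sentence.lower()  # lowercase once per sentence
        for phrase in phrases:
            if phrase in lowered:
                pattern = re.compile(re.escape(phrase), re.IGNORECASE)
                marked.append(pattern.sub("**" + phrase + "**", sentence))
    return marked


def mark_all(sentences, phrases, processes=4, chunk_size=500):
    """Map chunks of sentences to worker processes and merge the results."""
    # processes and chunk_size are arbitrary guesses, not tuned values.
    tasks = [(chunk, phrases) for chunk in chunked(sentences, chunk_size)]
    with Pool(processes) as pool:
        results = pool.map(mark_chunk, tasks)
    # Flatten the per-chunk lists back into one list, in the original order.
    return [sentence for part in results for sentence in part]
```

I would call it as newlist = mark_all(sentences, phrases) from inside an if __name__ == "__main__": guard, since multiprocessing needs that on platforms that spawn new processes. Note that the phrases list is sent to the workers with every chunk, which may eat part of the gain.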
I've also considered using a binary search algorithm but haven't been able to figure out how to use this with strings.
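For reference, this is as far as I got with the binary search idea. It is only a sketch: the function names are made up, and it relies on the fact that Python compares strings lexicographically, so the standard bisect module works on a sorted list of phrases. The catch is that binary search only answers "is this exact string one of the phrases?", so I still have to try every start position and every distinct phrase length in the sentence.

```python
from bisect import bisect_left


def contains_phrase(sorted_phrases, candidate):
    """Binary search: True if candidate equals one of the sorted phrases."""
    i = bisect_left(sorted_phrases, candidate)
    return i < len(sorted_phrases) and sorted_phrases[i] == candidate


def find_phrases(sentence, sorted_phrases, lengths):
    """Return (start, phrase) pairs for phrases found in the sentence.

    Start positions refer to the lowercased sentence.
    """
    lowered = sentence.lower()
    hits = []
    for start in range(len(lowered)):
        for n in lengths:
            candidate = lowered[start:start + n]
            # Skip slices that ran off the end of the sentence.
            if len(candidate) == n and contains_phrase(sorted_phrases, candidate):
                hits.append((start, candidate))
    return hits


# Tiny stand-in for the real phrases list (assumed to be lowercase).
example_phrases = sorted(["phrase1", "phrase2", "phrase with spaces"])
example_lengths = sorted({len(p) for p in example_phrases})
```

I am not sure this is actually a win, since a set lookup would do the same exact-match test without sorting, and the number of slices grows with the number of distinct phrase lengths.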
So essentially, what would be the fastest possible way to perform this check?