Efficient way to replace substring from list

Question

Hi, I have a large document stored as a single string and a list of proper names that might appear in it.

I would like to replace every occurrence of a name from the list with the tag [PERSON].

ex: sentence = "John and Marie went to school today....."

list = ["Maria", "John"....]

result = [PERSON] and [PERSON] went to school today

As you can see, I also want to catch close spelling variations of a name, such as Marie for Maria.

I know I can use a loop, but since both the list and the sentence are large, I suspect there is a more efficient way to do this. Thanks.

You will need to formalize what _exactly_ you mean by "differently but close". — NPE, Aug 10 '18 at 11:18
Please post the code you have written so far to narrow down your question to something that other users can understand and answer. — i alarmed alien, Aug 10 '18 at 11:19
By "differently" I mean spelling variations. How would you formalize that? — gannina, Aug 10 '18 at 11:25
I think a lot of people would consider 'Maria' and 'Marie' to be different names as they are not homophones (they sound different when spoken). You might be able to find a corpus of names grouped into homophones. — i alarmed alien, Aug 10 '18 at 11:28

score 1 · Answer 1 · answered Aug 10 '18 at 11:59

Use fuzzywuzzy to check whether each word in the sentence closely matches one of the names (a similarity score of at least 80 out of 100), and if so, replace it with [PERSON]:

>>> from fuzzywuzzy import process, fuzz
>>> names = ["Maria", "John"]
>>> sentence = "John and Marie went to school today....."
>>>
>>> match = lambda word: process.extractOne(word, names, scorer=fuzz.ratio, score_cutoff=80)
>>> ' '.join('[PERSON]' if match(word) else word  for word in sentence.split())
'[PERSON] and [PERSON] went to school today.....'
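
For a large document, the same idea can be sketched with only the standard library. This is a substitution, not the method above: difflib stands in for fuzzywuzzy, a cache ensures each distinct word is scored only once, and a regex callback leaves punctuation untouched. The names NAMES, is_name, and tag_persons are illustrative, and the 0.8 cutoff is assumed to mirror the score of 80 used above.

```python
import difflib
import re
from functools import lru_cache

# Hypothetical name list, taken from the question's example.
NAMES = ("Maria", "John")


@lru_cache(maxsize=None)
def is_name(word, cutoff=0.8):
    """Return True if `word` is at least `cutoff` similar to a known name."""
    # get_close_matches keeps candidates whose difflib ratio is >= cutoff.
    return bool(difflib.get_close_matches(word, NAMES, n=1, cutoff=cutoff))


def tag_persons(text):
    """Replace every run of letters that fuzzily matches a name."""
    # [^\W\d_]+ matches runs of letters only, so punctuation is preserved.
    return re.sub(
        r"[^\W\d_]+",
        lambda m: "[PERSON]" if is_name(m.group()) else m.group(),
        text,
    )


print(tag_persons("John and Marie went to school today....."))
# [PERSON] and [PERSON] went to school today.....
```

Note that difflib compares case-sensitively, so it may be worth lowercasing both sides first if capitalization in the document is unreliable.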

score 0 · Answer 2 · answered Aug 10 '18 at 11:37

You can put regular expressions in your input list to match spelling variations. For example, to match both Marie and Maria, use Mari(e|a) as the pattern. The entries are joined with | into a single compiled pattern, so the sentence is scanned only once:

import re

mySentence = "John and Marie and Maria went to school today....."
myList = ["Mari(e|a)", "John"]

myNewSentence = re.compile("|".join(myList)).sub('[PERSON]', mySentence)

print(myNewSentence)  # [PERSON] and [PERSON] and [PERSON] went to school today.....
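
If the variants can simply be listed rather than written as patterns, a minimal sketch under that assumption is to escape each name and wrap the alternation in word boundaries, so that John does not match inside Johnson. The names list and the tag_exact helper are hypothetical.

```python
import re

# Hypothetical list of spelled-out variants (plain strings, not patterns).
names = ["Marie", "Maria", "John"]

# Longest first, so a multi-word name wins over a shorter name it starts with.
alternation = "|".join(
    re.escape(name) for name in sorted(names, key=len, reverse=True)
)
# \b on both sides restricts matches to whole words.
pattern = re.compile(r"\b(?:" + alternation + r")\b")


def tag_exact(text):
    """Replace whole-word occurrences of any listed name in one pass."""
    return pattern.sub("[PERSON]", text)


print(tag_exact("John and Marie and Maria went to school today....."))
# [PERSON] and [PERSON] and [PERSON] went to school today.....
print(tag_exact("Johnson met John"))
# Johnson met [PERSON]
```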
