-1

Hi, I have a large document stored as a single string, and a list of proper names that might appear in the document.

I would like to replace every occurrence of a name from the list with the tag [PERSON].

Example:

sentence = "John and Marie went to school today....."

list = ["Maria", "John"....]

result = "[PERSON] and [PERSON] went to school today....."

As you can see, there may be spelling variations of a name that I still want to catch, such as Maria and Marie, which are spelled differently but are close.

I know I can use a loop, but since both the list and the sentence are large, there might be a more efficient way to do this. Thanks.

Rakesh
gannina
  • You will need to formalize _exactly_ what you mean by "differently but close". – NPE Aug 10 '18 at 11:18
  • Please post the code you have written so far to narrow down your question to something that other users can understand and answer. – i alarmed alien Aug 10 '18 at 11:19
  • By "differently" I mean spelling variations. How would you formalize that? – gannina Aug 10 '18 at 11:25
  • I think a lot of people would consider 'Maria' and 'Marie' to be different names as they are not homophones (they sound different when spoken). You might be able to find a corpus of names grouped into homophones. – i alarmed alien Aug 10 '18 at 11:28
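
One possible way to formalize "differently but close", as the comments ask, is a similarity score with a threshold. The sketch below is only an illustration using the standard library's difflib; the helper names `similarity` and `is_close` and the 0.8 threshold are assumptions, not something taken from the question.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_close(word: str, name: str, threshold: float = 0.8) -> bool:
    """Treat two spellings as the same name if their ratio reaches the threshold."""
    return similarity(word, name) >= threshold


print(is_close("Marie", "Maria"))  # True: 4 of 5 letters line up, ratio 0.8
print(is_close("John", "Maria"))   # False: no letters in common
```

Raising the threshold makes matching stricter, and lowering it catches more variants at the cost of more false positives, so the value would need tuning against real data.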

2 Answers

1

Use fuzzywuzzy to check whether each word in the sentence closely matches one of the names (a score of at least 80 out of 100) and, if so, replace it with [PERSON]:

>>> from fuzzywuzzy import process, fuzz
>>> names = ["Maria", "John"]
>>> sentence = "John and Marie went to school today....."
>>>
>>> match = lambda word: process.extractOne(word, names, scorer=fuzz.ratio, score_cutoff=80)
>>> ' '.join('[PERSON]' if match(word) else word  for word in sentence.split())
'[PERSON] and [PERSON] went to school today.....'
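
Since the question mentions that both the text and the name list are large, one option is to score each distinct word only once and reuse the result. The following is a minimal sketch under that assumption; it swaps fuzzywuzzy for the standard library's difflib.get_close_matches so it runs without extra packages, and the function name `tag_names` and its parameters are hypothetical.

```python
import re
from difflib import get_close_matches


def tag_names(text, names, cutoff=0.8, tag="[PERSON]"):
    """Replace words that closely match any name, scoring each distinct word once."""
    lowered = [name.lower() for name in names]
    cache = {}  # lowercased word -> True if it is close to some name

    def replace(match):
        word = match.group(0)
        key = word.lower()
        if key not in cache:
            cache[key] = bool(get_close_matches(key, lowered, n=1, cutoff=cutoff))
        return tag if cache[key] else word

    # [^\W\d_]+ matches runs of letters only, so punctuation is left untouched
    return re.sub(r"[^\W\d_]+", replace, text)


print(tag_names("John and Marie went to school today.....", ["Maria", "John"]))
# [PERSON] and [PERSON] went to school today.....
```

The cache helps most when the document repeats the same words often. Each new word is still compared against the whole name list, so a very long list may need a different indexing strategy.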
Sunitha
0

You can put regular expressions in your input list to match words with spelling variations. For example, to match both Marie and Maria, use Mari(e|a) as the pattern. Here is the corresponding code:

import re

mySentence = "John and Marie and Maria went to school today....."
myList = ["Mari(e|a)", "John"]

myNewSentence = re.compile("|".join(myList)).sub('[PERSON]', mySentence)

print(myNewSentence)  # [PERSON] and [PERSON] and [PERSON] went to school today.....
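
One caveat with joining the patterns directly is that they also match inside longer words, so "John" would be replaced within "Johnson". A possible variant, sketched here with a hypothetical helper name `tag_people`, wraps the alternatives in word boundaries:

```python
import re


def tag_people(text, patterns, tag="[PERSON]"):
    """Replace whole-word matches of any pattern with the tag."""
    # (?:...) groups the alternatives so both \b anchors apply to every pattern
    combined = re.compile(r"\b(?:" + "|".join(patterns) + r")\b")
    return combined.sub(tag, text)


print(tag_people("John and Marie met Johnson.", ["Mari(e|a)", "John"]))
# [PERSON] and [PERSON] met Johnson.
```

If some entries are plain names rather than patterns, passing them through re.escape first keeps characters such as "." from being read as regex syntax.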
Laurent H.