I have the following string:

oldstring = 'michael smith passes both danny jones III and michael robinson on turn 3!'

I'd like to use the oldstring above and the racer_dict below (or a better solution) to create the newstring below.

name = ['michael smith sr', 'darrel michael robinson', 'danny jones III']
racing_number = ['44', '15', '32']
racer_dict = dict(zip(name, racing_number))
newstring = '44 passes both 32 and 15 on turn 3!'

It's a complicated problem because, as in the example:

  1. sometimes the name being replaced completely matches the racer_dict key
  2. the number of words in the names being replaced is not consistent
  3. the same word can show up in two different drivers' names (however, I wouldn't expect the same two words to show up in two different drivers' names).

Below is the solution I've come up with on my own, but it seems a bit cumbersome:

import re

# Replace the name in oldstring when it matches the exact name in the dict
newstring = oldstring
for old in name:
    if old in newstring:
        # re.escape guards against names containing regex metacharacters
        newstring = re.sub(re.escape(old), racer_dict[old], newstring)

# Now look for when two consecutive words from oldstring are found in the
# dict name, and replace them too
replacers = {}
nsw = newstring.split(' ')

for i in range(len(nsw) - 1):
    potential_name = nsw[i] + ' ' + nsw[i + 1]
    key_name = [x for x in name if potential_name in x]
    if key_name:
        replacers[potential_name] = racer_dict[key_name[0]]

for old, number in replacers.items():
    newstring = re.sub(re.escape(old), number, newstring)

print(newstring)
# 44 passes both 32 and 15 on turn 3!
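The same two-pass idea can probably be collapsed into a single regex substitution. The following is only a sketch, not a tested general solution: `build_lookup` and `replace_names` are hypothetical helper names, and it assumes that a mention in the string is always a run of at least two consecutive words from a dict key, with matching case.

```python
import re

def build_lookup(racer_dict, min_words=2):
    """Map every run of at least min_words consecutive words of each key to its number."""
    lookup = {}
    for full_name, number in racer_dict.items():
        words = full_name.split()
        for size in range(len(words), min_words - 1, -1):
            for start in range(len(words) - size + 1):
                # setdefault keeps the first driver seen if two keys share a run
                lookup.setdefault(' '.join(words[start:start + size]), number)
    return lookup

def replace_names(text, racer_dict, min_words=2):
    lookup = build_lookup(racer_dict, min_words)
    if not lookup:
        return text
    # Longest alternatives first, so 'danny jones III' wins over 'danny jones'
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, alternatives)) + r')\b')
    return pattern.sub(lambda m: lookup[m.group(0)], text)

name = ['michael smith sr', 'darrel michael robinson', 'danny jones III']
racing_number = ['44', '15', '32']
racer_dict = dict(zip(name, racing_number))
oldstring = 'michael smith passes both danny jones III and michael robinson on turn 3!'

print(replace_names(oldstring, racer_dict))
# 44 passes both 32 and 15 on turn 3!
```

Single-word mentions are deliberately not matched here, because one word can belong to two drivers; lowering `min_words` to 1 would make such cases ambiguous.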
bshelt141
  • What is your current code? Please show to see where you are heading. – Wiktor Stribiżew Aug 14 '21 at 19:30
  • @WiktorStribiżew - the goal is to programmatically replace the three driver names in the `oldstring` with their corresponding `racing_number` from the `racer_dict`, even though the `name` keys in the `racer_dict` do not completely match the names in `oldstring`. – bshelt141 Aug 14 '21 at 19:34
  • @bshelt141 please define (in your question) what you mean by “even if keys do not completely match names”. When should it match: when it has 1 word in common? 2 words in common? All words in common? If it’s e.g. 2 words, do they have to be consecutive in the key? Should matching ignore the case? Ignore white spaces? – Cimbali Aug 14 '21 at 21:20

1 Answer

Maybe:

#pip install spacy
#python -m spacy download en_core_web_sm
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

#pip install fuzzywuzzy
#pip install python-Levenshtein
from fuzzywuzzy import fuzz

#https://stackoverflow.com/questions/51214026/ and answer by Golden Lion
def find_persons(text):
    # Create Doc object
    doc2 = nlp(text)

    # Identify the persons
    persons = [ent.text for ent in doc2.ents if ent.label_ == 'PERSON']

    # Return persons
    return persons

#see https://towardsdatascience.com/e982c61f8a84
def match_names(name, list_names, min_score=0):
    max_score = -1
    max_name = ''
    for x in list_names:
        score = fuzz.ratio(name, x)
        if score > min_score and score > max_score:
            max_name = x
            max_score = score
    return (max_name, max_score)

#see https://stackoverflow.com/questions/2400504/ and answer by ChristopheD
def multipleReplace(text, wordDict):
    for key, value in wordDict.items():
        text = text.replace(value[0], value[1])
    return text


oldstring = 'michael smith passes both danny jones III and michael robinson on turn 3!'

source_names = ['michael smith sr', 'darrel michael robinson', 'danny jones III']

racing_number = ['44', '15', '32']

racer_dict = dict(zip(source_names, racing_number))

#using spacy and find_persons()
foundNames = find_persons(oldstring)

#using fuzzywuzzy
names = []
for x in foundNames:
    match = match_names(x, source_names, 75)
    if match[1] >= 75:
        name = (str(match[0]), str(x))
        names.append(name)
name_dict = dict(names)

# https://stackoverflow.com/questions/11313568/ and the answer by Martijn Pieters
swaps = {k: [name_dict[k], racer_dict[k]] for k in name_dict}

newstring = multipleReplace(oldstring, swaps)

#output old and new side by side
print('  Old: ', oldstring, '\n\n ', 'New: ', newstring)

  Old:  michael smith passes both danny jones III and michael robinson on turn 3! 

  New:  44 passes both 32 and 15 on turn 3!

Works on the string provided. I tried it on another set of data:

oldstring = 'Geoff Jones moves up on Sebastian Hughes, but the battle for the lead is between Patel and Larry'

source_names = ['Sebastian Hughes', 'Geoff Jones', 'Larry Gordon', 'Jeff Patel']

racing_number = ['5', '33', '44', '7']

...but I had to lower both thresholds (the `75` passed to `match_names` and the `match[1] >= 75` check) to 55 to get it to match all the names. With that change I get:

  Old:  Geoff Jones moves up on Sebastian Hughes, but the battle for the lead is between Patel and Larry 

  New:  33 moves up on 5, but the battle for the lead is between 7 and 44
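A rough way to see why the threshold has to drop for this data: a single-word mention such as 'Patel' covers only part of the full name, so a whole-string similarity score is penalised for the missing words. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`; that substitution is an assumption, `ratio` is a hypothetical helper, and fuzzywuzzy's exact scores may differ slightly.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Similarity on a 0-100 scale (stand-in for fuzz.ratio, not the same implementation)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

# (name as found in the sentence, full name in source_names)
pairs = [('Patel', 'Jeff Patel'),
         ('Larry', 'Larry Gordon'),
         ('Geoff Jones', 'Geoff Jones')]

scores = {found: ratio(found, full) for found, full in pairs}
print(scores)
# {'Patel': 67, 'Larry': 59, 'Geoff Jones': 100}
```

Both partial names score below 75 but above 55, which is consistent with the lower limit being needed. A threshold that low may also let through wrong matches on other data, so it is worth checking against real commentary text.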
MDR