Specifying word boundaries for multiple string replacement with regex?

Question

I'm trying to mask city names in a list of texts using 'PAddress' tags. To do this, I borrowed thejonny's solution for performing multiple regex substitutions using a dictionary with regular expressions as keys. In my implementation, the cities are the keys and the values are tags that mirror the exact format of the keys (this matters because the format must be preserved down the line). E.g., {'East-Barrington': 'PAddress-PAddress'}, so East-Barrington would be replaced by PAddress-PAddress: one tag per word, with punctuation and spacing preserved. My code is below; sub_mult_regex() is the helper function called by mask_multiword_cities().

import difflib
import re

def sub_mult_regex(text, keys, tag_type):
    '''
    Replaces/masks multiple words at once
    Parameters:
        text: TIU note
        keys: a list of words to be replaced by the regex
        tag_type: string you want the words to be replaced with
    Creates a replacement dictionary of keys and values
    (each value has one tag per word in its key, preserving formatting).
    E.g., {'68 Oak St.': 'PAddress PAddress PAddress.'}
    Returns the masked text and a list of the terms that were masked
    '''
    # Creating a list of values to correspond with keys (see key:value example in docstring)

    add_vals = []
    for val in keys:
        add_vals.append(re.sub(r'\w{1,100}', tag_type, val)) # To preserve the precise punctuation, etc. formatting of the keys, only replacing word matches with tags

    # Zipping keys and values together as dictionary
    add_dict = dict(zip(keys, add_vals))

    # Compiling the keys together (regex)
    add_subs = re.compile("|".join("("+key+")" for key in add_dict), re.IGNORECASE)

    # This is where the multiple substitutions are happening
    # Taken from: https://stackoverflow.com/questions/66270091/multiple-regex-substitutions-using-a-dict-with-regex-expressions-as-keys
    group_index = 1
    indexed_subs = {}
    for target, sub in add_dict.items():
        indexed_subs[group_index] = sub
        group_index += re.compile(target).groups + 1
    if len(indexed_subs) > 0:
        text_sub = re.sub(add_subs, lambda match: indexed_subs[match.lastindex], text) # text_sub is masked 
    else:
        text_sub = text # Not all texts have names, so text_sub would've been NoneType and broken funct otherwise

    # Information on what words were changed pre and post masking (eg., would return 'ANN ARBOR' if that city was masked here)

    case_a = text
    case_b = text_sub

    diff_list = [li for li in difflib.ndiff(case_a.split(), case_b.split()) if li[0] != ' ']
    diff_list = [re.sub(r'[-,]', "", term.strip()) for term in diff_list if '-' in term]

    return text_sub, diff_list 

 

def mask_multiword_cities(text_string):
    multi_word_cities = list(set([city for city in us_cities_all if len(city.split(' ')) > 1 and len(city) > 3 and "Mc" not in city and "State" not in city and city != 'Mary D']))
    return sub_mult_regex(text_string, multi_word_cities, "PAddress")

The problem is that the keys in the regex dictionary don't have word boundaries specified. Only exact matches should be tagged (case-insensitively), but a phrase like 'around others' gets tagged because the city 'Round O' is technically a substring of it. Take this example text, run through the mask_multiword_cities function:

add_string = "The cities are Round O , NJ and around others"

mask_multiword_cities(add_string)

#(output): ('The cities are PAddress PAddress NJ , and aPAddress PAddressthers', [' Round', ' O', ' around', ' others'])

The output should only be ('The cities are PAddress PAddress NJ , and around others', [' Round', ' O']). I've tried converting each key to a regex like r"\b(?=\w)key\b(?!\w)" at various points in the sub_mult_regex function (lines 26 and 37), but that didn't work as expected.
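To make the boundary problem concrete, here is a minimal sketch comparing a bare pattern with one wrapped in `\b`. The variable names are only illustrative, and `re.escape` is an addition that is not in the code above:

```python
import re

city = "Round O"
text = "The cities are Round O , NJ and around others"

# Bare pattern: matches the city name anywhere, even inside other words
bare = re.compile(re.escape(city), re.IGNORECASE)

# Bounded pattern: \b requires a word boundary on both sides of the city name
bounded = re.compile(r"\b" + re.escape(city) + r"\b", re.IGNORECASE)

bare_hits = [m.group(0) for m in bare.finditer(text)]
bounded_hits = [m.group(0) for m in bounded.finditer(text)]

print(bare_hits)     # ['Round O', 'round o']
print(bounded_hits)  # ['Round O']
```

The second hit of the bare pattern is the 'round o' inside 'around others'; with `\b` there is no boundary between 'a' and 'r', so that match is rejected.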

For testing, assume that: us_cities_all = ['Great Barrington', 'Round O', 'East Orange'].

Also, if anyone can help make this run faster or more efficiently, that would be great! Right now, it takes about 30 seconds to run on a 1000-word note, likely because us_cities_all contains 5,000 cities. Let me know if it would be more helpful to post the cities list directly; I wasn't sure how to do so.
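One possible direction for the speed concern, sketched under assumptions rather than measured: compile a single alternation once, outside the per-note function, and compute the tags in a replacement callback, so there is no dictionary or group-index bookkeeping. The names `build_city_pattern`, `mask_cities`, and `CITY_PATTERN` are hypothetical, and this assumes every city name starts and ends with a word character (otherwise `\b` at the edges will not behave as intended).

```python
import re

def build_city_pattern(cities):
    # Longest names first, so a longer city wins over a shorter one sharing a prefix
    ordered = sorted(set(cities), key=len, reverse=True)
    # re.escape keeps characters such as '.' in a city name from acting as regex syntax
    alternation = "|".join(re.escape(city) for city in ordered)
    # One non-capturing group with a word boundary on each side
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

def mask_cities(text, pattern, tag="PAddress"):
    def tag_words(match):
        # Replace each word of the matched city with the tag, keeping spacing/punctuation
        return re.sub(r"\w+", tag, match.group(0))
    return pattern.sub(tag_words, text)

us_cities_all = ['Great Barrington', 'Round O', 'East Orange']
CITY_PATTERN = build_city_pattern(us_cities_all)  # compiled once, reused for every note

masked = mask_cities("The cities are Round O , NJ and around others", CITY_PATTERN)
print(masked)  # The cities are PAddress PAddress , NJ and around others
```

Whether this is actually faster on a 5,000-city list would need to be timed on real notes; the main point is that the pattern is built once instead of on every call.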

@Ramesh sure -- add_string is the example that causes the most obvious error, but a text like “The hospital is in East Orange and around o” would also overtag “round o” as a city when it isn’t one in that context (because of the word boundary issue I describe). The text input is always a string, and the goal is to find and mask/tag Address terms. -- skaleidoscope, Dec 31 '22 at 06:16
You want to replace `Round O` with `PAddress PAddress`, exactly as listed in `multi_word_cities`, and you want to capture the text after `and`, am I right? -- Ramesh, Dec 31 '22 at 07:35
Hi @Ramesh -- sorry for the confusion, I didn't understand your question the first time. I did make a typo initially and didn't include the word 'and' in the output; I've edited the post to reflect what the true output should look like. "The hospital is in East Orange and around o" would be incorrectly tagged as "The hospital is in PAddress PAddress and aPAddress PAddress". It looks like your solution would still work regardless; it would just need to leave out the line that excludes the 'and'. Thank you! -- skaleidoscope, Dec 31 '22 at 15:14
@Ramesh thank you very much! Would this solution work in test cases with more than one city? E.g., “There is one hospital in Great Barrington and another around o in Round O”, which should be masked as “There is one hospital in PAddress PAddress and another around o in PAddress PAddress”. -- skaleidoscope, Jan 01 '23 at 14:52
Yes, it is possible, but you need to change it a little bit. Hint: try the replacer function in my example. If you run into any difficulty, update the question with the expected output and how you want exceptional cases handled. -- Ramesh, Jan 02 '23 at 04:50

skaleidoscope · Answer 1 · 2023-01-02T21:35:46.943

I figured out a word-boundary based solution that would handle multiple cities, in case anyone might find it helpful in a similar situation:

import difflib
import re

def sub_mult_regex(text, keys, tag_type, city):
    '''
    Replaces/masks multiple words at once
    Parameters:
        text: TIU note
        keys: a list of words to be replaced by the regex
        tag_type: string you want the words to be replaced with
        city: bool, True if replacing cities, False if replacing anything else

    Creates a replacement dictionary of keys and values
    (each value has one tag per word in its key, preserving formatting).

    E.g., {'68 Oak St': 'PAddress PAddress PAddress'}

    Returns the masked text and a list of the terms that were masked
    '''

    # Creating a list of values to correspond with keys (see key:value example in docstring)

    if city:
        # If we're masking a city, handle word boundaries.
        # Only keeping keys that actually appear in the text speeds the code up a lot,
        # since the regex is built from the cities present rather than thousands of cities.
        text_lower = text.lower()  # case-insensitive pre-filter, consistent with re.IGNORECASE below
        present = [key for key in keys if key.lower() in text_lower]
        # Build the values from the plain city names: one tag per word,
        # preserving punctuation and spacing (Ex: 'Deer Island' --> 'PAddress PAddress')
        add_vals = [re.sub(r'\w{1,100}', tag_type, key) for key in present]
        # Add word boundaries to each key so only whole-word matches are replaced
        keys = [r"\b" + key + r"\b" for key in present]

    else:
        # If we're not masking a city, we don't do the word boundary step
        add_vals = []
        for val in keys:
            add_vals.append(re.sub(r'\w{1,100}', tag_type, val)) # To preserve the precise punctuation, etc. formatting of the keys, only replacing word matches with tags

    # Zipping keys and values together as dictionary
    add_dict = dict(zip(keys, add_vals))
    print("add_dict:", add_dict)

    # Compiling the keys together (regex)
    add_subs = re.compile("|".join("("+key+")" for key in add_dict), re.IGNORECASE)

    # This is where the multiple substitutions are happening
    # Taken from: https://stackoverflow.com/questions/66270091/multiple-regex-substitutions-using-a-dict-with-regex-expressions-as-keys

    group_index = 1
    indexed_subs = {}
    for target, sub in add_dict.items():
        indexed_subs[group_index] = sub
        group_index += re.compile(target).groups + 1
    if len(indexed_subs) > 0:
        text_sub = re.sub(add_subs, lambda match: indexed_subs[match.lastindex], text) # text_sub is masked text

    else:
        text_sub = text # Not all texts have names, so text_sub would've been NoneType and broken funct otherwise

    # Information on what words were changed pre and post masking (eg., would return 'ANN ARBOR' if that city was masked here)

    case_a = text
    case_b = text_sub

    diff_list = [li for li in difflib.ndiff(case_a.split(), case_b.split()) if li[0] != ' ']
    diff_list = [re.sub(r'[-,]', "", term.strip()) for term in diff_list if '-' in term]

 
    return text_sub, diff_list

# sample call:
add_string = 'The cities are Round O NJ, around others and East Orange'
mask_multiword_cities(add_string) # this function remained the same 

# output: add_dict: {'\\bEast Orange\\b': 'PAddress PAddress', '\\bRound O\\b': 'PAddress PAddress'} ('The cities are PAddress PAddress NJ, around others and PAddress PAddress', [' Round', ' O', ' East', ' Orange'])
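As a possible simplification of the reporting step, the masked terms can be recorded inside the replacement callback instead of being reconstructed with difflib afterwards. This is only a sketch: `mask_and_report` is a hypothetical name, it returns whole matched city names rather than the per-word list shown above, and it uses `re.escape`, which the code above does not.

```python
import re

def mask_and_report(text, cities, tag="PAddress"):
    if not cities:
        # Nothing to mask; an empty alternation would otherwise match empty strings
        return text, []
    # Longest names first, escaped, inside one non-capturing group with word boundaries
    ordered = sorted(set(cities), key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(city) for city in ordered) + r")\b",
        re.IGNORECASE,
    )
    masked_terms = []

    def replace(match):
        # Remember exactly what was matched, then swap each word for the tag
        masked_terms.append(match.group(0))
        return re.sub(r"\w+", tag, match.group(0))

    return pattern.sub(replace, text), masked_terms

text_sub, terms = mask_and_report(
    'The cities are Round O NJ, around others and East Orange',
    ['Great Barrington', 'Round O', 'East Orange'],
)
print(text_sub)  # The cities are PAddress PAddress NJ, around others and PAddress PAddress
print(terms)     # ['Round O', 'East Orange']
```

Because the callback sees each match directly, the reported terms keep the original casing from the text (e.g. 'ANN ARBOR' would be reported as written).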
