
I have a string, a mapping of replacement rules, and a set of phrases that must not be replaced.

E.g.

"This is an example sentence that needs to be processed into a new sentence."
"This is a second example sentence that shows how 'sentence' in 'sentencepiece' should not be replaced."

Replacement rules:

replace_dictionary = {'sentence': 'processed_sentence'}
no_replace_set = {'example sentence'}

Result:

"This is an example sentence that needs to be processed into a new processed_sentence."
"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."

Additional criteria:

  1. Replace only on an exact case match, i.e. matching is case-sensitive.
  2. Replace whole words only. Punctuation should be ignored when matching, but kept after the replacement.

What would be the cleanest way to solve this problem in Python 3.x?

Jovan Andonov
  • Did you try building a regex to replace this? https://docs.python.org/3/howto/regex.html#search-and-replace – mpSchrader Oct 15 '20 at 12:26

2 Answers


Based on the answer of demongolem.

UPDATE

Sorry, I missed the fact that only whole words should be replaced. I updated my code and generalized it into a function.

import re

def replace_whole(sentence, replace_token, replace_with, dont_replace):
    # \b matches at punctuation, quotes, spaces and the string edges;
    # re.escape guards against regex metacharacters in the token.
    rx = re.compile(rf"\b{re.escape(replace_token)}\b")
    context_size = len(dont_replace)
    pieces = []
    last_end = 0
    for m in rx.finditer(sentence):
        # Window around the match, wide enough to contain the no-replace phrase.
        start = max(0, m.start() - context_size)
        context = sentence[start:m.end() + context_size]
        # Copy the untouched text between the previous match and this one.
        pieces.append(sentence[last_end:m.start()])
        if dont_replace in context:
            pieces.append(m.group())  # protected: keep the word as it is
        else:
            pieces.append(replace_with)
        last_end = m.end()
    pieces.append(sentence[last_end:])
    return "".join(pieces)

Use regular expressions with finditer() to find every occurrence of the word together with its position, since we need to check whether it is a whole word or embedded in a longer one. You might need to adjust rx to match your definition of a "whole word". For each match, take the surrounding context, sized by the length of your no_replace rule, and check whether that context contains the no_replace string. If it does, the word is kept; otherwise it is replaced. The result is assembled from the slices between matches rather than with str.replace() on the whole text, so each replacement touches only the matched occurrence, earlier replacements are not lost, and the punctuation around the word is preserved.
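What counts as a "whole word" depends on the pattern. As a minimal sketch (the helper name find_whole_words is made up for illustration), the standard \b word-boundary anchor treats punctuation, quotes and the start or end of the string as boundaries, so no explicit character class is needed:

```python
import re

def find_whole_words(token, text):
    """Return the start offsets of case-sensitive, whole-word matches of token."""
    pattern = re.compile(rf"\b{re.escape(token)}\b")
    return [m.start() for m in pattern.finditer(text)]

text = "sentence, 'sentence' sentencepiece Sentence sentence."
print(find_whole_words("sentence", text))  # [0, 11, 44]
```

Note that the underscore counts as a word character, so "sentence" inside "processed_sentence" is not a whole-word match and already replaced text is not matched again.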

Using your examples (with sen1 and sen2 holding the two sentences from the question), this leads to:

>>> replace_whole(sen2, "sentence", "processed_sentence", "example sentence")
"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."

and

>>> replace_whole(sen1, "sentence", "processed_sentence", "example sentence")
'This is an example sentence that needs to be processed into a new processed_sentence.'
MichaelJanz
  • I like the approach and I think I can generalize it to my case. However, r'(sentence)' doesn't really work for me. If you run your code on my second example it will not work. – Jovan Andonov Oct 15 '20 at 18:22
  • That's weird. I will have a look at it. – MichaelJanz Oct 16 '20 at 08:17
  • Ah, I see, I missed that only whole words should be replaced, sorry. I updated my answer so that it suits your needs. – MichaelJanz Oct 16 '20 at 08:50
  • In the end I went for a slightly different approach that works better in my case and that I believe to be cleaner, but I would like to thank you very much for the effort nonetheless! – Jovan Andonov Oct 18 '20 at 16:29
  • Glad you found a better-suited solution. It would be good practice to upvote my answer, as it helped you further, and to honor my effort. – MichaelJanz Oct 19 '20 at 07:22

After some research, this is what I believe to be the best and cleanest solution to my problem. pattern.sub() calls match_fun for every match, and match_fun performs the replacement only if no "no-replace phrase" overlaps the current match; otherwise it returns the matched text unchanged. Let me know if you need more clarification or if you believe something can be improved.

import re

replace_dict = ...  # The code below assumes you already have this
no_replace_dict = ...  # Same here; maps each replaceable word to its no-replace phrases
text = ...  # The input text.

def match_fun(match: re.Match) -> str:
    str_match: str = match.group()

    # No no-replace phrases are registered for this word: always replace.
    if str_match not in no_replace_dict:
        return replace_dict[str_match]

    for no_replace in no_replace_dict[str_match]:
        # While sub() runs, `text` is still the string being scanned,
        # so these offsets are comparable with those of `match`.
        for no_replace_match in re.finditer(r'\b' + re.escape(no_replace) + r'\b', text):
            # Keep the word if the two spans overlap in any way, including
            # a phrase that extends past both ends of the match.
            if no_replace_match.start() < match.end() and no_replace_match.end() > match.start():
                return str_match

    return replace_dict[str_match]

for replace in replace_dict:
    pattern = re.compile(r'\b' + re.escape(replace) + r'\b')
    text = pattern.sub(match_fun, text)
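For readers who want something to run directly, here is a self-contained sketch of the same callback idea applied to the data from the question. It is not the exact code above: the function name replace_with_exceptions is hypothetical, it takes a plain set of no-replace phrases instead of a dict of lists, and it collects the protected spans once up front instead of re-scanning the text for every match.

```python
import re

def replace_with_exceptions(text, replace_dict, no_replace_phrases):
    """Replace whole words from replace_dict unless they overlap a no-replace phrase."""
    if not replace_dict:
        return text

    # Collect the (start, end) span of every protected phrase once.
    protected = [
        m.span()
        for phrase in no_replace_phrases
        for m in re.finditer(rf"\b{re.escape(phrase)}\b", text)
    ]

    def substitute(match):
        start, end = match.span()
        # Two half-open spans overlap when each starts before the other ends.
        if any(p_start < end and start < p_end for p_start, p_end in protected):
            return match.group()
        return replace_dict[match.group()]

    # One alternation pass keeps every offset relative to the original text.
    words = sorted(replace_dict, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    return pattern.sub(substitute, text)

rules = {"sentence": "processed_sentence"}
keep = {"example sentence"}
print(replace_with_exceptions(
    "This is an example sentence that needs to be processed into a new sentence.",
    rules, keep))
```

Running all rules in a single pass is a design choice made for this sketch: it means the output of one rule is never fed into another rule, which may or may not be what you want.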
Jovan Andonov