How can I replace text using a dict without replacing text found in a skip_words list?

my_text = "HelloWorld foobar Hello bar..."

my_dict = {
    "Hello": "Hi",
    "bar": "rab",
    ...
}

skip_words = ["HelloWorld", "foobar"]

for a, b in my_dict.items():
    my_text = my_text.replace(a, b)

I want to replace Hello -> Hi and bar -> rab, but I don't want to replace HelloWorld and foobar because they are in the skip_words list.

OAO
  • I was commenting on your other post, but you deleted it... It depends on the length of the string and on whether you want the code to be general. Imagine that you have a string with 1 million characters and 100 replacements to perform: you will need to read ~100M characters. In addition, the order of the replacements might affect the output. If your replacements are only words, a better solution might be to split the text and check each word. I don't have a perfect answer without knowing all the details. What you could do is time the multiple replacements under your conditions and see whether the result is acceptable. – mozway Nov 07 '21 at 20:47

3 Answers

One way would be to make a simple regex substitution with a replacement function:


import re

my_dict = { "Dog": "dog", "Cat": "cat" }
skip_words = set(["The Dog", "The Cat"])

result = re.sub(
    f'({"|".join(skip_words)}|{"|".join(my_dict.keys())})', 
    lambda x:x.group() if x.group() in skip_words else my_dict[x.group()], 
    "The Dog is Dog Dog Dog..."
)

print(result)

>>> The Dog is dog dog dog...

A short explanation:

f'({"|".join(skip_words)}|{"|".join(my_dict.keys())})', 

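Creates a regex string to match on, consisting of all skip words (first) and then all replacement words. The regex will match on any of these. Because alternatives are tried from left to right, a skip word wins whenever both a skip word and a replacement word could match at the same position.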

lambda x:x.group() if x.group() in skip_words else my_dict[x.group()], 

A function that returns the matched text unchanged if it is in skip_words, or the looked-up value from my_dict for any other match. That means the skip words are left as they are, while the other matches are replaced.

Note that I placed the skip words in a set for easier and more efficient lookup.
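
The same idea can be applied to the data from the question. The following is only a sketch under a few assumptions: the helper name `replace_except` is made up for illustration, the keys and skip words are treated as literal text (hence `re.escape`), and the alternatives are sorted longest-first so that a longer entry is tried before any shorter entry it starts with.

```python
import re

def replace_except(text, replacements, skip_words):
    """Replace keys of `replacements` in `text`, leaving `skip_words` untouched.

    Hypothetical helper for illustration; assumes at least one key or skip word.
    """
    skip = set(skip_words)
    # Longest alternatives first, so "HelloWorld" is tried before "Hello".
    candidates = sorted(skip | set(replacements), key=len, reverse=True)
    # re.escape keeps characters such as "." or "+" from acting as regex syntax.
    pattern = re.compile("|".join(re.escape(c) for c in candidates))
    return pattern.sub(
        lambda m: m.group() if m.group() in skip else replacements[m.group()],
        text,
    )

my_text = "HelloWorld foobar Hello bar..."
my_dict = {"Hello": "Hi", "bar": "rab"}
skip_words = ["HelloWorld", "foobar"]

print(replace_except(my_text, my_dict, skip_words))
# HelloWorld foobar Hi rab...
```

If a string appears both as a key and as a skip word, this sketch leaves it unchanged, since the skip check runs first.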

Joachim Isaksson

Do not use replace in a loop. It is highly inefficient, because the whole string has to be read again for each substitution.

Instead, craft a regex and pass each match to a function that maps it to the corresponding value in your dictionary:

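import re
my_dict = {"Dog": "dog", "Cat": "cat"}
my_text = "The Dog is Dog Dog Dog..."
# Each key may match only when it is not preceded by "The "
regex = '|'.join('(?<!The )%s' % re.escape(w) for w in my_dict.keys())
re.sub('(%s)' % regex, lambda x: my_dict[x.group()], my_text)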

output: 'The Dog is dog dog dog...'

Or, alternatively, split the text into words and look each word up in the dictionary. Note that this works only on full, independent words (the last "Dog" does not get replaced because of the trailing "..."), and that the "Dog" in "The Dog" is not protected here:

' '.join(my_dict.get(w, w) for w in my_text.split())

output: 'The dog is dog dog Dog...'
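
One way to keep the word-by-word approach while coping with trailing punctuation is to let a regex pick out the words instead of splitting on whitespace. This is a sketch, not part of the approach above: the function name `replace_whole_words` is hypothetical, and it assumes that a "word" is a run of `\w` characters.

```python
import re

def replace_whole_words(text, replacements):
    """Replace only whole words that are keys of `replacements`.

    Hypothetical helper; a word is assumed to be a run of \\w characters.
    """
    # Everything between the words (spaces, "...") is left exactly as it was.
    return re.sub(r"\w+", lambda m: replacements.get(m.group(), m.group()), text)

print(replace_whole_words("HelloWorld foobar Hello bar...", {"Hello": "Hi", "bar": "rab"}))
# HelloWorld foobar Hi rab...
```

With whole-word matching, single-word entries such as "HelloWorld" and "foobar" never equal a key, so they are skipped without a separate list. Multi-word phrases such as "The Dog" are still not protected by this sketch.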

mozway

You need to identify a pattern in the text that distinguishes the occurrences you want to replace from the ones you want to keep.

For example, if you want to replace every "Dog" that is not preceded by "The", you can do something like this:

In [1]: import re
In [2]: re.sub(r"(?<!\bThe\W)\bDog", "dog", text)
Out[2]: 'The Dog is dog dog dog...'

This is called a negative lookbehind.
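
For the strings in the original question, the same technique might look like the sketch below. The two patterns are hand-written for this specific data and are an assumption about what should be protected: "Hello" is guarded with a negative lookahead (not followed by "World"), and "bar" with a negative lookbehind (not preceded by "foo"). Keep in mind that Python's `re` module only accepts fixed-width lookbehind patterns.

```python
import re

text = "HelloWorld foobar Hello bar..."

# Negative lookahead: match "Hello" only when "World" does not follow it.
result = re.sub(r"Hello(?!World)", "Hi", text)
# Negative lookbehind: match "bar" only when "foo" does not precede it.
result = re.sub(r"(?<!foo)bar", "rab", result)

print(result)
# HelloWorld foobar Hi rab...
```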

Mohammad Jafari