Replace text that matches a dict key with the dict value, but don't replace text that matches an entry in a list

Question

How can I replace text using a dict, but leave alone any text found in the skip_words list?

my_text = "HelloWorld foobar Hello bar..."

my_dict = {
    "Hello": "Hi",
    "bar": "rab",
    ...
}

skip_words = ["HelloWorld", "foobar"]

for a, b in my_dict.items():
    my_text = my_text.replace(a, b)

I want to replace Hello -> Hi and bar -> rab, but I don't want to replace HelloWorld and foobar because they are in the skip_words list.

I was commenting on your other post, but you deleted it... It depends on the length of the string and on whether you want the code to be general. Imagine that you have a string of 1 million characters and 100 replacements to perform: you will need to read ~100M characters. In addition, the order of the replacements might affect the output. If your replacements are only words, a better solution might be to split the text and check each word. I don't have the perfect answer without knowing all the details. What you could do is time the multiple replacements under your conditions and see whether the result is acceptable. – mozway, Nov 07 '21 at 20:47

score 1 · Accepted Answer · answered Oct 25 '21 at 11:19

One way would be to make a simple regex substitution with a replacement function:


import re

my_dict = { "Dog": "dog", "Cat": "cat" }
skip_words = set(["The Dog", "The Cat"])

result = re.sub(
    f'({"|".join(skip_words)}|{"|".join(my_dict.keys())})', 
    lambda x:x.group() if x.group() in skip_words else my_dict[x.group()], 
    "The Dog is Dog Dog Dog..."
)

print(result)

>>> The Dog is dog dog dog...

A short explanation:

f'({"|".join(skip_words)}|{"|".join(my_dict.keys())})',

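Builds the regex string to match on: all skip words first, then all replacement keys. The regex matches any of these. Alternation is tried left to right, so at any position a skip word is matched (and consumed) before a key contained in it can match.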

lambda x:x.group() if x.group() in skip_words else my_dict[x.group()],

A function that returns the word(s) itself for words in skip_words or the looked up version from my_dict for any other matched words. That means, the skip words are not replaced, the other matches are.

Note that I placed the skip words in a set for easier and more efficient lookup.
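A slightly more defensive sketch of the same idea, applied to the data from the question (only the two dict entries shown there). It assumes the keys and skip words are literal strings, so each is passed through `re.escape`, and it sorts the alternatives longest first so that a skip word is always tried before a shorter key it contains. The helper name `replace_except` is hypothetical, not part of the answer above.

```python
import re

def replace_except(text, mapping, skip_words):
    """Replace keys of `mapping` in `text`, leaving `skip_words` untouched."""
    skip = set(skip_words)
    # Longest alternatives first, so "HelloWorld" is tried before "Hello".
    alternatives = sorted(skip | set(mapping), key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in alternatives))
    # Skip words are matched (and so consumed) but returned unchanged.
    return pattern.sub(
        lambda m: m.group() if m.group() in skip else mapping[m.group()],
        text,
    )

my_text = "HelloWorld foobar Hello bar..."
my_dict = {"Hello": "Hi", "bar": "rab"}
skip_words = ["HelloWorld", "foobar"]

result = replace_except(my_text, my_dict, skip_words)
print(result)  # HelloWorld foobar Hi rab...
```

Because the whole text is scanned once, the cost does not grow with one full pass per dict entry, unlike calling `str.replace` in a loop.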

score 0 · Answer 2 · mozway · 2021-10-25T11:11:18.673

Do not use replace in a loop. It is highly inefficient, because the whole string has to be read again for each substitution.

Instead, craft a regex and pass each match to a function that looks the value up in your dictionary:

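import re
my_dict = {"Dog": "dog", "Cat": "cat"}  # assumed: same example data as the accepted answer
my_text = "The Dog is Dog Dog Dog..."
# Negative lookbehind: skip any key that directly follows "The "
regex = '|'.join('(?<!The )%s' % re.escape(w) for w in my_dict)
re.sub(regex, lambda x: my_dict[x.group()], my_text)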

output: 'The Dog is dog dog dog...'

Alternatively, split the text into words and test each word against the dictionary. Note that this works only on full, independent words (the last "Dog" doesn't get replaced because of the trailing "..."):

' '.join(my_dict.get(w, w) for w in my_text.split())

output: 'The dog is dog dog Dog...'
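One possible way around the trailing-punctuation problem is to let a regex pick out the words instead of `str.split`. This is a minimal sketch, assuming the same example data as the accepted answer (`my_dict` and `my_text` are not defined in this answer) and assuming that "word" means a run of `\w` characters.

```python
import re

my_dict = {"Dog": "dog", "Cat": "cat"}   # assumed example data
my_text = "The Dog is Dog Dog Dog..."    # assumed example text

# \w+ matches each run of word characters, so "Dog..." yields the token "Dog".
result = re.sub(r"\w+", lambda m: my_dict.get(m.group(), m.group()), my_text)
print(result)  # The dog is dog dog dog...
```

Like the split version, this does not honour any skip words ("The Dog" still becomes "The dog"), and it is unlikely to help with text that has no separators between words.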

edited Oct 25 '21 at 11:11

answered Oct 25 '21 at 10:51

mozway

Thank you, but how do I avoid replacing text that is found in `skip_words`? – OAO Oct 25 '21 at 11:01
Will the exception always be when `The ` is preceding? – mozway Oct 25 '21 at 11:02
I think so..? In fact, the characters in the text are all Chinese – OAO Oct 25 '21 at 11:08
@OAO check my update – mozway Oct 25 '21 at 11:11
Oh, sorry. It is not always preceded by `The `. But thanks – OAO Oct 25 '21 at 11:21

score 0 · Answer 3 · answered Oct 25 '21 at 11:06

You need to find, or define for yourself, a pattern in the text that tells the two cases apart.

For example, if you want to replace every "Dog" except those that come right after a "The", you can do something like this:

In [1]: import re
In [2]: re.sub(r"(?<!\bThe\W)\bDog", "dog", text)
Out[2]: 'The Dog is dog dog dog...'

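This is called a negative lookbehind: `(?<!...)` lets the match succeed only when the text just before it does not match the enclosed pattern.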
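The same kind of assertion can be sketched on the data from the question. Here the patterns are written by hand for the two skip words shown there: a negative lookahead `(?!...)` covers "HelloWorld", where the extra text follows the key, and a negative lookbehind covers "foobar", where it precedes the key. This is only an illustration under those assumptions, not a general solution; Python's `re` module also requires lookbehind patterns to have a fixed width.

```python
import re

my_text = "HelloWorld foobar Hello bar..."

# Negative lookahead: match "Hello" only when "World" does not follow it.
step1 = re.sub(r"Hello(?!World)", "Hi", my_text)
# Negative lookbehind: match "bar" only when "foo" does not precede it.
result = re.sub(r"(?<!foo)bar", "rab", step1)
print(result)  # HelloWorld foobar Hi rab...
```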
