87

Suppose we have:

d = {
    'Спорт':'Досуг',
    'russianA':'englishA'
}

s = 'Спорт russianA'

How can I replace each appearance within s of any of d's keys, with the corresponding value (in this case, the result would be 'Досуг englishA')?

Karl Knechtel
meder omuraliev
  • This might not be so straightforward. You should probably have an explicit tokenizer (for example `{'cat': 'russiancat'}` and "caterpillar"). Also overlapping words (`{'car':'russiancar', 'pet' : 'russianpet'}` and 'carpet'). – Joe Mar 08 '10 at 10:15
  • Also see http://code.activestate.com/recipes/81330-single-pass-multiple-replace/ – ChristopheD Mar 08 '10 at 13:12

8 Answers

111

Using re:

import re

s = 'Спорт not russianA'
d = {
    'Спорт':'Досуг',
    'russianA':'englishA'
}
keys = (re.escape(k) for k in d.keys())
pattern = re.compile(r'\b(' + '|'.join(keys) + r')\b')
result = pattern.sub(lambda x: d[x.group()], s)
# Output: 'Досуг not englishA'

This will match whole words only. If you don't need that, use the pattern:

pattern = re.compile('|'.join(re.escape(k) for k in d.keys()))

Note that in this case you should sort the words descending by length if some of your dictionary entries are substrings of others.
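
A minimal sketch of that sorting step, assuming a hypothetical dictionary in which one key is a prefix of another (`cat` and `caterpillar` are illustrative only). Regex alternation takes the first alternative that matches at a position, so the longer key has to be listed first:

```python
import re

# Hypothetical dictionary: 'cat' is a prefix of 'caterpillar'.
d = {'cat': 'feline', 'caterpillar': 'larva'}
s = 'a cat and a caterpillar'

# Longest keys first, so 'caterpillar' is tried before 'cat'.
keys = sorted(d, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(k) for k in keys))
result = pattern.sub(lambda m: d[m.group()], s)
print(result)  # a feline and a larva
```

Without the sort, `cat` would win at the start of `caterpillar` and leave `erpillar` behind.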

hobs
Max Shawabkeh
  • In case the dictionary keys contain characters like "^", "$" and "/", the keys need to be escaped before the regular expression is assembled. To do this, `.join(d.keys())` could be replaced by `.join(re.escape(key) for key in d.keys())`. – jochen Nov 15 '12 at 18:05
  • Please note that the first example (`Досуг not englishA`) only works in Python 3. In Python 2 it still returns "Спорт not englishA". – 林果皞 Dec 30 '14 at 10:56
  • It seems to fail when a word in the dict contains a dot - `https://regex101.com/r/bliVUS/1` - I need to remove the `\b` at the end, but I'm not sure that's correct. – Peter.k Mar 14 '19 at 14:33
  • The fastest way to do substitutions, by a factor of 2. It is also the most correct way to do it. – Galigator Jul 12 '23 at 16:52
25

You could use the reduce function:

from functools import reduce  # reduce is a builtin only in Python 2

reduce(lambda x, y: x.replace(y, d[y]), d, s)  # d and s as in the question
MvG
codeape
  • Unlike the solution by @Max Shawabkeh, using `reduce` applies the substitutions one after another. As a consequence, swapping words using a dictionary such as `{ 'red': 'green', 'green': 'red'}` does not work with the `reduce`-based approach, and overlapping matches are transformed in an unpredictable way. – jochen Nov 15 '12 at 18:10
  • A good example of why repeated `.replace()` calls may have unintended consequences: `html.replace('"', '&quot;').replace('&', '&amp;')`. Try it on `html = '"foo"'`. – Mattie Jun 26 '13 at 13:07
  • This is unnecessarily complex and unreadable compared to the unfolded loop as shown in answers by [ChristopheD](https://stackoverflow.com/a/2401481/216074), or [user2769207](https://stackoverflow.com/a/18748467/216074). – poke Aug 07 '17 at 11:50
21

Solution found here (I like its simplicity):

def multipleReplace(text, wordDict):
    for key in wordDict:
        text = text.replace(key, wordDict[key])
    return text
ChristopheD
  • Again, as @jochen described, this risks a bad translation if there is a key that is also a value. A single-pass replacement would be best. – Chris Feb 17 '13 at 16:03
5

one way, without re

d = {
    'Спорт':'Досуг',
    'russianA':'englishA'
}

s = 'Спорт russianA'.split()
for n, i in enumerate(s):
    if i in d:
        s[n] = d[i]
print(' '.join(s))  # print is a function in Python 3
ghostdog74
3

Almost the same as ghostdog74's answer, though independently created. One difference: using d.get() instead of d[] handles items that are not in the dict.

>>> d = {'a':'b', 'c':'d'}
>>> s = "a c x"
>>> foo = s.split()
>>> ret = []
>>> for item in foo:
...   ret.append(d.get(item,item)) # Try to get from dict, otherwise keep value
... 
>>> " ".join(ret)
'b d x'
extraneon
1

With the caveat that it fails if a key contains a space, this is a condensed solution similar to ghostdog74's and extraneon's answers:

d = {
    'Спорт':'Досуг',
    'russianA':'englishA'
}

s = 'Спорт russianA'

' '.join(d.get(i,i) for i in s.split())
Anton vBR
0

I used this in a similar situation (my string was all in uppercase):

def translate(string, wdict):
    for key in wdict:
        string = string.replace(key, wdict[key].lower())
    return string.upper()

hope that helps in some way... :)

0

Using regex

We can build a regular expression that matches any of the lookup dictionary's keys, by creating regexes to match each individual key and combine them with |. We use re.sub to do the substitution, by giving it a function to do the replacement (this function, of course, will do the dict lookup). Putting it together:

import re

# assuming global `d` and `s` as in the question

# a function that does the dict lookup with the global `d`.
def lookup(match):
    return d[match.group()]

# Make the regex.
joined = '|'.join(re.escape(key) for key in d.keys())
pattern = re.compile(joined)

result = pattern.sub(lookup, s)

Here, re.escape is used to escape any characters in the keys that have special meaning in a regex (so that they don't interfere with building the regex, and are matched literally).

This regex pattern will match the substrings anywhere they appear, even if they are part of a word or span across multiple words. To avoid this, modify the regex so that it checks for word boundaries:

# pattern = re.compile(joined)
pattern = re.compile(rf'\b({joined})\b')
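
One caveat with `\b`, raised in a comment on the regex answer above: it only matches between a word character and a non-word character, so a key that starts or ends with punctuation will not match. A possible workaround, sketched here with a hypothetical dictionary, is to replace `\b` with lookarounds that only require that no word character is adjacent to the match:

```python
import re

# Hypothetical dictionary whose keys end in a non-word character.
d = {'etc.': 'and so on', 'e.g.': 'for example'}
s = 'fruit, e.g. apples, pears, etc.'

keys = sorted(d, key=len, reverse=True)
joined = '|'.join(re.escape(k) for k in keys)
# (?<!\w) and (?!\w) mean "no word character directly before/after",
# which also holds when the edge of the key is punctuation.
pattern = re.compile(rf'(?<!\w)(?:{joined})(?!\w)')
result = pattern.sub(lambda m: d[m.group()], s)
print(result)  # fruit, for example apples, pears, and so on
```

Whether this is the right notion of "whole word" depends on the data; it is one option, not the only one.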

Using str.replace iteratively

Simply iterate over the .items() of the lookup dictionary, and call .replace with each. Since this method returns a new string, and does not (cannot) modify the string in place, we must reassign the results inside the loop:

for to_replace, replacement in d.items():
    s = s.replace(to_replace, replacement)

This approach is simple to write and easy to understand, but it comes with multiple caveats.

First, it has the disadvantage that it works sequentially, in a specific order. That is, each replacement has the potential to interfere with other replacements. Consider:

s = 'one two'
s = s.replace('one', 'two')
s = s.replace('two', 'three')

This will produce 'three three', not 'two three', because the 'two' from the first replacement will itself be replaced in the second step. This is normally not desirable; however, in the rare case when it should work this way, this approach is the only practical one.
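
The difference is easiest to see when two keys swap places. A small sketch, using the `red`/`green` dictionary from the comments above and assuming Python 3.7+ (where dicts iterate in insertion order):

```python
import re

# Swapping two words: each key is also the other key's value.
d = {'red': 'green', 'green': 'red'}
s = 'red green'

# Sequential str.replace: the second step rewrites the first step's output.
sequential = s
for to_replace, replacement in d.items():
    sequential = sequential.replace(to_replace, replacement)

# Single regex pass: every match is looked up against the original text.
pattern = re.compile('|'.join(re.escape(k) for k in d))
single_pass = pattern.sub(lambda m: d[m.group()], s)

print(sequential)   # red red
print(single_pass)  # green red
```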

This approach also cannot easily be fixed to respect word boundaries, because it must match literal text, and a "word boundary" can be marked in multiple different ways - by varying kinds of whitespace, but also without text at the beginning and end of the string.

Finally, keep in mind that a dict is not an ideal data structure for this approach. If we iterate over the dict, its ability to do key lookup goes unused; and in Python 3.5 and below, the order of dicts is not guaranteed (making the sequential replacement problem worse). Instead, it would be better to specify a list of tuples for the replacements:

d = [('Спорт', 'Досуг'), ('russianA', 'englishA')]
s = 'Спорт russianA'

for to_replace, replacement in d: # no more `.items()` call
    s = s.replace(to_replace, replacement)

By tokenization

The problem becomes much simpler if the string is first cut into pieces (tokenized), in such a way that anything that should be replaced is now an exact match for a dict key. That would allow for using the dict's lookup directly, and processing the entire string in one go, while also not building a custom regex.

Suppose that we want to match complete words. We can use a simpler, hard-coded regex that will match whitespace, and which uses a capturing group; by passing this to re.split, we split the string into whitespace and non-whitespace sections. Thus:

import re

tokenizer = re.compile('([ \t\n]+)')
tokenized = tokenizer.split(s)

Now we look up each of the tokens in the dictionary: if present, it should be replaced with the corresponding value, and otherwise it should be left alone (equivalent to replacing it with itself). The dictionary .get method is a natural fit for this task. Finally, we join the pieces back up. Thus:

s = ''.join(d.get(token, token) for token in tokenized)
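
Put together as a self-contained sketch (the input string, with mixed whitespace, is made up for illustration):

```python
import re

d = {'Спорт': 'Досуг', 'russianA': 'englishA'}
s = 'Спорт  not\trussianA\n'  # hypothetical input with mixed whitespace

# The capturing group makes re.split keep the whitespace runs as tokens,
# so joining the tokens back restores the original spacing exactly.
tokenizer = re.compile('([ \t\n]+)')
tokens = tokenizer.split(s)
result = ''.join(d.get(token, token) for token in tokens)
print(repr(result))  # 'Досуг  not\tenglishA\n'
```

Unlike the `s.split()` / `' '.join(...)` answers above, this keeps tabs, newlines and repeated spaces intact.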

More generally, for example if the strings to replace could have spaces in them, a different tokenization rule will be needed. However, it will usually be possible to come up with a tokenization rule that is simpler than the regex from the first section (that matches all the keys by brute force).

Special case: replacing single characters

If the keys of the dict are all one character (technically, Unicode code point) each, there are more specific techniques that can be used. See Best way to replace multiple characters in a string? for details.
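
For example, `str.translate` with a table built by `str.maketrans` is one such technique. A minimal sketch, assuming a hypothetical single-character mapping:

```python
# Hypothetical mapping: keys are single characters, values may be longer.
d = {'&': '&amp;', '<': '&lt;', '>': '&gt;'}
s = '<a & b>'

table = str.maketrans(d)     # accepts a dict with single-character keys
result = s.translate(table)  # one pass; inserted text is not re-scanned
print(result)  # &lt;a &amp; b&gt;
```

Because the translation happens in a single pass, the `&` characters introduced by the replacements are not themselves replaced.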

Karl Knechtel