7

People have addressed before how to make multiple substitutions in a string based on a dictionary (see, for example, existing questions on this topic). There seems to be a group of options based on string.replace and a group of options based on regular expressions, plus a couple of other approaches.

But I am interested in how efficient the different methods are depending on the size of the dictionary, which I found has a major impact.

my_subs = {'Hello world': 'apple', 'is': 'ship'}
string = 'Hello world! This is nice.'
new_string = my_efficient_method(string, my_subs)

To be clear, this question is not about how to make these substitutions, but about which method is more efficient in which conditions, and which caveats apply. In particular, I am looking for the most practical approach when there are many (>100k) strings that are long (10-20k characters) and the dictionary is huge (>80k pairs). Under these circumstances the preferred methods based on regular expressions perform very poorly.

Pablo
  • What should the answer do if the subs overlap? For example `"abc", {"ab":"x","bc":"y"}`? What should the answer do if a replacement's output may be another replacement's input? For example `"a", {"a":"b","b":"a"}`? – tsh Apr 12 '22 at 05:33
  • You gave the size of the replacement dictionaries, but not of the test string. The fastest algorithm may vary depending on the size of the input string. – tsh Apr 12 '22 at 05:37
  • Good points. The length of the documents varies, but in my case they were on the order of magnitude of "a couple of pages". I don't know why I didn't include this info here. Regarding the other question: in my particular example those situations would not really happen (when thinking in "whole words"); I guess it would be fair to point out different best algorithms depending on your specific needs in that regard. – Pablo Apr 13 '22 at 06:38
  • When thinking in "whole words", you just need to change `"abc", {"ab":"x","bc":"y"}` into `"there is a", {"there is":"there was","is a":"isn't a"}`. Otherwise, a simple `split -> lookup -> join` approach would just work, with much better performance. – tsh Apr 13 '22 at 07:15

2 Answers

10

As stated before, there are different approaches, each with different advantages. I am using three different situations for comparison.

  1. Short dictionary (847 substitution pairs)
  2. Medium dictionary (2528 pairs)
  3. Long dictionary (80430 pairs)

For dictionaries 1 and 2 (the shorter ones) I repeat each method 50 times in a loop, to get more consistent timings. With the longer one, a single pass over one document already takes long enough (sadly). I tested 1 and 2 using the online service tio with Python 3.8; the long one was tested on my laptop with Python 3.6. Only the relative performance between methods is relevant, so these minor differences are not important.

My string is between 28k and 29k characters.

All times given in seconds.
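For reference, a minimal sketch of the kind of timing loop I mean (the harness below is my own illustration, with my_efficient_method standing in for whichever method is being measured):

import time

start = time.perf_counter()
for _ in range(50):                        # 50 repetitions for the two shorter dictionaries
    new_string = my_efficient_method(string, my_subs)
elapsed = time.perf_counter() - start      # total time in seconds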


UPDATE: Flashtext

A colleague found Flashtext, a Python library that specializes in precisely this. It allows searching for queries as well as applying substitutions, and it is about two orders of magnitude faster than the other alternatives. In experiment 3 my best time with the other methods was 1.8 seconds; Flashtext takes 0.015 seconds.
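A minimal sketch of how this can look with Flashtext's KeywordProcessor (building the processor from my_dict is my own illustration; check the Flashtext documentation for the exact API):

from flashtext import KeywordProcessor

processor = KeywordProcessor()
for original, replacement in my_dict.items():
    processor.add_keyword(original, replacement)  # occurrences of `original` become `replacement`

new_string = processor.replace_keywords(string)

Note that Flashtext matches keywords on word boundaries, which fits the "whole words" use case discussed below.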


Regular Expressions

There are many variations, but the best tend to be very similar to this:

import re
rep = dict((re.escape(k), v) for k, v in my_dict.items())
pattern = re.compile("|".join(rep.keys()))
new_string = pattern.sub(lambda m: rep[re.escape(m.group(0))], string)

Execution times were:

  1. 1.63
  2. 5.03
  3. 7.7


Replace

This method simply applies string.replace in a loop. (Later I talk about problems with this.)

for original, replacement in my_dict.items():
    string = string.replace(original, replacement)

This solution proposes a variation using reduce, which applies a lambda expression iteratively. This is best understood with an example from the official documentation. The expression

reduce(lambda x, y: x+y, [1, 2, 3, 4, 5])

equals ((((1+2)+3)+4)+5)

import functools
new_string = functools.reduce(lambda a, k: a.replace(*k), 
                              my_dict.items(), string)

Python 3.8 allows assignment expressions, as in this method. At its core this also relies on string.replace.

[string := string.replace(f' {a} ', f' {b} ') for a, b in my_dict.items()]

Execution times were (in parenthesis results for reduce and assignment expressions variants):

  1. 1.37 (1.39) (1.50)
  2. 4.10 (4.12) (4.07)
  3. 1.9 (1.8) (no Python 3.8 in machine)


Recursive Lambda

This proposal involves using a recursive lambda. Note that popitem() removes entries from the dictionary as it goes, so the call empties my_dict in place.

mrep = lambda s, d: s if not d else mrep(s.replace(*d.popitem()), d)
new_string = mrep(string, my_dict)

Execution times were:

  1. 0.07
  2. RecursionError
  3. RecursionError
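As tobias_k points out in the comments below, the same idea can be written without recursion (and therefore without the RecursionError) as a plain loop; a sketch:

d = dict(my_dict)  # popitem() empties the dictionary, so work on a copy
new_string = string
while d:
    new_string = new_string.replace(*d.popitem())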


Practical remarks

See the update above: Flashtext is much faster than the other alternatives.

You can see from the execution times that the recursive approach is clearly the fastest, but it only works with small dictionaries. Increasing the recursion depth much is not recommended in Python, so this approach is ruled out for longer dictionaries.

Regular expressions offer more control over your substitutions. For example, you may use \b before or after an element to ensure that there are no word characters on that side of the target substring (to prevent {'a': '1'} from being applied inside 'apple'). The cost is that performance drops sharply for longer dictionaries, taking almost four times as long as the other options.
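For example, a word-boundary variant of the pattern above could look like the following sketch (it assumes every key starts and ends with a word character, otherwise \b will not match where you expect):

import re
rep = dict((re.escape(k), v) for k, v in my_dict.items())
pattern = re.compile(r"\b(" + "|".join(rep.keys()) + r")\b")  # \b anchors both ends at word boundaries
new_string = pattern.sub(lambda m: rep[re.escape(m.group(0))], string)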

Assignment expressions, reduce and simply looping over replace offer similar performance (assignment expressions could not be tested with the longer dictionary). Taking readability into account, string.replace in a loop seems like the best option. The problem with this, compared to regular expressions, is that substitutions happen sequentially, not in a single pass, so {'a': 'b', 'b': 'c'} returns 'c' for the string 'a'. Dictionaries preserve insertion order in Python since 3.7 (but you may want to keep using OrderedDict), so you can order the substitutions carefully to avoid problems. Of course, with 80k substitutions you cannot rely on this.

I am currently using a loop with replace, and doing some preprocessing to minimize trouble. I add spaces on both sides of punctuation (also in the dictionary, for items containing punctuation). Then I can search for substrings surrounded by spaces and insert substitutions with spaces as well. This also works when your targets are multiple words:

string = 'This is: an island'
my_dict = {'is': 'is not', 'an island': 'a museum'}

After this preprocessing (done with replace and regular expressions) I get string = ' This is : an island ', so that my replace loop

for original, replacement in my_dict.items():
    string = string.replace(f' {original} ', f' {replacement} ')

returns ' This is not : a museum ' as intended. Note that the 'is' in 'This' and 'island' was left alone. Regular expressions could be used to put the punctuation back afterwards, although I don't require that step.
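The punctuation padding itself can be done in several ways; here is a minimal sketch with a regular expression (the exact set of characters to pad is up to you):

import re
string = re.sub(r'([^\w\s])', r' \1 ', string)  # put spaces around every punctuation character
string = f' {string} '                          # pad the ends so the first and last words can also match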

Pablo
  • (a) your `reduce`, `replace`, and `assignment expression` strategies basically do the same thing, so they all suffer from the same problems as `replace` in a loop, (b) interesting how the large dict is faster, whereas I'd expect it to be slower; I guess there are many long strings that get replaced by short strings, so the string gets much shorter early on? – tobias_k Nov 27 '19 at 15:41
  • 1
    You are right about reduce and replace, they are basically the same thing. Perhaps I was not clear enough: with the shorter dictionaries I timed the repetition of the method 50 times! With the long one it is a single run, so much, much slower. Also note the edit with the SUPER FAST Flashtext library! – Pablo Nov 27 '19 at 15:46
  • Yes, I missed that. Might add that directly to the "results". Also, I'd suggest removing all the `replace` variants except for the basic loop, as all those are just variations of the same theme (that are harder to read, or have other problems). – tobias_k Nov 27 '19 at 15:50
  • 1
    Not sure why the "recursive lambda" is so fast, but you could do the same in a more readable way and without recursion problems as `while d: s = s.replace(*d.popitem())` – tobias_k Nov 27 '19 at 15:52
  • 1
    BTW, just wanted to suggest using a Trie, but that seems to be exactly what Flashtext is doing. – tobias_k Nov 27 '19 at 15:59
  • Perhaps I could not remove replace variants altogether but mention them in the same section. I wanted to mention them because I found them in other places and it was not immediately obvious to me that they would work exactly in the same way behind the scenes. – Pablo Nov 27 '19 at 16:45
  • I have updated the answer using your input, thanks for helping :) – Pablo Nov 27 '19 at 16:55
1

By compiling your replacement table into a finite state machine (FSM), you can finish the replacement in a single character-by-character scan. Using a pre-compiled FSM also boosts performance if you apply the same replacement table to many different strings.

Compared to other methods:

  • The FSM is similar to a trie: replacement values for non-leaf nodes are pre-calculated as well, so you never need to backtrack through the trie, which saves some time.
  • The FSM is similar to a pre-compiled regex, since a regex is just a friendly notation for an FSM.
  • The FSM is also similar to the replace method: searching for a substring with the KMP algorithm is effectively an FSM; here we simply build multiple FSMs into a single one.

The time complexity of compile is O(mk²), where m is the number of entries in the replacement table and k is the length of a single replacement rule. The time complexity of replace is O(n + n'), where n is the length of the input and n' is the length of the output.

Just to clarify: in the code below, when multiple rules match the same characters, the one that begins earliest is applied. In case of a tie, the longest one is applied.

import typing

mismatch = ''

ConvertTree = typing.Dict[str, typing.Tuple[str, 'ConvertTree']]
def compile(rulePairs:typing.Iterable[typing.Tuple[str, str]]) -> ConvertTree:
    # Build Trie
    root: ConvertTree = {}
    for src, dst in rulePairs:
        current = root
        for ch in src:
            if ch not in current:
                node: ConvertTree = {}
                node[mismatch] = None
                current[ch] = ('', node)
            current = current[ch][1]
        if not current.get(mismatch, None):
            current[mismatch] = (dst, root)
        else:
            old_dst = current[mismatch][0]
            if dst != old_dst:
                raise KeyError(f"Found conflict rules:\n  * {src} -> {old_dst}\n  * {src} -> {dst}")

    # Fill non-leaf nodes
    node_with_suffix: typing.List[typing.Tuple[ConvertTree, str, str]] = [(root, '', '')]
    for node, output, suffix in node_with_suffix:
        node_output, node_suffix = output, suffix
        if node.get(mismatch, None):
            leaf = node[mismatch]
            node_output, node_suffix = leaf[0], ''
        for key in node:
            if key == mismatch: continue
            val = node[key]
            next_node = val[1]
            if next_node is root: continue
            if len(next_node) == 1:
                node[key] = next_node[mismatch]
            else:
                node_with_suffix.append((next_node, node_output, node_suffix + key))
        if not node_suffix: continue
        node_next = root
        for ch in node_suffix:
            while node_output != '' and ch not in node_next and node_next is not root:
                ref_output, node_next = node_next[mismatch]
                node_output += ref_output
            if not node_output:
                node_output += ch
            elif ch in node_next:
                ref_output, node_next = node_next[ch]
                node_output += ref_output
                break
            elif node_next is root:
                node_output += ch
                break
        node[mismatch] = (node_output, node_next)

    return root

def replace(input_text: str, root: ConvertTree) -> str:
    current = root
    output_arr = []
    for ch in input_text:
        while True:
            try:
                output, current = current[ch]
                output_arr.append(output)
            except KeyError:  # no transition for this character from the current state
                if current is root:
                    output_arr.append(ch)
                else:
                    output, current = current[mismatch]
                    output_arr.append(output)
                    continue
            break
    while current is not root:
        output, current = current['']
        output_arr.append(output)
    return ''.join(output_arr)

my_subs = [('Hello world', 'apple'), ('is', 'ship')]
string = 'Hello world! This is nice.'
new_string = replace(string, compile(my_subs))
print(new_string) # "apple! Thship ship nice."

Please leave a comment if you find any bugs in my code. I will try my best to fix them.


I used the same algorithm in my novel reader, which uses string replacements to convert between Simplified and Traditional Chinese variants. The related Python code can also be found in my GitHub repo (Chinese).

tsh