What is the most efficient way to replace a list of words without touching html attributes?

Question

I absolutely disagree that this question is a duplicate! I am asking for an efficiency way to replace hundreds of words at once. This is an algorithm question! All the provided links are about to replace one word. Should I repeat that expensive operation hundreds of times? I'm sure that there are better ways as a suffix tree where I sort out html while building that tree. I removed that regex tag since for no good reason you are focusing on that part.

I want to translate a given set of words (more then 100) and translate them. My first idea was to use a simple regular expression that works better then expected. As sample:

const input = "I like strawberry cheese cake with apple juice"
const en2de = {
    apple: "Apfel",
    strawberry: "Erdbeere",
    banana: "Banane",
    /* ... */}
input.replace(new RegExp(Object.keys(en2de).join("|"), "gi"), match => en2de[match.toLowerCase()])

This works fine on the first view. However it become strange if you words which contains each other like "pineapple" that would return "pineApfel" which is totally nonsense. So I was thinking about checking word boundaries and similar things. While playing around I created this test case:

<a href="https://www.apple.com">Apple</a> is a company

That created the output:

<a href="https://www.Apfel.com">Apfel</a> is a company.

The translation is wrong, which is somehow tolerable, but the link is broken. That must not happen.

So I was thinking about extend the regex to check if there is a quote before. I know well that html parsing with regex is a bad idea, but I thought that this should work anyway. In the end I gave up and was looking for solutions of other devs and found on Stack Overflow a couple of questions, all without answers, so it seems to be a hard problem (or my search skills are bad).

So I went two steps back and was thinking to implement that myself with a parser or something like that. However since I have multiple inputs and I need to ignore the case I was thinking what the best way is.

Right now I think to build a dictionary with pointers to the position of the words. I would store the dict in lower case so that should be fast, I could also skip all words with the wrong prefix etc to get my matches. In the end I would replace the words from the end to the beginning to avoid breaking the indices. But is that way efficiency? Is there a better way to achieve that?

While my sample is in JavaScript the solution must not be in JS as long the solution doesn't include dozens of dependencies which cannot be translated easy to JS.

TL;DR:

I want to replace multiple words by other words in a case insensitive way without breaking html.

"I know well that html parsing with regex is a bad idea" -- yep. "it seems to be a hard problem" -- yep. "implement that myself with a parser" -- why not use a parser already written? "the solution must not be in JS" -- what language should it be in, then? Anything but JS? — ggorlen, Jun 09 '20 at 19:34
See [1](https://stackoverflow.com/questions/61198344/replace-all-instances-of-multiple-words-on-the-entire-page/61198587#61198587) and [2](https://stackoverflow.com/a/59582082/6243352) which show how to do these replacements in JS ignoring HTML tags. But these links should hopefully indicate that you may have an [xy problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Likely, there is a better way to achieve this, say, processing the text before it winds up in an HTML string. — ggorlen, Jun 09 '20 at 19:41
There is no better language than JS to work with html dom. Iterate over html nodes and run your code over `text` nodes only. But this wouldn't be bulletproof. `\b` could have a misleading name: _word boundary_ which is in terms of programming languages definition not a natural language. You should work around different type of occurrences like `mother-in-law` or `if I were a _mother_` as in the former you may match `mother` alone and in the latter `mother` wouldn't match `\bmother\b`. — revo, Jun 09 '20 at 19:42
@ggorlen I never tried to parse the html with regex, I was just checking boundaries. Both links are on the browser. I'm processing inputs which might contain html in node.js. Do you really think that there is a plain parser for my problem? I don't think so. I'm no fan of huge libraries for more or less simple tasks. — rekire, Jun 09 '20 at 19:50
@revo interesting point that I should just parse the html and work with the nodes. Is that efficiency? I don't care at all about the html and its logical attributes. I think that is an overhead — rekire, Jun 09 '20 at 19:53
Given an arbitrarily complex string, distinguishing content in HTML tags from content that isn't in HTML tags [isn't](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) a "more or less simple task". It's incredibly difficult (at least with the wrong tool, i.e. regex). If you're using node, there are plenty of great HTML parsing libraries like [cheerio](https://cheerio.js.org/) that are basically drop-ins for what the browser offers. Also, you should tag your question [tag:nodejs] if that's the environment you're working in. — ggorlen, Jun 09 '20 at 19:54
Start with it and see if it is _efficient_ as you want. As long as you are dealing with html content and you don't want to mess it up, this is the only way. — revo, Jun 09 '20 at 19:57
You got me wrong html parsing is complex. I know the pitfalls about html and regex, so I tried to keep that out of the question. That is why I want to avoid it. My original problem is to replace a string. So assumption that I might have an xy problem should be wrong. I intended to find out what is the most efficiency or in fact fastest executed solution. I think that parsing the html and building the dom takes a lot of time. I think that the multiple replacements are the bigger problem — rekire, Jun 09 '20 at 20:01
Efficiency is useless if you can't get correctness, and you won't get correctness with regex, nor efficiency, for that matter, unless your string replacement is extremely trivial which it doesn't sound like it is. The xy problem makes good sense here because mutable data is usually serialized to HTML as a final step and is only manipulated thereafter using the DOM. It's poor design to serialize HTML only to have to modify it again, unless there is absolutely no other choice, and that hasn't been explained. — ggorlen, Jun 09 '20 at 20:06
Well you asked for it: I'm trying to aggregate live data from multiple sources which I'm loading on demand. Since I'm too lazy to use a real dictionary I want to translate it into my (or the visitor's) language. For my very specific dictionary I have also pictures which I include as some kind of tooltip. Since I got live data which contains my "translated" words in urls the resulting html exploded. That is the story behind, but I think it doesn't matter. :-) Regex was just my first native try, but I'm not even thinking about to continue this path. — rekire, Jun 09 '20 at 20:14

score 1 · Answer 1 · answered Jun 11 '20 at 05:53

You may try a treeWalker and replace the text inplace.

To find words you may tokenize your text, lower case your words and map them.

const mapText = (dic, s) => {
  return s.replace(/[a-zA-Z-_]+/g, w => {
    return dic[w.toLowerCase()] || w
  })
}
const dic = {
  'grodzi': 'grodzila',
  'laaaa': 'forever',
}
const treeWalker = document.createTreeWalker(
  document.body,
  NodeFilter.SHOW_TEXT
)

// skip body node
let currentNode = treeWalker.nextNode()
while(currentNode) {
  const newS = mapText(dic, currentNode.data)
  currentNode.data = newS
  currentNode = treeWalker.nextNode()
}

p {background:#eeeeee;}

<p>
  <a href="grodzi">grodzi</a>
  <a href="laaaa">LAAAA</a>
</p>

The link stay untouched.

However mapping each word in an other language is bound to fail (be it missing representation of some word, humour/irony, or simply grammar construct). For this matter (which is a hard problem on its own) you may rely on some tools to translate data for you (neural networks, api(s), ...)

Thank you for pointing out the tree walker I will check this I never read before that api. I know that there are tones of grammar or plural pitfalls, but I think it should be fine for my usecase. — rekire, Jun 11 '20 at 09:06

score 0 · Answer 2 · answered Jun 11 '20 at 09:14

0

Here is my current work in progress solution of a suffix tree (or at least how I interpreted it). I'm building a dictionary with all words, which are not inside of a tag, with their position. After sorting the dict I replace them all. This works for me without handling html at all.

function suffixTree(input) {
    const dict = new Map()
    let start = 0
    let insideTag = false
    // define word borders
    const borders = ' .,<"\'(){}\r\n\t'.split('')
    // build dictionary
    for (let end = 0; end <= input.length; end++) {
        const c = input[end]
        if (c === '<') insideTag = true
        if (c === '>') {
            insideTag = false
            start = end + 1
            continue
        }
        if (insideTag && c !== '<') continue
        if (borders.indexOf(c) >= 0) {
            if(start !== end) {
                const word = input.substring(start, end).toLowerCase()
                const entry = dict.get(word) || []
                // save the word and its position in an array so when the word is
                // used multiple times that we can use this list
                entry.push([start, end])
                dict.set(word, entry)
            }
            start = end + 1
        }
    }
    // last word handling
    const word = input.substring(start).toLowerCase()
    const entry = dict.get(word) || []
    entry.push([start, input.length])
    dict.set(word, entry)

    // create a list of replace operations, we would break the
    // indices if we do that directly
    const words = Object.keys(en2de)
    const replacements = []
    words.forEach(word => {
        (dict.get(word) || []).forEach(match => {
            // [0] is start, [1] is end, [2] is the replacement
            replacements.push([match[0], match[1], en2de[word]])
        })
    })
    // converting the input to a char array and replacing the found ranges.
    // beginning from the end and replace the ranges with the replacement
    let output = [...input]
    replacements.sort((a, b) => b[0] - a[0])
    replacements.forEach(match => {
        output.splice(match[0], match[1] - match[0], match[2])
    })
    return output.join('')
}

Feel free to leave a comment how this can be improved.

answered Jun 11 '20 at 09:14

rekire

47,260
30
167
264

It fails on many occasions. You are implying a `<` means beginning of an html tag. How would you handle `if (1 <= 2) { ... }` then? – revo Jun 11 '20 at 09:42
the parsing problem apart, I find it odd to iterate over all words of the dictionary which is likely to be a superset of the words found in doc – grodzi Jun 11 '20 at 09:44
also I don't think storing word to indices to avoid multiple lookup (if multiple occurrence of same word) will prove to be more efficient than simply accessing the dictionnary every time (which is meant to be O(1)) – grodzi Jun 11 '20 at 09:48
You didn't like to use a dom parser to not sacrifice performance, now you have two problems. – revo Jun 11 '20 at 09:50
At first thank you guys for this review. I really didn't expect that. @revo good point, but that is out of my view not a valid (x)html I assume that it should be encoded as html entity. I'll play with another implementation to compare the timings. What was the second problem? I missed it somehow. – rekire Jun 11 '20 at 11:01
@grodzi I see right no problem with the list sizes. I need to find the overlapping parts. Can you explain that? Regarding the complexity of the Map with multiple "hits" I when I write a key a second time the first entry will be deleted, isn't it? If you mean that I should not store the hits and replace them directly, that would be interesting. I need to think about how to keep the scan index (in my code `end`) correctly. – rekire Jun 11 '20 at 11:02
yes @rekire I think you may avoid all the storage thing. If you really want to parse html yourself (well just skip the tags), you may process the input string, output the tag, concatenate your mapped words and so forth. No storage stuff just some consecutive push – grodzi Jun 11 '20 at 11:15
an example be like https://jsfiddle.net/p5z9kd8n/ (I don't really want to add that as an answer since trying to match properly tags is a pitfall of edits...) – grodzi Jun 11 '20 at 12:09
1

Interesting solution. I will add it to my test suite for comparison. I'll will publish the results when I implemented all the variations mentioned before. – rekire Jun 11 '20 at 17:18

What is the most efficient way to replace a list of words without touching html attributes?

2 Answers2