Highlight a word of text on the page using .replace()

Question

I'm developing a Google Chrome extension that allows you to automatically apply a highlighting CSS rule to a word that you choose.

I have the following code

var elements = document.getElementsByTagName('*');

for (var i=0; i<elements.length; i++) {
    var element = elements[i];

    for (var j=0; j<element.childNodes.length; j++) {
        var node = element.childNodes[j];

        if(node.nodeType === 3) {
            var text = node.nodeValue;

            var fetchedText = text.match(/teste/gi);

            if(fetchedText) {
                var replacedText = element.innerHTML.replace(/(teste)/gi, "<span style=\"background-color: yellow\">$1</span>");

                if (replacedText !== text) {
                    element.innerHTML = replacedText;
                }
            }
        }
    }
}

Which breaks and freezes my Chrome tab. However, if I switch from element.innerHTML = replacedText; to element.innerHTML = "text"; this works.

I can't seem to find what's wrong with the following code.

@ScottMarcus when I log `replacedText` it shows the correct value which is, for instance, `teste`. However if I use this on `element.innerHTML` it crashes my tab. — rafaelcpalmeida, Nov 20 '16 at 23:59
Are you sure that the `\"` escape sequence is being shown in the console? Have you tried modifying the string to be: `"$1"`? — Scott Marcus, Nov 21 '16 at 00:02
@ScottMarcus Here's an example of whats being logged: `Teste de velocidade MEO`. I've changed from `"` to `'` and it still crashes the tab. — rafaelcpalmeida, Nov 21 '16 at 00:06
It seems that when you replace the content with your string, that string includes text that needs to be replaced. And, since that text is a child of the current node, it hasn't been processed by your loop yet. So, your loop then finds the new element, which needs to be processed, essentially creating an infinite loop. — Scott Marcus, Nov 21 '16 at 03:05

Makyen · Answer 1 · 2017-06-18T17:10:39.887

You are first testing the #text nodes to see if the text contains the word you are trying to highlight, but then performing the replacement on the .innerHTML of the parent element. There are a couple of problems with this.

Infinite replacements: When you modify the .innerHTML of the parent element you change the childNodes array. You do so in a way that has added a node further in the array containing the text which is to be replaced. Thus, when you continue scanning the childNodes array you always find a (new) node that contains the text you want to replace. So, you replace it again, creating another node that has a higher index in the childNodes array. Repeat infinitely.
Using a RegExp to replace text in the .innerHTML property. While you have already tested to make sure the text you desire to replace is actually contained in a text node, this does not prevent your RegExp from also replacing any matching words within the actual HTML of the element (e.g. in src="yourWord", href="http://foo.com/yourWord/bar.html", or if attempting to highlight words like style, color, background, span, id, height, width, button, form, input, etc.).
You are not checking to make sure you are not changing text in <script> or <style> tags.
You are checking that you only make changed in text nodes (i.e. you check for node.nodeType === 3). If you weren't checking for this you would also have the following possible problems due to using .innerHTML to change HTML:
- You could end up changing attributes, or actual HTML tags, depending on what you are changing with .replace(). This could completely disrupt the page layout and functionality.
- When you change .innerHTML the DOM for that portion of the page is completely recreated. This means the elements, while new elements might be the same type with the same attributes, any event listeners which were attached to the old elements will not be attached to the new elements. This can significantly disrupt the functionality of a page.
- Repeatedly changing large portions of the DOM can be quite compute intensive to re-render the page. Depending on how you do this, you may run into significant user-perceived performance issues.

Thus, if you are going to use a RegExp to replace the text, you need to perform the operation only on the contents of the #text node, not on the .innerHTML of the parent node. Because you are wanting to create additional HTML elements (e.g. new <span style=""> elements, with child #text nodes), there are some complications.

Can not assign HTML text to a text node to create new HTML nodes:

There is no way to assign new HTML directly to a text node and have it evaluated as HTML, creating new nodes. Assigning to a text node's .innerHTML property will create such a property on the Object (just like it would on any Object), but will not change the text displayed on the screen (i.e. the actual value of the #text node). Thus, it will not accomplish what you are wanting to do: it will not create any new HTML children of the parent node.

The way to do this that has the least impact on the page's DOM (i.e. least likely to break existing JavaScript on the page) is to create a <span> to include the new text nodes you are creating (the text that was in the #text node that is not in your colored <span>) along with the potentially multiple <span> elements you are creating. This will result in replacing a single #text node with a single <span> element. While this will create additional descendants, it will leave the number of children in the parent element unchanged. Thus, any JavaScript which was relying on that will not be affected. Given that we are changing the DOM, there is no way to not potentially break other JavaScript, but this should minimize that possibility.

Some examples of how you can do this: See this answer (replaces a list of words with those words in buttons) and this answer (places all text in <p> elements which is separated by spaces into buttons) for full extensions that perform regex replace with new HTML. See this answer which does basically the same thing, but makes a link (it has a different implementation which traverses the DOM with a TreeWalker to find #text nodes instead of a NodeIterator as used in the other two examples).

Here is code which will perform the replacement which you are desiring on each text node in the document.body and create the new HTML needed to have the style be different in a portion of the text:

function handleTextNode(textNode) {
    if(textNode.nodeName !== '#text'
        || textNode.parentNode.nodeName === 'SCRIPT' 
        || textNode.parentNode.nodeName === 'STYLE'
    ) {
        //Don't do anything except on text nodes, which are not children 
        //  of <script> or <style>.
        return;
    }
    let origText = textNode.textContent;
    let newHtml=origText.replace(/(teste)/gi
                                 ,'<span style="background-color: yellow">$1</span>');
    //Only change the DOM if we actually made a replacement in the text.
    //Compare the strings, as it should be faster than a second RegExp operation and
    //  lets us use the RegExp in only one place for maintainability.
    if( newHtml !== origText) {
        let newSpan = document.createElement('span');
        newSpan.innerHTML = newHtml;
        textNode.parentNode.replaceChild(newSpan,textNode);
    }
}

let textNodes = [];
//Create a NodeIterator to get the text nodes in the body of the document
let nodeIter = document.createNodeIterator(document.body,NodeFilter.SHOW_TEXT);
let currentNode;
//Add the text nodes found to the list of text nodes to process.
while(currentNode = nodeIter.nextNode()) {
    textNodes.push(currentNode);
}
//Process each text node
textNodes.forEach(function(el){
    handleTextNode(el);
});

There are other ways to do this. However, they will generate more significant changes to the structure of the children for that specific element (e.g. multiple additional nodes on the parent). Doing so has a higher potential of breaking any JavaScript already on the page which is relying on the current structure of the page. Actually, any change like this has the potential to break current JavaScript.

^{The code for in this answer was modified from the code in this other answer of mine}

Actually I haven't received that type of error because I was modifying the element that contained that text node. It worked fine if I replaced the content I wanted for anything else that the word I was looking for — rafaelcpalmeida, Nov 23 '16 at 11:14
@rafaelcpalmeida, Yep, my error in describing one portion of the problem (made assumptions, as usual a bad thing to do). I have updated the answer with a corrected description of the problems (doesn't change the solution). — Makyen, Nov 23 '16 at 13:03

score 0 · Accepted Answer · answered Nov 23 '16 at 11:19

The error I was experiencing was due to a recursive loop because, for instance, I was looking for the keyword teste and I was inserting a new element with the content <span style=\"background-color: #ffff00\">teste</span> which would force the script to try to replace the new keyword teste again and so on.

I came up with this function:

function applyReplacementRule(node) {
    // Ignore any node whose tag is banned
    if (!node || $.inArray(node.tagName, hwBannedTags) !== -1) { return; }

    try {
        $(node).contents().each(function (i, v) {
            // Ignore any child node that has been replaced already or doesn't contain text
            if (v.isReplaced || v.nodeType !== Node.TEXT_NODE) { return; }

            // Apply each replacement in order
            hwReplacements.then(function (replacements) {
                replacements.words.forEach(function (replacement) {
                    //if( !replacement.active ) return;
                    var matchedText = v.textContent.match(new RegExp(replacement, "i"));

                    if (matchedText) {
                        // Use `` instead of '' or "" if you want to use ${variable} inside a string
                        // For more information visit https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals
                        var replacedText = node.innerHTML.replace(new RegExp(`(${replacement})`, "i"), "<span style=\"background-color: #ffff00\">$1</span>");

                        node.innerHTML = replacedText;
                    }
                });
            }).catch(function (reason) {
                console.log("Handle rejected promise (" + reason + ") here.");
            });

            v.isReplaced = true;
        });
    } catch (err) {
        // Basically this means that an iframe had a cross-domain source
        if (err.name !== "SecurityError")
        { throw err; }
    }
}

Where I modify the node property and "tell" that I've already modify that node so I don't end up on a recursive infinite loop again.

P.S. As you can see this solution uses jQuery. I'll try to rewrite this to use just Vanilla JS.

Your solution still uses a RegExp to change the `.innerHTML` of the parent element. As a result, this will still disrupt any HTML which contains the word you are replacing, when a text node also contains that word. In other words, while it does not make the replacement unless the replacement will happen in actual text, it does not prevent the replacement from also changing the HTML (e.g. in `src="yourWord"` or `href="http://foo.com/yourWord/bar.html"`). — Makyen, Nov 23 '16 at 13:12
Just a comment, not intended to be a criticism: You use two lines of comments to explain your use of a template literal. While nice to explain it, it does not make much sense to use it in this situation when you could have replaced it with `'(' + replacement + ')' `. Using straight string concatenation would not have made you feel like you needed the two lines of comments to explain and would not have limited your code to Chrome >= ver. 41. — Makyen, Nov 23 '16 at 13:19
FYI: You are currently iterating over a `words` list, each of which you are replacing. You use two different RegExp when you could use just one (the test for existence would not care about having the word in a capture group). It would be much more efficient to pre-create a single RegExp which includes all words in the array of `words`. Doing so would result in only performing a single `.replace()` for all of the words. This would save quite a bit of time in your inner loop. [This answer](http://stackoverflow.com/a/40576258/3773011) has an example of doing this. — Makyen, Nov 23 '16 at 13:39
@Makyen thanks for your comments and updates. I'll keep what you said in mind and try to rewrite my code to be more efficient — rafaelcpalmeida, Nov 24 '16 at 11:56

Highlight a word of text on the page using .replace()

2 Answers2

Linked

Related