I'd like a JavaScript regex to wrap each word from a given list in a given start tag (e.g. <span>) and end tag (e.g. </span>), but only if the word is actually "visible text" on the page, and not inside an HTML attribute (such as a link's title attribute) or inside a <script></script> block.

I've created a JS Fiddle with the basic setup: http://jsfiddle.net/4YCR6/1/

m14t
  • See http://stackoverflow.com/questions/3241169/highlight-search-terms-select-only-leaf-nodes – Ryan Mar 04 '13 at 22:38
  • As the others said, it's usually not the best idea to handle HTML with a regex. But there are cases where it's just the easiest way. Try this: [updated jsfiddle](http://jsfiddle.net/4YCR6/4/), also on [rubular](http://rubular.com/r/NMazXVpKfE) – morja May 05 '11 at 22:58

2 Answers

HTML is too complex to reliably parse with a regular expression.

If you're looking to do this client-side, you can create a document fragment and/or disconnected DOM node (neither of which is displayed anywhere) and initialize it with your HTML string, then walk through the resulting DOM tree and process the text nodes. (Or use a library to help you do that, although it's actually quite simple.)

Here's a DOM walking example. It is slightly simpler than your problem because it only updates the text; it doesn't add new elements to the structure (wrapping parts of the text in spans means changing the structure), but it should get you going. Notes on what you'll need to change are at the end.

var html =
    "<p>This is a test.</p>" +
    "<form><input type='text' value='test value'></form>" +
    "<p class='testing test'>Testing here too</p>";
var frag = document.createDocumentFragment();
var body = document.createElement('body');
var node, next;

// Turn the HTML string into a DOM tree
body.innerHTML = html;

// Walk the dom looking for the given text in text nodes
walk(body);

// Insert the result into the current document via a fragment
node = body.firstChild;
while (node) {
  next = node.nextSibling;
  frag.appendChild(node);
  node = next;
}
document.body.appendChild(frag);

// Our walker function
function walk(node) {
  var child, next;

  switch (node.nodeType) {
    case 1:  // Element
    case 9:  // Document
    case 11: // Document fragment
      child = node.firstChild;
      while (child) {
        next = child.nextSibling;
        walk(child);
        child = next;
      }
      break;
    case 3: // Text node
      handleText(node);
      break;
  }
}

function handleText(textNode) {
  textNode.nodeValue = textNode.nodeValue.replace(/test/gi, "TEST");
}

The changes you'll need to make will be in handleText. Specifically, rather than updating nodeValue, you'll need to:

  • Find the index of the beginning of each word within the nodeValue string.
  • Use Text#splitText to split the text node into up to three text nodes (the part before your matching text, the part that is your matching text, and the part following your matching text).
  • Use document.createElement to create the new span (this is literally just span = document.createElement('span')).
  • Use Node#insertBefore to insert the new span in front of the third text node (the one containing the text following your matched text); it's okay if you didn't need to create a third node because your matched text was at the end of the text node, just pass in null as the refChild.
  • Use Node#appendChild to move the second text node (the one with the matching text) into the span. (No need to remove it from its parent first; appendChild does that for you.)
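
The steps above could be sketched roughly as follows. This is a minimal sketch rather than a definitive implementation: `escapeRegExp` and `findMatches` are hypothetical helper names, and this `handleText` takes a second `words` argument that the walker above does not pass, so the call in `walk` would need adjusting. It assumes the words should match literally and case-insensitively.

```javascript
// Hypothetical helper: escape regex metacharacters so a word matches literally.
function escapeRegExp(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: find every occurrence of any word in `text`.
// Returns [{ index, length }, ...] in ascending index order.
function findMatches(text, words) {
  var usable = words.filter(function (w) { return w.length > 0; });
  if (!usable.length) return [];
  // Longest words first so "test2" wins over "test" at the same position
  usable.sort(function (a, b) { return b.length - a.length; });
  var regex = new RegExp(usable.map(escapeRegExp).join("|"), "gi");
  var matches = [], m;
  while ((m = regex.exec(text)) !== null) {
    matches.push({ index: m.index, length: m[0].length });
  }
  return matches;
}

// Sketch of a handleText that wraps each match in a <span>.
function handleText(textNode, words) {
  var matches = findMatches(textNode.nodeValue, words);
  // Work from the last match to the first so earlier indexes stay valid
  for (var i = matches.length - 1; i >= 0; i--) {
    // textNode keeps the text before the match; matchNode gets the rest
    var matchNode = textNode.splitText(matches[i].index);
    // Split off the text after the match, if there is any
    if (matchNode.length > matches[i].length) {
      matchNode.splitText(matches[i].length);
    }
    var span = document.createElement("span");
    // nextSibling is null when the match is the last child, which appends
    matchNode.parentNode.insertBefore(span, matchNode.nextSibling);
    // appendChild moves the matched text node into the span
    span.appendChild(matchNode);
  }
}
```

Because `walk` stores `child.nextSibling` before calling `walk(child)`, the spans created here are not visited again during the same walk.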
T.J. Crowder

T.J. Crowder's answer is correct. I've gone a little further code-wise: here's a fully-formed example that works in all major browsers. I've posted variations of this code on Stack Overflow before (here and here, for example), and made it nice and generic so I (or anyone else) don't have to change it much to reuse it.

jsFiddle example: http://jsfiddle.net/7Vf5J/38/

Code:

// Reusable generic function
function surroundInElement(el, regex, surrounderCreateFunc) {
    // script and style elements are left alone; tagName is uppercase in
    // HTML documents, so the test must be case-insensitive
    if (!/^(script|style)$/i.test(el.tagName)) {
        var child = el.lastChild;
        while (child) {
            if (child.nodeType == 1) {
                surroundInElement(child, regex, surrounderCreateFunc);
            } else if (child.nodeType == 3) {
                surroundMatchingText(child, regex, surrounderCreateFunc);
            }
            child = child.previousSibling;
        }
    }
}

// Reusable generic function
function surroundMatchingText(textNode, regex, surrounderCreateFunc) {
    var parent = textNode.parentNode;
    var result, surroundingNode, matchedTextNode, matchLength, matchedText;
    while ( textNode && (result = regex.exec(textNode.data)) ) {
        matchedTextNode = textNode.splitText(result.index);
        matchedText = result[0];
        matchLength = matchedText.length;
        textNode = (matchedTextNode.length > matchLength) ?
            matchedTextNode.splitText(matchLength) : null;
        // Ensure searching starts at the beginning of the text node
        regex.lastIndex = 0;
        surroundingNode = surrounderCreateFunc(matchedTextNode.cloneNode(true));
        parent.insertBefore(surroundingNode, matchedTextNode);
        parent.removeChild(matchedTextNode);
    }
}

// This function does the surrounding for every matched piece of text
// and can be customized to do what you like
function createSpan(matchedTextNode) {
    var el = document.createElement("span");
    el.style.color = "red";
    el.appendChild(matchedTextNode);
    return el;
}

// The main function
function wrapWords(container, words) {
    // Replace the words one at a time to ensure "test2" gets matched
    for (var i = 0, len = words.length; i < len; ++i) {
        surroundInElement(container, new RegExp(words[i]), createSpan);
    }
}

wrapWords(document.getElementById("container"), ["test2", "test"]);
Tim Down
  • this is just what i was looking for, how could i make this completely ignore case though? – Mike Mellor Jun 05 '14 at 15:08
  • @MikeMellor: Change `new RegExp(words[i], "g")` to `new RegExp(words[i], "gi")`. – Tim Down Jun 05 '14 at 15:20
  • god that was easy, i really should learn about regular expressions. Thanks Tim – Mike Mellor Jun 05 '14 at 15:22
  • @MikeMellor: Everyone should learn regular expressions :) – Tim Down Jun 05 '14 at 16:41
  • @TimDown: thanks for the code. However, it must be noted that it has a bug: it skips some matches. To fix it, `regex.lastIndex = 0;` must be added after the `textNode = ...` line in surroundMatchingText. – dmitrych Jul 13 '15 at 02:31
  • @dchervov: I don't think that's true, but I could be wrong. Have you got an example? – Tim Down Jul 13 '15 at 08:56
  • @TimDown: yes, see how it does not highlight one of the "test" words here: https://jsfiddle.net/7Vf5J/36/ . And it does highlight it properly with the suggested fix. lastIndex must be reset to zero because textNode is updated to the remainder of the string - so the search must start from the beginning of the remainder. – dmitrych Jul 13 '15 at 09:16
  • @dchervov: My apologies, you're right. I'll fix it. – Tim Down Jul 13 '15 at 09:27
  • @dchervov: Having had a closer look, I think changing the regular expression to not have its global flag set is a better fix and is how I originally intended the `surroundMatchingText()` function to work. – Tim Down Jul 13 '15 at 09:39
  • @dchervov: ... although your fix makes it work regardless of whether the regex's global flag is set. I'm coming round to your fix instead. – Tim Down Jul 13 '15 at 09:41
  • @TimDown: ok, thanks – dmitrych Jul 13 '15 at 17:59
  • @TimDown Thank you, this is great ! Any chance you could update it with an optional 3rd argument that would allow to replace text as well ? – Aerodynamic Oct 25 '21 at 22:57
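
The `lastIndex` issue discussed in the comments above can be reproduced without a DOM. The sketch below uses a hypothetical function, `countMatchesAcrossRemainders`, which mimics the loop in `surroundMatchingText` by repeatedly searching the remainder of a string with the same regex object. It is only an illustration of the behaviour, not part of the answer's code.

```javascript
// Hypothetical demo: search `text`, then keep searching only the part after
// each match, reusing one regex object (as surroundMatchingText does with
// the text node that follows each match).
function countMatchesAcrossRemainders(text, regex, resetLastIndex) {
  var count = 0, result;
  while (text && (result = regex.exec(text))) {
    count++;
    // Continue with the remainder after the match
    text = text.slice(result.index + result[0].length);
    // A global regex remembers where the previous match ended; that offset
    // refers to the old string, not to the remainder
    if (resetLastIndex) {
      regex.lastIndex = 0;
    }
  }
  return count;
}

console.log(countMatchesAcrossRemainders("test test", /test/g, false)); // 1
console.log(countMatchesAcrossRemainders("test test", /test/g, true));  // 2
console.log(countMatchesAcrossRemainders("test test", /test/, false));  // 2
```

With the global flag and no reset, the second search starts at offset 4 of the remainder " test" and so misses the match at offset 1. Resetting `lastIndex` or dropping the global flag both avoid this, which matches the two fixes mentioned in the comments.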