54

I have a challenging problem to solve. I'm working on a script which takes a regex as an input. This script then finds all matches for this regex in a document and wraps each match in its own <span> element. The hard part is that the text is a formatted html document, so my script needs to navigate through the DOM and apply the regex across multiple text nodes at once, while figuring out where it has to split text nodes if needed.

For example, with a regex that captures full sentences starting with a capital letter and ending with a period, this document:

<p>
  <b>HTML</b> is a language used to make <b>websites.</b>
  It was developed by <i>CERN</i> employees in the early 90s.
</p>

Would ideally be turned into this:

<p>
  <span><b>HTML</b> is a language used to make <b>websites.</b></span>
  <span>It was developed by <i>CERN</i> employees in the early 90s.</span>
</p>

The script should then return the list of all created spans.

I already have some code which finds all the text nodes and stores them in a list along with their position across the whole document and their depth. You don't really need to understand that code to help me and its recursive structure can be a bit confusing. The first part I'm not sure how to do is figure out which elements should be included within the span.

function findTextNodes(node, depth = -1, start = 0) {
  let list = [];

  if (node.nodeType === Node.TEXT_NODE) {
    list.push({ node, depth, start });
  } else {
    for (let i = 0; i < node.childNodes.length; ++i) {
      list = list.concat(findTextNodes(node.childNodes[i], depth+1, start));
      if (list.length) {
        start += list[list.length-1].node.nodeValue.length;
      }
    }
  }

  return list;
}

I figure I'll make a string out of all the document, run the regex through it and use the list to find which nodes correspond to witch regex matches and then split the text nodes accordingly.

But an issue arrives when I have a document like this:

<p>
  This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a>
</p>

There's a sentence which starts outside of the <a> tag but ends inside it. Now I don't want the script to split that link in two tags. In a more complex document, it could ruin the page if it did. The code could either wrap two sentences together:

<p>
  <span>This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a></span>
</p>

Or just wrap each part in its own element:

<p>
  <span>This program is </span>
  <a href="beta.html">
    <span>not stable yet.</span>
    <span>Do not use this in production yet.</span>
  </a>
</p>

There could be a parameter to specify what it should do. I'm just not sure how to figure out when an impossible cut is about to happen, and how to recover from it.

Another issue comes when I have whitespace inside a child element like this:

<p>This is a <b>sentence. </b></p>

Technically, the regex match would end right after the period, before the end of the <b> tag. However, it would be much better to consider the space as part of the match and wrap it like this:

<p><span>This is a <b>sentence. </b></span></p>

Than this:

<p><span>This is a </span><b><span>sentence.</span> </b></p>

But that's a minor issue. After all, I could just allow extra white-space to be included within the regex.

I know this might sound like a "do it for me" question and its not the kind of quick question we see on SO on a daily basis, but I've been stuck on this for a while and it's for an open-source library I'm working on. Solving this problem is the last obstacle. If you think another SE site is best suited for this question, redirect me please.

Domino
  • 6,314
  • 1
  • 32
  • 58
  • Are you sure you want your question to be *that* long? – Amit Jul 07 '15 at 17:32
  • 12
    "*I know this might sound like a "do it for me" question*" - not really, you've explained the problem, provided context and demonstrated prior attempts. It's a pretty good question and, @Amit, the length seems fine to me. – David Thomas Jul 07 '15 at 17:33
  • After @DavidThomas 's comment I took the time to fully read the question. I still think it's far too long but at least it's not "low quality boredom". Why not share a link for your open source library? Also, can you show a fiddle with 2 examples - 1 that works and 1 that doesn't? – Amit Jul 07 '15 at 18:26
  • 3
    [This answer](http://stackoverflow.com/a/1758162/1907186) fundamentally addresses the problem with applying regex to HTML. If I were in your shoes I would probably write or use a parser that can handle your use case. – Lucas Ross Jul 07 '15 at 19:30
  • @LucasRoss some of the answers in that question are comedy gold. thank you for sharing. :D – Woodrow Barlow Jul 07 '15 at 20:36
  • 1
    @Amit I'm working on it, but I don't have enough to put it online yet. The objective is to be able to specify any regex and apply a class/attributes to each match. This could be used to animate text, make each letter of a different color, color a code, parse markdown, etc. – Domino Jul 08 '15 at 23:46
  • 6
    Why does everyone who sees [regex] and [html] tags used together immediately think the question is about parsing HTML with regex? The OP clearly wants to match the *text* contained in the HTML against a regex and highlight the occurrences while leaving the other tags intact. Doesn't look like using regex to *parse* HTML to me. – Lucas Trzesniewski Jul 12 '15 at 14:00
  • 2
    You might want to take [this answer](http://stackoverflow.com/q/10416643/1048572) as a start, though it currently doesn't provide matching across nodes. – Bergi Jul 12 '15 at 18:16
  • @DavidThomas Thanks for the bounty, it attracted people with clever ideas. Can't wait to be back home to try things out. – Domino Jul 14 '15 at 17:46
  • I'm glad to have helped, however selfish my motives may have been :) – David Thomas Jul 15 '15 at 08:01

5 Answers5

43

Here are two ways to deal with this.

I don't know if the following will exactly match your needs. It's a simple enough solution to the problem, but at least it doesn't use RegEx to manipulate HTML tags. It performs pattern matching against the raw text and then uses the DOM to manipulate the content.


First approach

This approach creates only one <span> tag per match, leveraging some less common browser APIs.
(See the main problem of this approach below the demo, and if not sure, use the second approach).

The Range class represents a text fragment. It has a surroundContents function that lets you wrap a range in an element. Except it has a caveat:

This method is nearly equivalent to newNode.appendChild(range.extractContents()); range.insertNode(newNode). After surrounding, the boundary points of the range include newNode.

An exception will be thrown, however, if the Range splits a non-Text node with only one of its boundary points. That is, unlike the alternative above, if there are partially selected nodes, they will not be cloned and instead the operation will fail.

Well, the workaround is provided in the MDN, so all's good.

So here's an algorithm:

  • Make a list of Text nodes and keep their start indices in the text
  • Concatenate these nodes' values to get the text
  • Find matches over the text, and for each match:

    • Find the start and end nodes of the match, comparing the the nodes' start indices to the match position
    • Create a Range over the match
    • Let the browser do the dirty work using the trick above
    • Rebuild the node list since the last action changed the DOM

Here's my implementation with a demo:

function highlight(element, regex) {
    var document = element.ownerDocument;
    
    var getNodes = function() {
        var nodes = [],
            offset = 0,
            node,
            nodeIterator = document.createNodeIterator(element, NodeFilter.SHOW_TEXT, null, false);
            
        while (node = nodeIterator.nextNode()) {
            nodes.push({
                textNode: node,
                start: offset,
                length: node.nodeValue.length
            });
            offset += node.nodeValue.length
        }
        return nodes;
    }
    
    var nodes = getNodes(nodes);
    if (!nodes.length)
        return;
    
    var text = "";
    for (var i = 0; i < nodes.length; ++i)
        text += nodes[i].textNode.nodeValue;

    var match;
    while (match = regex.exec(text)) {
        // Prevent empty matches causing infinite loops        
        if (!match[0].length)
        {
            regex.lastIndex++;
            continue;
        }
        
        // Find the start and end text node
        var startNode = null, endNode = null;
        for (i = 0; i < nodes.length; ++i) {
            var node = nodes[i];
            
            if (node.start + node.length <= match.index)
                continue;
            
            if (!startNode)
                startNode = node;
            
            if (node.start + node.length >= match.index + match[0].length)
            {
                endNode = node;
                break;
            }
        }
        
        var range = document.createRange();
        range.setStart(startNode.textNode, match.index - startNode.start);
        range.setEnd(endNode.textNode, match.index + match[0].length - endNode.start);
        
        var spanNode = document.createElement("span");
        spanNode.className = "highlight";

        spanNode.appendChild(range.extractContents());
        range.insertNode(spanNode);
        
        nodes = getNodes();
    }
}

// Test code
var testDiv = document.getElementById("test-cases");
var originalHtml = testDiv.innerHTML;
function test() {
    testDiv.innerHTML = originalHtml;
    try {
        var regex = new RegExp(document.getElementById("regex").value, "g");
        highlight(testDiv, regex);
    }
    catch(e) {
        testDiv.innerText = e;
    }
}
document.getElementById("runBtn").onclick = test;
test();
.highlight {
  background-color: yellow;
  border: 1px solid orange;
  border-radius: 5px;
}

.section {
  border: 1px solid gray;
  padding: 10px;
  margin: 10px;
}
<form class="section">
  RegEx: <input id="regex" type="text" value="[A-Z].*?\." /> <button id="runBtn">Highlight</button>
</form>

<div id="test-cases" class="section">
  <div>foo bar baz</div>
  <p>
    <b>HTML</b> is a language used to make <b>websites.</b>
 It was developed by <i>CERN</i> employees in the early 90s.
  <p>
  <p>
    This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a>
  </p>
  <div>foo bar baz</div>
</div>

Ok, that was the lazy approach which, unfortunately doesn't work for some cases. It works well if you only highlight across inline elements, but breaks when there are block elements along the way because of the following property of the extractContents function:

Partially selected nodes are cloned to include the parent tags necessary to make the document fragment valid.

That's bad. It'll just duplicate block-level nodes. Try the previous demo with the baz\s+HTML regex if you want to see how it breaks.


Second approach

This approach iterates over the matching nodes, creating <span> tags along the way.

The overall algorithm is straightforward as it just wraps each matching node in its own <span>. But this means we have to deal with partially matching text nodes, which requires some more effort.

If a text node matches partially, it's split with the splitText function:

After the split, the current node contains all the content up to the specified offset point, and a newly created node of the same type contains the remaining text. The newly created node is returned to the caller.

function highlight(element, regex) {
    var document = element.ownerDocument;
    
    var nodes = [],
        text = "",
        node,
        nodeIterator = document.createNodeIterator(element, NodeFilter.SHOW_TEXT, null, false);
        
    while (node = nodeIterator.nextNode()) {
        nodes.push({
            textNode: node,
            start: text.length
        });
        text += node.nodeValue
    }
    
    if (!nodes.length)
        return;

    var match;
    while (match = regex.exec(text)) {
        var matchLength = match[0].length;
        
        // Prevent empty matches causing infinite loops        
        if (!matchLength)
        {
            regex.lastIndex++;
            continue;
        }
        
        for (var i = 0; i < nodes.length; ++i) {
            node = nodes[i];
            var nodeLength = node.textNode.nodeValue.length;
            
            // Skip nodes before the match
            if (node.start + nodeLength <= match.index)
                continue;
        
            // Break after the match
            if (node.start >= match.index + matchLength)
                break;
            
            // Split the start node if required
            if (node.start < match.index) {
                nodes.splice(i + 1, 0, {
                    textNode: node.textNode.splitText(match.index - node.start),
                    start: match.index
                });
                continue;
            }
            
            // Split the end node if required
            if (node.start + nodeLength > match.index + matchLength) {
                nodes.splice(i + 1, 0, {
                    textNode: node.textNode.splitText(match.index + matchLength - node.start),
                    start: match.index + matchLength
                });
            }
            
            // Highlight the current node
            var spanNode = document.createElement("span");
            spanNode.className = "highlight";
            
            node.textNode.parentNode.replaceChild(spanNode, node.textNode);
            spanNode.appendChild(node.textNode);
        }
    }
}

// Test code
var testDiv = document.getElementById("test-cases");
var originalHtml = testDiv.innerHTML;
function test() {
    testDiv.innerHTML = originalHtml;
    try {
        var regex = new RegExp(document.getElementById("regex").value, "g");
        highlight(testDiv, regex);
    }
    catch(e) {
        testDiv.innerText = e;
    }
}
document.getElementById("runBtn").onclick = test;
test();
.highlight {
  background-color: yellow;
}

.section {
  border: 1px solid gray;
  padding: 10px;
  margin: 10px;
}
<form class="section">
  RegEx: <input id="regex" type="text" value="[A-Z].*?\." /> <button id="runBtn">Highlight</button>
</form>

<div id="test-cases" class="section">
  <div>foo bar baz</div>
  <p>
    <b>HTML</b> is a language used to make <b>websites.</b>
 It was developed by <i>CERN</i> employees in the early 90s.
  <p>
  <p>
    This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a>
  </p>
  <div>foo bar baz</div>
</div>

This should be good enough for most cases I hope. If you need to minimize the number of <span> tags it can be done by extending this function, but I wanted to keep it simple for now.

Lucas Trzesniewski
  • 50,214
  • 11
  • 107
  • 158
  • 1
    Thanks a lot! I'm at my workplace and don't have any internet at my new apartment, but as soon as I can I'll be looking at all the little details. I didn't even know the nodeiterator object. (Just to be on the safe side I'll ask: can I use part of this in an open source library released under the MIT license? I will credit you if you want). – Domino Jul 14 '15 at 14:06
  • 1
    @JacqueGoupil sure, go ahead, and leave a comment with a link to your lib when it's available - I'm interested in any little details you can find, as I expect a similar requirement to show up in a project of mine (I'll probably have to reimplement that in C#). – Lucas Trzesniewski Jul 14 '15 at 16:19
  • 1
    Congratulations, and thank you for a creative solution :) – David Thomas Jul 19 '15 at 00:14
  • Hi @Lucas Trzesniewski I'm not a topic starter, can I use a part of this code under MIT license in my project? – Evgeniy Sep 06 '16 at 12:13
  • @User sure, go on. Just put a link to this post in a comment, as the SO licence requires that. – Lucas Trzesniewski Sep 06 '16 at 12:15
8

function parseText( element ){
  var stack = [ element ];
  var group = false;
  var re = /(?!\s|$).*?(\.|$)/;
  while ( stack.length > 0 ){
    var node = stack.shift();
    if ( node.nodeType === Node.TEXT_NODE )
    {
      if ( node.textContent.trim() != "" )
      {
        var match;
        while( node && (match = re.exec( node.textContent )) )
        {
          var start  = group ? 0 : match.index;
          var length = match[0].length + match.index - start;
          if ( start > 0 )
          {
            node = node.splitText( start );
          }
          var wrapper = document.createElement( 'span' );
          var next    = null;
          if ( match[1].length > 0 ){
            if ( node.textContent.length > length )
              next = node.splitText( length );
            group = false;
            wrapper.className = "sentence sentence-end";
          }
          else
          {
            wrapper.className = "sentence";
            group = true;
          }
          var parent  = node.parentNode;
          var sibling = node.nextSibling;
          wrapper.appendChild( node );
          if ( sibling )
            parent.insertBefore( wrapper, sibling );
          else
            parent.appendChild( wrapper );
          node = next;
        }
      }
    }
    else if ( node.nodeType === Node.ELEMENT_NODE || node.nodeType === Node.DOCUMENT_NODE )
    {
      stack.unshift.apply( stack, node.childNodes );
    }
  }
}

parseText( document.body );
.sentence {
  text-decoration: underline wavy red;
}

.sentence-end {
  border-right: 1px solid red;
}
<p>This is a sentence. This is another sentence.</p>
<p>This sentence has <strong>emphasis</strong> inside it.</p>
<p><span>This sentence spans</span><span> two elements.</span></p>
MT0
  • 143,790
  • 11
  • 59
  • 117
  • @MTO I am trying to use this. https://jsfiddle.net/remixosbox72/9pmb0Lch/ it breaks in my example html. which is also very long. I have implemented all of the solutions even last one "flat dom" but couldn't get the processing time into mili seconds. Only your solution here is fast enough. Would you help out? Since I am writing this for my mobile it needs to be faster. – JBaba Jul 05 '20 at 15:23
  • @JBaba "It breaks" is not a constructive statement because it doesn't tell me what behaviour you are expecting and what you are seeing and I then need to guess what the issue is. I suggest you reduce your example to a [MRE] which just contains the minimum code necessary to demonstrate the issue when "it breaks" and then ask a new question with all the details of how "it breaks" and what you expect and what you have tried and reference this question and then post another comment here with a link to that new question (so you can tag me and I'll have a look at it if I have time). – MT0 Jul 05 '20 at 16:30
  • @MTO Sorry for inadequate comment I will do the suggested. – JBaba Jul 05 '20 at 16:48
  • 1
    @MTO I wanted to let you know that after fixing complex html use cases I wasn't able to scale this approach to 1 million words and still process it in mills. I came up with fourth approach by just using ranges (Mark.js). – JBaba Jul 12 '20 at 20:45
  • Bonus points for not relying on `innerHTML`! ✨ (which would break event listeners) – gnucchi Sep 17 '22 at 19:02
5

I would use "flat DOM" representation for such task.

In flat DOM this paragraph

<p>abc <a href="beta.html">def. ghij.</p>

will be represented by two vectors:

chars: "abc def. ghij.",
props:  ....aaaaaaaaaa, 

You will use normal regexp on chars to mark span areas on props vector:

chars: "abc def. ghij."
props:  ssssaaaaaaaaaa  
            ssss sssss

I am using schematic representation here, it's real structure is an array of arrays:

props: [
  [s],
  [s],
  [s],
  [s],
  [a,s],
  [a,s],
  ...
]

conversion tree-DOM <-> flat-DOM can use simple state automata.

At the end you will convert flat DOM to tree DOM that will look like:

<p><s>abc </s><a href="beta.html"><s>def.</s> <s>ghij.</s></p>

Just in case: I am using this approach in my HTML WYSIWYG editors.

c-smile
  • 26,734
  • 7
  • 59
  • 86
5

As everyone has already said, this is more of an academic question since this shouldn't really be the way you do it. That being said, it seemed like fun so here's one approach.

EDIT: I think I got the gist of it now.

function myReplace(str) {
  myRegexp = /((^<[^>*]>)+|([^<>\.]*|(<[^\/>]*>[^<>\.]+<\/[^>]*>)+)*[^<>\.]*\.\s*|<[^>]*>|[^\.<>]+\.*\s*)/g; 
  arr = str.match(myRegexp);
  var out = "";
  for (i in arr) {
var node = arr[i];
if (node.indexOf("<")===0) out += node;
else out += "<span>"+node+"</span>"; // Here is where you would run whichever 
                                     // regex you want to match by
  }
  document.write(out.replace(/</g, "&lt;").replace(/>/g, "&gt;")+"<br>");
  console.log(out);
}

myReplace('<p>This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a></p>');
myReplace('<p>This is a <b>sentence. </b></p>');
myReplace('<p>This is a <b>another</b> and <i>more complex</i> even <b>super complex</b> sentence.</p>');
myReplace('<p>This is a <b>a sentence</b>. Followed <i>by</i> another one.</p>');
myReplace('<p>This is a <b>an even</b> more <i>complex sentence. </i></p>');

/* Will output:
<p><span>This program is </span><a href="beta.html"><span>not stable yet. </span><span>Do not use this in production yet.</span></a></p>
<p><span>This is a </span><b><span>sentence. </span></b></p>
<p><span>This is a <b>another</b> and <i>more complex</i> even <b>super complex</b> sentence.</span></p>
<p><span>This is a <b>a sentence</b>. </span><span>Followed <i>by</i> another one.</span></p>
<p><span>This is a </span><b><span>an even</span></b><span> more </span><i><span>complex sentence. </span></i></p>
*/
Jan
  • 5,688
  • 3
  • 27
  • 44
  • Re-reading the text I might not have understood the full requirement. If there's something I've missed, please advise. – Jan Jul 12 '15 at 01:39
  • I think using the JavaScript DOM interface will be much more practical than parsing HTML tags with regex, but I'm impressed that you got an alright result with that little code. – Domino Jul 15 '15 at 14:05
  • 2
    Like I said, this answer is purely academical. Could it be done using a regex? Pretty much. Is it likely to have unforeseen sideeffects? Probably. Should you actually be using any other method? Most definitely. Was it a fun little puzzle to solve? You bet. – Jan Jul 15 '15 at 15:21
0

I have spent a long time implementing all of approaches given in this thread.

  1. Node iterator
  2. Html parsing
  3. Flat Dom

For any of this approaches you have to come up with technique to split entire html into sentences and wrap into span (some might want words in span). As soon as we do this we will run into performance issues (I should say beginner like me will run into performance issues).

Performance Bottleneck

I couldn't scale any of this approach to 70k - 200k words and still do it in milli seconds. Wrapping time keeps increasing as words in pages keep increasing.

With complex html pages with combinations of text-node and different elements we soon run into trouble and with this technical debt keeps increasing.

Best approach : Mark.js (according to me)

Note: if you do this right you can process any number of words in millis.

Just use Ranges I want to recommend Mark.js and following example,

var instance = new Mark(document.body);
instance.markRanges([{
    start: 15,
    length: 5
}, {
    start: 25:
    length: 8
}]); /

With this we can treat entire body.textContent as string and just keep highlighting substring.

No DOM structure is modified here. And you can easily fix complex use cases and technical debt doesn't increase with more if and else.

Additionally once text is highlighted with html5 mark tag you can post process these tags to find out bounding rectangles.

Also look into Splitting.js if you just want split html documents into words/chars/lines and many more... But one draw back for this approach is that Splitting.js collapses additional spaces in the document so we loose little bit of info.

Thanks.

JBaba
  • 590
  • 10
  • 30