I need a little help with Regular Expressions.

I'm using JavaScript and jQuery to hyperlink terms within an HTML document, using the code below. I need to do this for a number of terms in a massive document.

var searchterm = "Water";

jQuery('#content p').each(function() {

  var content = jQuery(this),
      txt = content.html(),
      regex = new RegExp('(' + searchterm + ')(?![^(<a.*?>).]*?<\/a>)','gi'),
      // .find() matches descendant elements, not text, so test the markup string instead
      found = txt.search(regex);

  if (found !== -1) {
    //hyperlink the search term
    txt = txt.replace(regex, '<a href="/somelink">$1</a>');
    content.html(txt);
  }
});

There are, however, a number of instances I do not want to match, and due to time constraints and brain melt I'm reaching out for some assistance.


EDIT: I've updated the codepen below based on the excellent example provided by @ggorlen, thank you!

Example: https://codepen.io/julian-young/pen/KKwyZMr

Julian Young
  • If you're working with HTML, why not just search the DOM or use an HTML parser? – ggorlen Jan 03 '20 at 16:00
  • [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/a/1732454/3600709) - don't parse HTML with regex. – ctwheels Jan 03 '20 at 16:01
  • @ggorlen, do tell me more. I'm actually working with Advanced Custom Fields, PHP and WordPress. I'm trying to stay away from the backend as much as possible and create a simple JavaScript-based layer. I'm also trying to be as efficient as possible: it's a huge document and I need to match many terms. – Julian Young Jan 03 '20 at 16:02
  • @ctwheels - I know, I know :( – Julian Young Jan 03 '20 at 16:02
  • Well, the whole point of jQuery (and JS, for that matter) is that it searches and manipulates the DOM for you. If you dump it to `.html()` and use regex, this is like buying a bike and [carrying it instead of riding it](https://www.youtube.com/watch?v=peKQ33hCZ1U). – ggorlen Jan 03 '20 at 16:04
  • @ggorlen can you point me to any good examples of what you mean? I've done quite a lot of searching for jQuery text searching and the results all seemed to point back to regex. Do you mean manually parsing the text without regex? – Julian Young Jan 03 '20 at 16:06
  • Sure, I added an answer. Let me know how it works for you. – ggorlen Jan 03 '20 at 16:34
  • What are you trying to match with `(?![^(<a.*?>).]*?<\/a>)`? – Toto Jan 03 '20 at 17:33

1 Answer

Dumping the entire DOM to raw text and parsing it with regex circumvents the primary purpose of jQuery (and JS, by extension), which is to traverse and manipulate the DOM as an abstract tree of nodes.

Text nodes have a `nodeType` of `Node.TEXT_NODE`, which we can check during a traversal to pick out the non-link text you're interested in.

After obtaining a text node, regex can be applied appropriately (parsing text, not HTML). I used <mark> for demonstration purposes, but you can make this an anchor tag or whatever you need.

jQuery gives you a `replaceWith` method that replaces the node itself with new content, so once the regex substitution is done, the text node can be swapped for the resulting markup.

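$('#content li').contents().each(function () {
  // Only direct text-node children of each <li>; text inside <a> or <strong> is skipped.
  if (this.nodeType === Node.TEXT_NODE) {
    var pattern = /(\b[Ww]aters?(?!-)\b)/g;
    var replacement = '<mark>$1</mark>';
    // The substituted string is parsed as HTML and takes the text node's place.
    $(this).replaceWith(this.nodeValue.replace(pattern, replacement));
  }
});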
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<h1>Example Content</h1>
<div id="content">
  <ul>
    <li>Water is a fascinating subject. - <strong>match</strong></li>
    <li>We all love water. - <strong>match</strong></li>
    <li>ice; water; steam - <strong>match</strong></li>
    <li>The beautiful waters of the world - <strong>match</strong> (including the s)</li>
    <li>and all other water-related subjects - <strong>no match</strong></li>
    <li>and this watery topic of - <strong>no match</strong></li>
    <li>of WaterStewardship looks at how best - <strong>no match</strong></li>
    <li>On the topic of <a href="/governance">water governance</a> - <strong>no match</strong></li>
    <li>and other <a href="/water">water</a> related things - <strong>no match</strong></li>
    <li>the best of <a href="/allthingswater">all things water</a> - <strong>no match</strong></li>
  </ul>
</div>
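
To see what the pattern does independently of the DOM, it can be run against plain strings. The sketch below uses text taken from the example list as a stand-in for the text-node values; the `samples` and `results` names are made up for illustration.

```javascript
// Same pattern as in the answer: "water" or "Water", an optional plural "s",
// not followed by a hyphen, and bounded by word boundaries on both sides.
const pattern = /(\b[Ww]aters?(?!-)\b)/g;

// Stand-ins for the text-node values of the <li> elements. Link text is not
// listed because it lives in child nodes of <a>, which the traversal skips.
const samples = [
  "Water is a fascinating subject. - ",
  "We all love water. - ",
  "The beautiful waters of the world - ",
  "and all other water-related subjects - ",
  "and this watery topic of - ",
  "of WaterStewardship looks at how best - ",
];

const results = samples.map((text) => text.replace(pattern, "<mark>$1</mark>"));

for (const line of results) {
  console.log(line);
}
```

The first three strings come back with the term wrapped in `<mark>`, while "water-related", "watery" and "WaterStewardship" are returned unchanged: the first is rejected by the `(?!-)` lookahead and the other two by the trailing `\b`.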

You can also do this without jQuery and apply it to every element in the document:

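// Visit every element under <body> except links, so text directly inside <a> is left alone.
for (const parent of document.querySelectorAll("body *:not(a)")) {
  // Copy childNodes first: it is a live list and the loop body mutates it.
  for (const child of [...parent.childNodes]) {
    if (child.nodeType === Node.TEXT_NODE) {
      const pattern = /(\b[Ww]aters?(?!-)\b)/g;
      const replacement = "<mark>$1</mark>";
      // Wrap the rewritten text in a <span> so the markup can be parsed via innerHTML.
      const subNode = document.createElement("span");
      subNode.innerHTML = child.textContent.replace(pattern, replacement);
      parent.insertBefore(subNode, child);
      parent.removeChild(child);
    }
  }
}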
<div>
  hello water
  <div>
    <div>
      I love Water.
      <a href="">more water</a>
    </div>
    watership down
    <h4>watery water</h4>
    <p>
      waters
    </p>
    foobar <a href="">water</a> water
  </div>
</div>
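
Since the question mentions many terms and anchor tags rather than `<mark>`, here is a rough sketch of how the per-text-node substitution might be extended. Everything in it is an assumption for illustration: the `termLinks` map, its placeholder URLs, and the helper names `escapeRegExp`, `escapeHtml`, `buildPattern` and `linkTerms` are hypothetical, and matching is made case-insensitive as in the question's original regex. It also escapes the surrounding text, because a text node's value is plain text and gets re-parsed as HTML once it is inserted.

```javascript
// Hypothetical term-to-URL map; the terms and paths are placeholders.
// Keys are assumed to be lowercase because lookups lowercase the match.
const termLinks = {
  water: "/water",
  steam: "/steam",
};

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Escape characters that have special meaning in HTML text and attribute values.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build one case-insensitive pattern covering every term, with the same
// optional plural "s" and no-hyphen rule as the single-term pattern.
function buildPattern(links) {
  const alternatives = Object.keys(links).map(escapeRegExp).join("|");
  return new RegExp("\\b(" + alternatives + ")(s?)(?!-)\\b", "gi");
}

// Turn the plain text of one text node into an HTML string with links.
function linkTerms(text, links) {
  const pattern = buildPattern(links);
  let html = "";
  let lastIndex = 0;
  for (const match of text.matchAll(pattern)) {
    const href = links[match[1].toLowerCase()];
    html += escapeHtml(text.slice(lastIndex, match.index));
    html += '<a href="' + escapeHtml(href) + '">' + escapeHtml(match[0]) + "</a>";
    lastIndex = match.index + match[0].length;
  }
  return html + escapeHtml(text.slice(lastIndex));
}

const linked = linkTerms("ice; water; steam & watery water-based Waters", termLinks);
console.log(linked);
```

In either traversal above, the call `linkTerms(child.textContent, termLinks)` (or `linkTerms(this.nodeValue, termLinks)` in the jQuery version) could stand in for the `.replace(pattern, replacement)` expression. For a very large document it would probably be worth building the pattern once outside the loop instead of on every call.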
ggorlen
  • Looks very good, so happy to have discovered nodeTypes! Thank you. This makes a lot of sense; I didn't realise each DOM element could be broken down in such a way. Steering away from this -> https://regex101.com/r/3upvk3/3 – Julian Young Jan 03 '20 at 16:50
  • No problem, although what you mentioned about WP and Advanced Custom Fields suggests that even this may [not be the best way to accomplish whatever it is you're really trying to achieve](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/233676#233676). – ggorlen Jan 03 '20 at 16:54
  • All now up and running perfectly and dynamically! Many, many thanks! I learnt a great deal! – Julian Young Jan 03 '20 at 22:43