What are reliable approaches for finding words/text within an HTML-markup's text-content and replacing the matches with highlighting markup?

Question

I have some text and a function that receives a word or phrase. The function has to return the same text with that keyword or phrase wrapped in a span that carries a class.

Example:

If I have this

text = <a href="/redirect?uri=https%3A%2F%2Fwww.website.com&context=post" target="_blank" rel="noopener noreferrer">https://www.website.com</a>

I want

text = <a href="/redirect?uri=https%3A%2F%2Fwww.website.com&context=post" target="_blank" rel="noopener noreferrer">https://www.<span class="bold">website</span>.com</a>

but what I'm getting is

text = <a href="/redirect?uri=https%3A%2F%2Fwww.<span class="bold"> website </span>.com&amp;context=post" target="_blank" rel="noopener noreferrer">https://www.<span class="bold"> website </span>.com</a>

What I'm doing is

        ...
        const escapedPhrases = ["\\bwebsite\\b"]
        const regex = new RegExp(`(${escapedPhrases.join('|')})`, 'gi');
        text = text.replace(
          regex,
          '<span class="bold"> $1 </span>'
        );

How can I improve my regex?

I have also tried to "clean" the text after the replacement by removing <span class="bold"> $1 </span> wherever it ends up inside the href, but with no success.

UPDATE for clarification:

I have this text:

text = `Follow me on 
<a href="/redirect?uri=https%3A%2F%2Fwww.twitter.com&context=post" target="_blank" rel="noopener noreferrer">https://www.twitter.com</a>

Thanks!`

Example 1: I want to highlight the word twitter:

For this I want to add a span with class bold for example around twitter:

text = `Follow me on 
<a href="/redirect?uri=https%3A%2F%2Fwww.twitter.com&context=post" target="_blank" rel="noopener noreferrer">https://www.<span class="bold">twitter</span>.com</a>

Thanks!`

Example 2: I want to highlight the word twitter.com:

For this I want to add a span with class bold for example around twitter.com:

text = `Follow me on 
<a href="/redirect?uri=https%3A%2F%2Fwww.twitter.com&context=post" target="_blank" rel="noopener noreferrer">https://www.<span class="bold">twitter.com</span></a>

Thanks!`

Example 3: I want to highlight the phrase https://www.twitter.com:

For this I want to add a span with class bold for example around https://www.twitter.com:

text = `Follow me on 
<a href="/redirect?uri=https%3A%2F%2Fwww.twitter.com&context=post" target="_blank" rel="noopener noreferrer"><span class="bold">https://www.twitter.com</span></a>

Thanks!`

Example 4:

I have this text and want to highlight twitter:

text = `Follow me on 
<a href="/redirect?uri=https%3A%2F%2Fwww.twitter.com&context=post" target="_blank" rel="noopener noreferrer">https://www.twitter.com</a>

Thanks for follow my twitter!`

Then I have to return

text = `Follow me on 
<a href="/redirect?uri=https%3A%2F%2Fwww.twitter.com&context=post" target="_blank" rel="noopener noreferrer">https://www.<span class="bold">twitter</span>.com</a>

Thanks for follow my <span class="bold">twitter</span>!`

@WiktorStribiżew That's correct! my bad, I updated my post! — sara lance, Apr 16 '21 at 20:22
My input is a string that can have a lot of words and in the middle a link like the one of the example. I have to search a key word in this string. In most cases it works okay but if the key word happens to be a word that is in the link, like `website` or `website.com` or `https://www.website.com/` in this example, the link breaks. — sara lance, Apr 16 '21 at 21:55
So you just need to replace `website` keyword inside `a` tags ? if you find `website` outside, you don't touch it ? Please provide and more complete sample to a better understanding — thibsc, Apr 16 '21 at 22:04
No, the only place I don't want to touch it is inside the href of the link. All the other places I want to add a span around the word, example, ` website ` or ` website.com ` or ` https://www.website.com/ `, depends of the keyword I receive. — sara lance, Apr 16 '21 at 22:09
Seriously, regex is **not** the right tool for the entirety of this job. If need to parse HTML markup, use a tool that can parse the markup. Regular expressions are proved to be unable to do this. Got a browser to hand? It has [good capabilities for parsing markup](https://stackoverflow.com/a/10585079/14357). Using node? [jsdom](https://www.npmjs.com/package/jsdom) will make easy work of the markup. When you've separated your elements from your text nodes, that's the time to consider regex. Regex isn't an HTML parser. Use the right tools for the job and save yourself from the brittle solution — spender, Apr 16 '21 at 22:28
Take a read over this [ancient stackoverflow post](https://stackoverflow.com/q/1732348/14357) if I haven't managed to convince you that you're moving in the wrong direction. — spender, Apr 16 '21 at 22:40
@saralance ... are there any questions regarding the approaches of the so far two beneath answers? — Peter Seliger, Apr 20 '21 at 11:03

thibsc · Accepted Answer · 2021-04-16T23:43:15.047

Regex is not the solution to everything. In this case, to modify only the text content and not the attributes, the following code may fit your needs:

let text = `Follow me on 
<a href="/redirect?uri=https%3A%2F%2Fwww.twitter.com&context=post" target="_blank" rel="noopener noreferrer">https://www.twitter.com</a>

Thanks for follow my twitter!`;

const replaceKeyword = (keyword, text) => {
  let template = document.createElement('template');
  template.innerHTML = text;
  let children = template.content.childNodes;
  
  let str = '';
  let substitute = `<span style='color:red;font-weight:bold;'>${keyword}</span>`;
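  for (let child of children){
    if (child.nodeType === 3){
      // #text: replace every occurrence, not just the first one
      str += child.textContent.replaceAll(keyword, substitute);
    } else if (child.nodeType === 1) {
      // element: only its text content is touched, so attributes such as href stay intact
      // (note: markup nested inside the element is flattened to plain text)
      let nodeStr = child.textContent.replaceAll(keyword, substitute);
      child.innerHTML = nodeStr;
      str += child.outerHTML;
    }
  }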
  return str;
}

let result = replaceKeyword('twitter', text);
console.log(result);
document.body.innerHTML = result;

Peter Seliger · Answer 2 · 2021-04-17T11:05:32.583

With the latest additions to the requirements, the OP entirely changed the game. The task is now a full-text search within the text content of HTML markup.

Something similar to ...

How to highlight the search-result of a text-query within an html document ignoring the html tags?
Markdown-like functionality for tooltips ... or ... How to query text-nodes from DOM, find markdown-patterns, replace matches with HTML-markup and replace the original text-node with the new content?
What is a good enough approach for writing real-time text search and highlight functionality which does not break the order of text- and element-nodes

... with the last two providing different but generic approaches based on DOM nodes and text nodes.

As for the OP's problem: with requirements like finding a text query within the text content of HTML code, one cannot stick to a simple solution. One now has to assume nested markup.

Wrapping special markup around each search result has to start with collecting every single text node from the DOM fragment that was parsed from the passed HTML code.
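The collecting step can be sketched on its own, as a recursive walk instead of the `getElementsByTagName('*')` based approach used further below. The name `collectTextNodes` and the `SKIPPED_TAGS` set are hypothetical; the helper relies only on the `nodeType`, `childNodes`, `nodeValue` and `tagName` properties, so it should work for an element or a `template.content` fragment alike.

```javascript
// Hypothetical helper: gathers all non-blank text nodes below `root`
// in document order, descending into nested elements.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Assumption: text inside these elements should never be highlighted.
const SKIPPED_TAGS = new Set(['SCRIPT', 'STYLE']);

function collectTextNodes(root, list = []) {
  for (const child of Array.from(root.childNodes)) {
    if (child.nodeType === TEXT_NODE) {
      // ignore whitespace-only text nodes
      if (child.nodeValue.trim() !== '') {
        list.push(child);
      }
    } else if (
      child.nodeType === ELEMENT_NODE &&
      !SKIPPED_TAGS.has(child.tagName.toUpperCase())
    ) {
      collectTextNodes(child, list);
    }
  }
  return list;
}
```

In a browser one might call it as `collectTextNodes(template.content)` after assigning the markup to `template.innerHTML`. Attribute values such as `href` are never visited because they are not text nodes.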

With such a base, one can no longer just fire a regex-based String.replace at the markup. One now has to replace/reassemble each text node that partially matches the search query: the parts that did not match stay text nodes, and each matching part becomes an element node because of the additional markup that gets wrapped around it.
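The reassembly rests on how `String.prototype.split` behaves with a capturing group: the matched parts are kept in the result. A minimal sketch, where `partitionByMatch` is a hypothetical name and the pattern is assumed to contain exactly one capturing group around the whole (non-empty) query:

```javascript
// Splits `text` into its unmatched and matched parts, in original order.
// With exactly one capturing group, `split` returns the matches at the
// odd indices, so no extra `test` call is needed to classify a part.
function partitionByMatch(text, regXSearch) {
  return text
    .split(regXSearch)
    .map((part, idx) => ({ text: part, isMatch: idx % 2 === 1 }))
    // empty strings show up when matches touch each other or the text boundary
    .filter(({ text }) => text !== '');
}

console.log(partitionByMatch('https://www.twitter.com', /(twitter)/g));
```

Each part flagged with `isMatch` would then be turned into an element node and every other part into a new text node.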

Thus, just from the OP's last requirement change, one has to provide a generic full-text search and highlight approach which, of course, also has to sanitize/handle whitespace sequences and regex-specific characters within the provided search query ...

// node detection helpers.
function isElementNode(node) {
  return (node && (node.nodeType === 1));
}
function isNonEmptyTextNode(node) {
  return (
        node
    && (node.nodeType === 3)
    && (node.nodeValue.trim() !== '')
    && (node.parentNode.tagName.toLowerCase() !== 'script')
  );
}

// dom node render helper.
function insertNodeAfter(node, referenceNode) {
  const { parentNode, nextSibling } = referenceNode;

  if (nextSibling !== null) {
    node = parentNode.insertBefore(node, nextSibling);
  } else {
    node = parentNode.appendChild(node);
  }
  return node;
}

// text node reducer functionality.
function collectNonEmptyTextNode(list, node) {
  if (isNonEmptyTextNode(node)) {
    list.push(node);
  }
  return list;
}
function collectTextNodeList(list, elmNode) {
  return Array.from(
    elmNode.childNodes
  ).reduce(
    collectNonEmptyTextNode,
    list
  );
}
function getTextNodeList(rootNode) {
  rootNode = (isElementNode(rootNode) && rootNode) || document.body;

  const elementNodeList = Array.from(
    rootNode.getElementsByTagName('*')
  );
  elementNodeList.unshift(rootNode);

  return elementNodeList.reduce(collectTextNodeList, []);
}


// search result emphasizing functionality.

function createSearchMatch(text) {
  const elmMatch = document.createElement('strong');

  // elmMatch.classList.add("bold");
  elmMatch.textContent = text;

  return elmMatch;
}
function aggregateSearchResult(collector, text, idx) {
  const { previousNode, regXSearch } = collector;

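  // a global (`g`) regex keeps state in `lastIndex` between `test` calls;
  // reset it so that directly adjacent matches are not missed.
  regXSearch.lastIndex = 0;

  const currentNode = regXSearch.test(text)
    ? createSearchMatch(text)
    : document.createTextNode(text);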

  if (idx === 0) {
    previousNode.parentNode.replaceChild(currentNode, previousNode);
  } else {
    insertNodeAfter(currentNode, previousNode);
  }
  collector.previousNode = currentNode;

  return collector;
}
function emphasizeTextContentMatch(textNode, regXSearch) {
  // console.log(regXSearch);
  textNode.textContent
    .split(regXSearch)
    .filter(text => text !== '')
    .reduce(aggregateSearchResult, {
      previousNode: textNode,
      regXSearch,
    });
}


function emphasizeEveryTextContentMatch(htmlCode, searchValue, isIgnoreCase) {
  searchValue = searchValue.trim();
  if (searchValue !== '') {

    const replacementNode = document.createElement('div');
    replacementNode.innerHTML = htmlCode;

    const regXSearchString = searchValue
      // escaping of regex specific characters.
      .replace((/[.*+?^${}()|[\]\\]/g), '\\$&')
      // additional escaping of whitespace (sequences).
      .replace((/\s+/g), '\\s+');

    const regXFlags = `g${ !!isIgnoreCase ? 'i' : '' }`;
    const regXSearch = RegExp(`(${ regXSearchString })`, regXFlags);

    getTextNodeList(replacementNode).forEach(textNode =>
      emphasizeTextContentMatch(textNode, regXSearch)
    );
    htmlCode = replacementNode.innerHTML;
  }
  return htmlCode;
}


const htmlLinkList = [
  emphasizeEveryTextContentMatch(
    'Follow me on <a href="/redirect?uri=https%3A%2F%2Fwww.twitter.com&context=post" target="_blank" rel="noopener noreferrer">https://www.twitter.com/</a> Thanks!',
    'twitter'
  ),
  emphasizeEveryTextContentMatch(
    'Follow me on <a href="/redirect?uri=https%3A%2F%2Fwww.twitter.com&context=post" target="_blank" rel="noopener noreferrer">https://www.twitter.com/</a> Thanks!',
    'twitter.com'
  ),
  emphasizeEveryTextContentMatch(
    'Follow me on <a href="/redirect?uri=https%3A%2F%2Fwww.twitter.com&context=post" target="_blank" rel="noopener noreferrer">https://www.twitter.com/</a> Thanks!',
    'https://www.twitter.com/'
  ),
  emphasizeEveryTextContentMatch(
    'Follow me on <a href="/redirect?uri=https%3A%2F%2Fwww.twitter.com&context=post" target="_blank" rel="noopener noreferrer">https://www.twitter.com/</a> Thanks for follow my Twitter!',
    'TWITTER',
    true
  ),
  emphasizeEveryTextContentMatch(
    `Follow me on <a href="/redirect?uri=https%3A%2F%2Fwww.twitter.com&context=post" target="_blank" rel="noopener noreferrer">https://www.twitter.com/</a>
    Thanks
    for follow 
    my   Twitter!`,
    'follow my twitter',
    true
  ),
];
document.body.innerHTML = htmlLinkList.join('<br/>');

const container = document.createElement('code');

container.textContent = emphasizeEveryTextContentMatch(
  'Follow me on <a href="/redirect?uri=https%3A%2F%2Fwww.twitter.com&context=post" target="_blank" rel="noopener noreferrer">https://www.twitter.com/</a> Thanks for follow my Twitter!',
  'TWITTER',
  true
);
document.body.appendChild(container.cloneNode(true));

container.textContent = emphasizeEveryTextContentMatch(
  `Follow me on <a href="/redirect?uri=https%3A%2F%2Fwww.twitter.com&context=post" target="_blank" rel="noopener noreferrer">https://www.twitter.com/</a>
  Thanks
  for follow 
  my   Twitter!`,
  'follow my twitter',
  true
);
document.body.appendChild(container.cloneNode(true));

code {
  display: block;
  margin: 10px 0;
  padding: 0
}
a strong {
  font-weight: bold;
}
.as-console-wrapper { min-height: 100%!important; top: 0; }
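The query sanitizing done inside `emphasizeEveryTextContentMatch` can also be tried in isolation, without any DOM. The following is a minimal sketch; the function name `buildSearchRegex` and the `null` return value for a blank query are assumptions made for illustration.

```javascript
// Hypothetical helper: turns a raw search query into a global regex
// with one capturing group around the whole (escaped) query.
function buildSearchRegex(query, isIgnoreCase = false) {
  const trimmed = query.trim();
  if (trimmed === '') {
    // assumption: a blank query means "nothing to search for"
    return null;
  }
  const source = trimmed
    // escape characters that carry a special meaning in a regex
    .replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
    // let any whitespace in the query match any whitespace run in the text
    .replace(/\s+/g, '\\s+');

  return new RegExp(`(${source})`, isIgnoreCase ? 'gi' : 'g');
}

const regXSearch = buildSearchRegex('  follow my twitter ', true);
console.log('Thanks\n for follow \n my   Twitter!'.match(regXSearch));
```

Escaping matters for queries like `twitter.com`: without it the dot would match any character, so `twitterXcom` would be highlighted as well.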
