Regular expression for finding all sub strings which are NOT inside a HTML link element

Question

Assuming I have the following HTML snippet:

Deep Work 
<p>
<a data-href="Deep Work" href="Deep Work" class="internal-link" 
target="_blank" rel="noopener">Deep Work</a>
</p>
Deep Work
<a href="blabla">Some other text</a>

Which regular expression will only match the two "Deep Work" text snippets which are located completely outside of the a-blocks? So, only the ones marked as yellow in this screenshot (not the red ones):

I tried multiple approaches, but always ended up getting a match for the last red one. Which I need to avoid. Thus I would appreciate any help from the community. Thanks!

Update: Unfortunately I simplified the HTML code above too much, using line breaks, to get it readable in StackOverflow. Here is better use case:

<p><a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a> Deep Work <a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a> Deep Work </p>

Again only the two "Deep Work" mentions outside any A-block should be found by the RegExp.

The task described by the OP is nothing that should be solved by regex (nor can a pure regex based approach assure 100% reliability on that matter). The OP should consider a [`DOMParser`](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser) based approach. — Peter Seliger, Mar 05 '22 at 10:59
The last edit, which provides the one-liner markup, implicitly changes the requirements for it is not equal to the before provided formatted html code. There is a difference in matching just all first level text node values (formatted code example) and matching any text node value which is not part of an `` element (the one-liner markup). — Peter Seliger, Mar 05 '22 at 12:53
The most important answer related to this question: https://stackoverflow.com/a/1732454/1243641 — Scott Sauyet, Mar 06 '22 at 00:00
@Fabian ... Regarding all the comments and answers/approaches are there any questions left? — Peter Seliger, Mar 07 '22 at 09:03

Peter Seliger · Accepted Answer · 2022-03-05T12:54:21.300

»Again only the two "Deep Work" mentions outside any A-block should be found by the RegExp.«

Since the OP's example clearly shows, that the OP wants to match any node value (or text content) from just the first-level text-nodes, a solution based on DOMParser.parseFromString might look similar to the next provided example code ...

const sampleMarkup =
  `Deep Work 1
  <p>
    <a data-href="Deep Work" href="Deep Work" class="internal-link" 
    target="_blank" rel="noopener">Deep Work</a>
  </p>
  Deep Work 2
  <a href="blabla">Some other text</a>`;

console.log(
  Array
    // make an array from ...
    .from(
      (new DOMParser)
        // ... a parsed document ...
        .parseFromString(sampleMarkup, 'text/html')
        // ... body's ...
        .querySelector('body')
        // ... child nodes ...
        .childNodes
    )
    // ... and filter just the first level text nodes in order to ...
    .filter(node => node.nodeType === 3)
    // ... retrieve each matching text node's sanitized/trimmed text content.
    .map(node => node.nodeValue.trim())
);

.as-console-wrapper { min-height: 100%!important; top: 0; }

From the above comments ...

»The last edit, which provides the one-liner markup, implicitly changes the requirements for it is not equal to the before provided formatted html code. There is a difference in matching just all first level text node values (formatted code example) and matching any text node value which is not part of an <a/> element (the one-liner markup).«

... but as already mentioned ...

»The task described by the OP is nothing that should be solved by regex (nor can a pure regex based approach assure 100% reliability on that matter). The OP should consider a DOMParser based approach.«

... due to a reliable approach, the refactoring can be achieved pretty easily ...

// `<p>
//   <a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">
//     Deep Work
//   </a>
//   Deep Work 1
//   <a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">
//     Deep Work
//   </a>
//   Deep Work 2
// </p>`

const sampleMarkup = '<p><a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a>Deep Work 1<a data-href="Deep Work" href="Deep Work" class="internal-link" target="_blank" rel="noopener">Deep Work</a>Deep Work 2</p>';

function collectTextNodes(textNodeList, node) {
  const nodeType = node?.nodeType;
  if (nodeType === 1) {

    [...node.childNodes]
      .reduce(collectTextNodes, textNodeList);

  } else if (nodeType === 3) {

    textNodeList.push(node);
  }
  return textNodeList;
}

console.log(
  // ... collect any text node from within ...
  collectTextNodes(
    [],
    (new DOMParser)
      // ... a parsed document's ...
      .parseFromString(sampleMarkup, 'text/html')
      // ... body ...
      .querySelector('body')
  )
  // ... and filter any text node which is not located within an `<a/>` element ...
  .filter(textNode =>
    textNode.parentNode.closest('a') === null
  )
  // ... and retrieve each matching text node's sanitized/trimmed text content.
  .map(node =>
    node.nodeValue.trim()
  )
);

.as-console-wrapper { min-height: 100%!important; top: 0; }

Regular expression for finding all sub strings which are NOT inside a HTML link element

1 Answers1

Linked