Struggling with limiting a RegExp negative lookahead

Question

I've got a paragraph whose innerHTML contains text, and some of the words in it are anchor links. I want to pick out word matches that aren't contained within anchor links (enclosed in anchor tags), but I'm struggling with the RegExp. My negative lookahead

example(?!.+\</a>)

isn't stopping when it encounters the start of another anchor link (i.e. <a), so every word is seen as being inside anchor tags, because eventually there is always a </a>.

How do I make a RegExp negative lookahead look for a </a> but stop when it encounters <a?

https://regex101.com/r/HTOgkG/1
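To make the symptom concrete, here is a minimal sketch of the behaviour described above. The sample string is made up for illustration, and the `/` in the pattern is escaped only because it is written as a JavaScript regex literal.

```javascript
// Hypothetical sample text: the word sits outside any link, but a link follows it.
const html = 'An example here, then <a href="#">a link</a>.';

// The pattern from the question.
const pattern = /example(?!.+<\/a>)/;

// `.+` is free to run across the later `<a href="#">`, so the lookahead
// still finds a `</a>` and the negative assertion rejects the only occurrence.
const found = pattern.test(html);
console.log(found); // false
```

Nothing in `.+` knows about `<a`, which is why the word is rejected even though it is not inside a link.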

Re the dupetarget that's been picked: In your case, you don't have to worry about the parsing part of it, because it's **already** parsed. — T.J. Crowder, Dec 16 '19 at 12:50

T.J. Crowder · Answer 1 · 2019-12-16T10:47:35.677

Don't use regular expressions to parse HTML; HTML is far too complex for that.

You've said your starting point is a paragraph element. That means you already have a nicely parsed version of what you want to search. Look through the paragraph's descendant nodes for Text nodes. For each Text node, see if it contains the word or words you're looking for, then look at its parentNode.tagName to see if it's in an a element (perhaps looping through the parents to handle the <a href="#xyz"><span>target word</span></a> case).

For example, here my target word is "example":

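function findMatches(target, para, element = para) {
    let child = element.firstChild;
    while (child) {
        if (child.nodeType === 3) {
            // Text node: report it unless it sits inside an <a> within the paragraph
            if (child.nodeValue.includes(target)) {
                const a = child.parentNode.closest("a");
                if (!a || !para.contains(a)) {
                    console.log(`Found in '${child.nodeValue}'`);
                }
            }
        } else if (child.nodeType === 1) {
            // Element node: recurse so text nested in descendants is checked too
            findMatches(target, para, child);
        }
        child = child.nextSibling;
    }
}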

findMatches("example", document.getElementById("theParagraph"));

<p id="theParagraph">This example matches, but <a href="#">this example</a> and <a href="#"><span>this example</span></a> don't match.</p>

That example uses ES2015+ features and modern browser features like closest, but can be written in ES5 (and closest can be polyfilled).
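As a rough sketch of what an ES5-style version might look like, the variant below replaces closest with a manual walk up the parentNode chain and collects the matching text instead of logging it. The names findMatchesES5 and isInsideAnchor are made up for this illustration; it relies only on standard DOM node properties (nodeType, nodeValue, tagName, parentNode, firstChild, nextSibling).

```javascript
// Hypothetical helper standing in for closest("a"): walks up from `node`
// and reports whether an <a> element is found before reaching `root`.
function isInsideAnchor(node, root) {
    var current = node.parentNode;
    while (current && current !== root) {
        if (current.nodeType === 1 && current.tagName.toLowerCase() === "a") {
            return true;
        }
        current = current.parentNode;
    }
    return false;
}

// ES5-style variant: returns the text of every Text node under `root`
// that contains `target` and is not inside a link within `root`.
function findMatchesES5(target, root, element, results) {
    element = element || root;
    results = results || [];
    var child = element.firstChild;
    while (child) {
        if (child.nodeType === 3) {
            if (child.nodeValue.indexOf(target) !== -1 && !isInsideAnchor(child, root)) {
                results.push(child.nodeValue);
            }
        } else if (child.nodeType === 1) {
            // Recurse into elements so nested text is checked as well
            findMatchesES5(target, root, child, results);
        }
        child = child.nextSibling;
    }
    return results;
}

// In a browser, assuming the paragraph above:
// findMatchesES5("example", document.getElementById("theParagraph"));
```

Stopping the upward walk at the paragraph means links outside the paragraph are ignored, as in the version above.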

Simon · Answer 2 · 2019-12-16T11:19:25.370

<\s*a\s*[^<]*?>[^>]*>?<\s*\/a\s*>

the example
It just removes everything between <a> and </a>; it doesn't remove the punctuation.

[Update] Now it will not be stopped by a stray <a or by an unclosed <a> </a>.
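As a rough check of how this pattern behaves, the sketch below applies it to the paragraph markup from Answer 1 and counts the occurrences of "example" that remain. The variable names are illustrative and the global flag is an addition, neither is part of the original answer. On this input the flat link is stripped but the link that wraps a <span> is left in place, so two occurrences survive instead of one.

```javascript
// Pattern from this answer, with the global flag added so every link is tried.
const anchorPattern = /<\s*a\s*[^<]*?>[^>]*>?<\s*\/a\s*>/g;

// Paragraph content taken from Answer 1.
const paragraphHtml =
    'This example matches, but <a href="#">this example</a> and ' +
    '<a href="#"><span>this example</span></a> don\'t match.';

// Remove whatever the pattern recognises as a link, then count the word.
const stripped = paragraphHtml.replace(anchorPattern, "");
const remaining = (stripped.match(/example/g) || []).length;

console.log(stripped);
console.log(remaining); // 2: the link containing a <span> was not removed
```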

edited Dec 16 '19 at 11:19

answered Dec 16 '19 at 10:47

...and it fails as soon as you have an attribute on the `a` tag containing a `>` character. Which is why you [don't try to use a simple regex to parse HTML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454). Every time someone thinks "Yeah, but my example is really well-contained and simple." it ends up breaking. Every. Single. Time. – T.J. Crowder Dec 16 '19 at 10:49
Yes, I agree you should not try to use regex to parse HTML; this is correct if the context **IS** HTML. If it's just a piece of text that contains a limited set of tags, then maybe regex can help. – Simon Dec 16 '19 at 11:30
No, again, it isn't correct even for valid HTML: `...` fails, but is perfectly valid HTML. – T.J. Crowder Dec 16 '19 at 12:49
It works here: https://regex101.com/r/FDOlET/3. It removed everything you gave. – Simon Dec 16 '19 at 12:57
Okay, so *this specific one* isn't susceptible to *that specific way* these usually fail. It's susceptible to this one instead: `Google` and probably several of the other usual ways these things fail (including the ways you mention in your update). Again, this isn't in some way a radical notion. You **cannot** reliably deal with HTML with a simple regex. It's just not reliable. – T.J. Crowder Dec 16 '19 at 13:07
@T.J.Crowder totally agreed, it should not be used to parse HTML, it can only be used to do simple jobs in restricted context. – Simon Dec 16 '19 at 13:13
No, it can't, reliably. Again, this is not a controversial view. – T.J. Crowder Dec 16 '19 at 13:15
