Regex excluding matches contained within a HTML tag

Question

I'm trying to write a regex that matches content within an HTML document but excludes matches that fall inside a tag itself. Consider the following:

<p>Here is some sample text for my widgets</p>
<a href="http://mywidgets.nowhere">Click here to view my widgets</a>

I would like to match 'widgets' so that I can replace it with a different string, say 'green box', without replacing the match within the url.

Matching 'widgets' on its own is easy, but I'm struggling to add an exclusion for when 'widgets' appears between the '<' and '>' of a tag.

My current workings: as a first step I have tried to match 'widgets' when it is contained within '<>' (I can then turn this into an exclusion later). However, the pattern below seems to match the whole document, even though I have excluded the closing > to make sure 'widgets' appears within a tag.

<.*[^>]widgets.*[^<]>+

It's probably down to lazy / greedy, but I can't quite work it out!

[H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) - use a parser. Also what language? Because it's easy to do this in .net since it allows variable width lookbehinds: [`(?<!<[^>]*)widgets`](http://regexstorm.net/tester?p=%28%3f%3c!%3c%5b%5e%3e%5d*%29widgets&i=%3cp%3eHere+is+some+sample+text+for+my+widgets%3c%2fp%3e%0d%0a%3ca+href%3d%22http%3a%2f%2fmywidgets.nowhere%22%3eClick+here+to+view+my+widgets%3c%2fa%3e)) — ctwheels, Jan 12 '18 at 14:14
@ctwheels I'm using c# .net and that regex works too, cheers! — Radderz, Jan 12 '18 at 14:27
Well that's super lucky haha I'll post as an answer. That was a total shot in the dark. — ctwheels, Jan 12 '18 at 14:29
@ctwheels wow, I didn't know there was a language that did allow them. My first thought on reading the problem was actually "well, obviously not look behind because we don't know the length" :) — Eily, Jan 12 '18 at 14:34
@Eily you can use variable length lookbehinds in [tag:.net] and [tag:JGsoft]. [tag:Java] also *somewhat* allows them, but you can't use `*` or `+` (so you can do `(?<!.{x,y})`) — ctwheels, Jan 12 '18 at 14:36
@ctwheels Those are languages I never used, or didn't use in a while in the case of Java. So still no variable length lookbehinds for me :D — Eily, Jan 12 '18 at 14:38
[You should probably not be using regular expressions](http://www.htmlparsing.com/regexes.html) — Andy Lester, Jan 12 '18 at 15:27

score 3 · Accepted Answer · edited Jun 20 '20 at 09:12

Overview

By no means is this a great answer since it's parsing HTML with regex, but it does work for the test case given by the OP.

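See [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) for more information.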

Code

See regex in use here

(?<!<[^>]*)widgets

Explanation

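`(?<!<[^>]*)` Negative lookbehind ensuring that what precedes is not `<` followed by any number of characters other than `>` (in other words, the match does not sit inside a tag that has been opened but not yet closed)
`widgets` Match this literally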
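
Since the OP confirmed they are on C#/.NET, the pattern above could be applied roughly as follows. This is only a sketch: the class name `LookbehindDemo` and the helper `ReplaceOutsideTags` are made up for illustration, and it assumes the markup never contains a literal `>` inside an attribute value.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookbehindDemo
{
    // Hypothetical helper (not part of the answer): replaces `word` only where
    // it is not preceded by a '<' that has no '>' after it, i.e. outside a tag.
    public static string ReplaceOutsideTags(string html, string word, string replacement)
    {
        // Regex.Escape keeps any metacharacters in `word` literal.
        string pattern = @"(?<!<[^>]*)" + Regex.Escape(word);
        // A MatchEvaluator inserts `replacement` verbatim ('$' is not special).
        return Regex.Replace(html, pattern, m => replacement);
    }

    static void Main()
    {
        // Sample markup from the question.
        string html =
            "<p>Here is some sample text for my widgets</p>\n" +
            "<a href=\"http://mywidgets.nowhere\">Click here to view my widgets</a>";

        Console.WriteLine(ReplaceOutsideTags(html, "widgets", "green box"));
        // <p>Here is some sample text for my green box</p>
        // <a href="http://mywidgets.nowhere">Click here to view my green box</a>
    }
}
```

The URL inside `href` is left alone because the lookbehind finds the `<` of `<a` with no `>` between it and the word.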

score 0 · Answer 2 · answered Jan 12 '18 at 14:27

This may partially work:

(?:^|>)[^<]*widgets

This starts matching at the start of a line (if the /m flag is used) or at the end of a tag (so we know we are not inside one), then consumes as many characters as possible that are not <, so no new tag can open, before looking for widgets.

The issues with this are that it may give odd results if a > appears inside a tag (e.g. in JavaScript) or if a single tag spans multiple lines, and that it won't find several instances of "widgets" in the same run of text. To solve those issues, you'd be better off using an actual XML parser, as advised by ctwheels.
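
To see how this behaves in the OP's environment, here is a rough C# sketch. The capture group around the prefix is an addition that is not in the pattern above (without it the replacement would also swallow the text before "widgets"), and the names `TextRunDemo` and `ReplaceLastPerRun` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class TextRunDemo
{
    // The answer's pattern, with a capture group added (an assumption) so the
    // text consumed before "widgets" can be put back during replacement.
    const string Pattern = @"((?:^|>)[^<]*)widgets";

    public static string ReplaceLastPerRun(string html, string replacement)
    {
        return Regex.Replace(
            html,
            Pattern,
            m => m.Groups[1].Value + replacement,
            RegexOptions.Multiline); // lets ^ match at the start of every line
    }

    static void Main()
    {
        // Sample markup from the question.
        string html =
            "<p>Here is some sample text for my widgets</p>\n" +
            "<a href=\"http://mywidgets.nowhere\">Click here to view my widgets</a>";
        Console.WriteLine(ReplaceLastPerRun(html, "green box"));

        // Limitation: [^<]* is greedy, so only the last occurrence in a run
        // of text is replaced.
        Console.WriteLine(ReplaceLastPerRun("<p>widgets and widgets</p>", "green box"));
        // <p>widgets and green box</p>
    }
}
```

The second call illustrates the "several instances in the same substring" caveat: the first "widgets" in the paragraph is left untouched.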
