Let me start by saying that I know you should never parse HTML with regex. I'm not, I just have a corner-case where the comments are finding their way into my content, and unfortunately I can't change that.

I have been wracking my brain trying to come up with a regex pattern for .NET that will match anything that is not part of an HTML comment. For example:

foo<!--abc-->bar

Should match "foo" and "bar".

foobar

Should match "foobar" (there is no comment, so match everything).

<!--foo-->

Should not produce any matches because there is nothing that's not in the comment.

I can match comments easily enough with the regex <!--.*?-->, but my program specifications don't let me simply strip them out in this case; I need to match whatever is not in a comment. Every way I've been able to think of or find online to exclude the comments ends up either selecting everything altogether (because then the start and end of the comments aren't the start and end of the match) or finding undesired matches. For example:

foo<!--abc-->bar

Using the regex ((?!<!--.*?-->).)* (simply negating the regex for finding comments by using a negative lookahead), I get 4 matches: the first is the correctly matched "foo", but then the second and fourth matches show as blank strings (I'm not sure why), and the third match is "!--abc-->bar", because simply dropping the "<" technically satisfies the condition. Making the last * quantifier lazy seems to make it even worse, returning 17 blank string matches. I've tried a few other approaches, like using negative lookarounds to exclude the comments, but they've all fallen prey to similar problems that I'm not sure how to solve.
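
For reference, a small reproduction of the result above. This is only a sketch: it uses Python's `re` module as a stand-in for .NET (an assumption; the helper name `all_matches` is made up for illustration), and on Python 3.7 or later it reports the same four matches for the greedy pattern. The lazy variant is left out because Python and .NET treat its empty matches differently.

```python
import re

# The pattern from the paragraph above: a negative lookahead that is
# re-checked before every single character, repeated zero or more times.
NEGATED = re.compile(r"((?!<!--.*?-->).)*")

def all_matches(text):
    """Return the full text of every match, including empty ones."""
    return [m.group(0) for m in NEGATED.finditer(text)]

# Position 3 is where "<!--abc-->" begins, so the lookahead blocks it.
# Because "*" may repeat zero times, an empty match is still reported there,
# and again at the very end of the input.
print(all_matches("foo<!--abc-->bar"))
```

The lookahead only rejects the one position where a complete comment begins; from the next character on, "!--abc-->bar" no longer looks like a comment, which is how it ends up as a match.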

I also tried the regex from this question's accepted answer: Regex to strip anything that isn't an html comment; but unfortunately this includes the <!-- and --> parts of the comment in the matches, and if I'm reading it right, I don't think it will match a string that has no comment in it. I attempted to modify this to solve these issues for my use case, but haven't had any success with that...

EDIT

After taking a step back from the problem and re-thinking my needs, I realized that I don't actually need to match all text that's not part of a comment. I really just need to know if there is any non-whitespace text that isn't part of a comment, anywhere in the content, using the Regex.IsMatch method with the Singleline option. For that purpose, the following regex should do the trick:

(?!^(\s*<!--([^-]*|-[^-]*|--[^>]*)-->\s*)+$)^.*\S.*$
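
A quick way to exercise this pattern against the sample inputs, sketched in Python rather than .NET (an assumption made so the check is easy to run). Here `re.search` with `re.DOTALL` stands in for `Regex.IsMatch` with `RegexOptions.Singleline`, and the function name `has_text_outside_comments` is hypothetical.

```python
import re

# Same pattern as above; re.DOTALL plays the role of the Singleline option.
HAS_TEXT = re.compile(
    r"(?!^(\s*<!--([^-]*|-[^-]*|--[^>]*)-->\s*)+$)^.*\S.*$",
    re.DOTALL,
)

def has_text_outside_comments(content):
    """True if the content is more than just comments and whitespace."""
    return HAS_TEXT.search(content) is not None

for sample in ["foo<!--abc-->bar", "foobar", "<!--foo-->", "   "]:
    print(repr(sample), has_text_outside_comments(sample))
```

In this sketch the three examples from the question give True, True and False, and whitespace-only input gives False. A comment whose body has a hyphen after other text, such as `<!--a-b-->`, is not recognized by the comment sub-pattern and is reported as containing text, so that case may be worth testing in .NET too.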

Since this drastically changes the question and immediately answers it, I'm not exactly sure what the correct protocol is now... But unless something better is proposed, I suppose I'll leave the question open for a few days in case anyone happens to find a fault in my regex, and if no one does I'll just self-answer and close the question.

jdawkins
  • Well, it would be easier if you could split with `(?s)` – Wiktor Stribiżew Feb 07 '17 at 19:06
  • If you implement PCRE.NET, you will be able to use `(*SKIP)(*F)`. With .NET native regex, you cannot do that. – Wiktor Stribiżew Feb 07 '17 at 19:20
  • If there are no tags in your input how about something like [`[^>]+(?=<!|$)`](http://www.regexstorm.net/tester?p=%5b%5e%3e%5d%2b%28%3f%3d%3c!%7c%24%29&i=foo%3c!--abc--%3ebar%0d%0afoobar%0d%0a%3c!--foo--%3e%0d%0afoo%3c!--abc--%3ebar) or if there are, something like [`(?:[^>]|(? – bobble bubble Feb 07 '17 at 22:50
  • @bobblebubble Those do work pretty well, though there are a few corner cases where they have trouble (obviously you can't use the first one if ">" symbols might appear in the input, tags or no, and both have trouble with certain cases that look like part of a comment but may not be), but most uses probably won't run into those issues often if ever. This may be worth making into an answer, though I'm not sure how to handle it after my edit to the question... – jdawkins Feb 08 '17 at 15:29
  • @jdawkins those are of poor performance. Probably your best bet would be to [match what you don't want, but capture what you need](http://www.rexegg.com/regex-best-trick.html#thetrick). See [captures](http://www.regular-expressions.info/brackets.html) of first group for [`(?s)|((?:(?! – bobble bubble Feb 08 '17 at 21:09
  • @bobblebubble That's an awesome trick that does solve the original question, and somehow I've never managed to come across it before. Post it as an answer and I'd say the points are yours. – jdawkins Feb 13 '17 at 21:11

1 Answer

If matching and capturing gets complicated, in some cases a simple "trick" can help:
Match what you don't want (on the left side of an alternation) | or capture what you need.

What you don't want are the comments: <!--.*?-->

Or capture any run of characters that does not start a comment: |((?:(?!<!--).)+)
(the lookahead prevents skipping over <!--), then grab the captures of the first capturing group.

(?s)<!--.*?-->|((?:(?!<!--).)+)

Used (?s) for single-line mode (dot also matches newlines). See this demo at regexstorm.
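
A minimal sketch of how the captures could be collected. It is written in Python as a stand-in for .NET (an assumption), and the helper name `text_outside_comments` is made up for illustration. The pattern is the one shown above; only matches where group 1 actually took part are kept.

```python
import re

# Left side of the alternation: whole comments, matched and then ignored.
# Right side: group 1 captures a run of characters that stops before "<!--".
TRICK = re.compile(r"(?s)<!--.*?-->|((?:(?!<!--).)+)")

def text_outside_comments(content):
    """Return every group-1 capture, i.e. the text that is not in a comment."""
    return [
        m.group(1)
        for m in TRICK.finditer(content)
        if m.group(1) is not None  # None means this match was a comment
    ]

print(text_outside_comments("foo<!--abc-->bar"))  # ['foo', 'bar']
print(text_outside_comments("foobar"))            # ['foobar']
print(text_outside_comments("<!--foo-->"))        # []
```

In .NET the same idea would presumably be a loop over `Regex.Matches` that keeps `Groups[1].Value` whenever `Groups[1].Success` is true.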

In PCRE regex it could be done without capturing groups by use of (*SKIP)(*F) verbs (demo).

bobble bubble