Let me start by saying that I know you should never parse HTML with regex. I'm not; I just have a corner case where comments are finding their way into my content, and unfortunately I can't change that.
I have been racking my brain trying to come up with a regex pattern for .NET that will match anything that is not part of an HTML comment. For example:
foo<!--abc-->bar
Should match "foo" and "bar".
foobar
Should match "foobar" (there is no comment, so match everything).
<!--foo-->
Should not produce any matches because there is nothing that's not in the comment.
I can match comments easily enough with the regex <!--.*?-->, but my program specifications don't allow me to simply strip them out in this case; I need to match whatever is not in a comment. Every way I've been able to think of or find online to exclude the comments either ends up selecting everything altogether (because then the start and end of the comments aren't the start and end of the match) or finds undesired matches. For example:
foo<!--abc-->bar
Using the regex ((?!<!--.*?-->).)* (simply negating the regex for finding comments by using a negative lookahead), I get 4 matches: the first is the correctly matched "foo", but the second and fourth matches are blank strings (I'm not sure why), and the third match is "!--abc-->bar", because simply dropping the "<" technically satisfies the condition. Making the last * quantifier lazy seems to make it even worse, returning 17 blank string matches. I've tried a few other approaches, like using negative lookarounds to exclude the comments, but they've all fallen prey to similar problems that I'm not sure how to solve.
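The behavior described above can be reproduced with a minimal console sketch like the following. The class name NegatedCommentDemo is only illustrative; the two patterns and the input string are the ones from the question.

```csharp
using System;
using System.Text.RegularExpressions;

class NegatedCommentDemo
{
    static void Main()
    {
        const string input = "foo<!--abc-->bar";

        // Greedy version: consume characters as long as a complete
        // comment does not start at the current position.
        foreach (Match m in Regex.Matches(input, @"((?!<!--.*?-->).)*"))
        {
            Console.WriteLine($"index {m.Index}: \"{m.Value}\"");
        }
        // Prints:
        //   index 0: "foo"
        //   index 3: ""
        //   index 4: "!--abc-->bar"
        //   index 16: ""

        // Lazy version: every match is empty, one per position 0..16.
        int lazyCount = Regex.Matches(input, @"((?!<!--.*?-->).)*?").Count;
        Console.WriteLine(lazyCount); // 17
    }
}
```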
I also tried the regex from the accepted answer to the question "Regex to strip anything that isn't an html comment", but unfortunately it includes the <!-- and --> parts of the comment in the matches, and if I'm reading it right, I don't think it will match a string that has no comment in it. I attempted to modify it to solve these issues for my use case, but haven't had any success with that...
EDIT
After taking a step back from the problem and re-thinking my needs, I realized that I don't actually need to match all text that's not part of a comment. I really just need to know whether there is any non-whitespace text that isn't part of a comment, anywhere in the content, using the Regex.IsMatch method with the RegexOptions.Singleline option. For that purpose, the following regex should do the trick:
(?!^(\s*<!--([^-]*|-[^-]*|--[^>]*)-->\s*)+$)^.*\S.*$
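A minimal sketch of that check, for anyone who wants to probe the pattern for faults. The helper name HasNonCommentText and the sample strings beyond the three from the question are assumptions made for illustration; the pattern itself is copied verbatim.

```csharp
using System;
using System.Text.RegularExpressions;

class NonCommentTextCheck
{
    // Pattern copied verbatim from the question.
    const string Pattern =
        @"(?!^(\s*<!--([^-]*|-[^-]*|--[^>]*)-->\s*)+$)^.*\S.*$";

    // Hypothetical helper: true when the content appears to hold
    // non-whitespace text outside of HTML comments.
    static bool HasNonCommentText(string content) =>
        Regex.IsMatch(content, Pattern, RegexOptions.Singleline);

    static void Main()
    {
        string[] samples =
        {
            "foo<!--abc-->bar",      // True
            "foobar",                // True
            "<!--foo-->",            // False
            " <!--a-->\n<!--b--> ",  // False: only comments and whitespace
            "   ",                   // False: whitespace only
            "<!--a-b-->",            // True, although it is a lone comment
        };

        foreach (string sample in samples)
        {
            // Escape newlines so each result stays on one output line.
            string shown = sample.Replace("\n", "\\n");
            Console.WriteLine($"\"{shown}\" -> {HasNonCommentText(sample)}");
        }
    }
}
```

The last sample is the one to watch: a comment whose body contains a single interior hyphen is not recognized as a comment, because none of the three alternatives in the comment body can span "a-b", so the check reports non-comment text for it. That may be a fault in the pattern as written.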
Since this drastically changes the question and immediately answers it, I'm not exactly sure what the correct protocol is now... But unless something better is proposed, I suppose I'll leave the question open for a few days in case anyone happens to find a fault in my regex, and if no one does I'll just self-answer and close the question.