Almost always, when trying to parse HTML, Regex is not the answer. Regex is, as its name suggests, for matching regular languages. HTML is not a regular language.
Have a read of this answer: "RegEx match open tags except XHTML self-contained tags".
Instead, you are much better off using a tool that is designed for working with HTML. I'd suggest "HtmlAgilityPack" (available on NuGet under that name).
Here's how you could make it work.
First I'm going to create a simple function to anonymize text:
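using System;
using System.Linq;

// Keep separator characters (such as spaces) and mask everything else,
// using 'X' for upper-case characters and 'x' for the rest.
Func<string, string> anonymize =
    t => new String(
        t
            .ToCharArray()
            .Select(x =>
                Char.IsSeparator(x)
                    ? x
                    : (Char.IsUpper(x) ? 'X' : 'x'))
            .ToArray());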
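To make the masking rule concrete, here is a small stand-alone sketch that needs no HTML library. It shows that punctuation and digits are masked too, and that `Char.IsSeparator` does not cover control characters such as `'\n'`, so a line break inside a text node would also become an `x`. The `anonymizeKeepingWhitespace` variant is my own hypothetical alternative for that case, not something the code above depends on.

```csharp
using System;
using System.Linq;

// The rule described above: keep separators, mask everything else.
Func<string, string> anonymize =
    t => new String(
        t.Select(x => Char.IsSeparator(x) ? x : (Char.IsUpper(x) ? 'X' : 'x'))
         .ToArray());

// Hypothetical variant: keep every whitespace character
// (spaces, tabs, line breaks) instead of only Unicode separators.
Func<string, string> anonymizeKeepingWhitespace =
    t => new String(
        t.Select(x => Char.IsWhiteSpace(x) ? x : (Char.IsUpper(x) ? 'X' : 'x'))
         .ToArray());

Console.WriteLine(anonymize("Some random text"));                 // Xxxx xxxxxx xxxx
Console.WriteLine(anonymize("Hello, World 42!"));                 // Xxxxxx Xxxxx xxx
Console.WriteLine(anonymize("a\nb"));                             // xxx
Console.WriteLine(anonymizeKeepingWhitespace("a\nb") == "x\nx");  // True
```

Which variant is right depends on whether the line structure of the text matters to you; for single-line text nodes the two behave the same.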
Now I can use HtmlAgilityPack to do all of the rest of the heavy lifting:
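// Parse the markup into a DOM instead of pattern-matching it.
var html = new HtmlAgilityPack.HtmlDocument();
html.LoadHtml(@"<html>
<!-- comments here -->
<body>
<p>Some random text</p>
</body>
</html>");

// Select only the text nodes that contain something other than whitespace.
var textNodes =
    html
        .DocumentNode
        .Descendants()
        .OfType<HtmlAgilityPack.HtmlTextNode>()
        .Where(x => !String.IsNullOrWhiteSpace(x.Text))
        .ToArray();

// Replace each node's text in place; tags and comments are left untouched.
foreach (var textNode in textNodes)
{
    textNode.Text = anonymize(textNode.Text);
}

var output = html.DocumentNode.OuterHtml;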
The output I get is:
<html>
<!-- comments here -->
<body>
<p>Xxxx xxxxxx xxxx</p>
</body>
</html>
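One thing to watch for on real pages: as far as I know, HtmlAgilityPack also exposes the contents of `<script>` and `<style>` elements as text nodes, so the loop above would mask JavaScript and CSS as well. Below is a minimal sketch of one way to skip them, assuming the HtmlAgilityPack package is referenced. The `skippedParents` list and the sample markup are my own illustrative choices.

```csharp
using System;
using System.Linq;

// Same masking rule as the anonymize function above.
Func<string, string> anonymize =
    t => new String(
        t.Select(x => Char.IsSeparator(x) ? x : (Char.IsUpper(x) ? 'X' : 'x'))
         .ToArray());

// Assumption: text inside these elements is code or CSS rather than prose.
var skippedParents = new[] { "script", "style" };

var html = new HtmlAgilityPack.HtmlDocument();
html.LoadHtml(@"<html>
<head><script>var greeting = 'Hi';</script></head>
<body>
<p>Some random text</p>
</body>
</html>");

var textNodes =
    html
        .DocumentNode
        .Descendants()
        .OfType<HtmlAgilityPack.HtmlTextNode>()
        .Where(x => !String.IsNullOrWhiteSpace(x.Text))
        // Leave text alone when its parent element is on the skip list.
        .Where(x => !skippedParents.Contains(x.ParentNode.Name))
        .ToArray();

foreach (var textNode in textNodes)
{
    textNode.Text = anonymize(textNode.Text);
}

Console.WriteLine(html.DocumentNode.OuterHtml);
```

The same `Where` clause is a convenient place for any other exclusions you need, such as skipping `<pre>` or `<code>` blocks.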