How to get all the character contents of an html converted to "x" excluding comments

Question

I am trying to find a way to replace all the text content of an HTML source string with "x" while preserving case (uppercase characters become "X", lowercase characters become "x"). I can do it using Regex.Replace(), but that converts the tags as well. I would also like to exclude comments from the conversion. For instance, for an HTML string like:

 <html>
  <!-- comments here -->
  <body>
   <p>Some random text</p>
  </body>
 </html>

The output should be:

 <html>
  <!-- comments here -->
  <body>
   <p>Xxxx xxxxxx xxxx</p>
  </body>
 </html>

Thank you for your help.

No regex needed: `string input = "Some random text"; string output = string.Join("", input.Select(x => char.IsWhiteSpace(x) ? " " : char.IsUpper(x) ? "X" : "x"));` – jdweng, Jul 05 '15 at 00:45
Loop over the content character by character: if the character is `>`, start substituting from the next character; if it is `<`, stop substituting. While substitution is on, replace each lowercase character with `x` and each uppercase character with `X`; otherwise leave the character alone. Then move on to the next character. – ShellFish, Jul 05 '15 at 00:55
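
A minimal C# sketch of the loop ShellFish describes, written as top-level statements (C# 9 or later). `MaskOutsideTags` is a hypothetical name, and two details are assumptions that go beyond the comment: whitespace is preserved, and any text before the first tag is masked as well.

```csharp
using System;
using System.Text;

// Hypothetical helper: masks every non-whitespace character that sits outside <...>.
static string MaskOutsideTags(string html)
{
    var sb = new StringBuilder(html.Length);
    bool insideTag = false;   // true between '<' and the next '>'

    foreach (char c in html)
    {
        if (c == '<')
        {
            insideTag = true;    // stop substituting
            sb.Append(c);
        }
        else if (c == '>')
        {
            insideTag = false;   // resume substituting from the next character
            sb.Append(c);
        }
        else if (insideTag || char.IsWhiteSpace(c))
        {
            sb.Append(c);        // tag text, comment text and whitespace stay as they are
        }
        else
        {
            sb.Append(char.IsUpper(c) ? 'X' : 'x');
        }
    }

    return sb.ToString();
}

Console.WriteLine(MaskOutsideTags("<p>Some random text</p><!-- comments here -->"));
// prints: <p>Xxxx xxxxxx xxxx</p><!-- comments here -->
```

Because the loop only tracks `<` and `>`, a comment is left alone only as long as it contains no `>` of its own. Something like `<!-- a > b -->` would switch substitution back on too early, which is one reason the answers below recommend a real parser.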

score 2 · Accepted Answer · edited May 23 '17 at 10:26

Regex is almost never the answer when you are parsing HTML. As its name suggests, regex is for matching regular languages, and HTML is not a regular language.

Have a read of this answer:

RegEx match open tags except XHTML self-contained tags

Instead, you are much better off using a tool that is designed for working with HTML. I'd suggest HtmlAgilityPack, which is available on NuGet under that name.

Here's how you could make it work.

First I'm going to create a simple function to anonymize text:

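// Requires: using System; using System.Linq;
// Keeps whitespace, maps uppercase characters to 'X' and everything else to 'x'.
Func<string, string> anonymize =
    t => new String(
        t
            .Select(x =>
                // IsWhiteSpace also preserves tabs and newlines, which Char.IsSeparator does not
                Char.IsWhiteSpace(x)
                    ? x
                    : (Char.IsUpper(x) ? 'X' : 'x'))
            .ToArray());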

Now I can use HtmlAgilityPack to do all of the rest of the heavy lifting:

var html = new HtmlAgilityPack.HtmlDocument();

html.LoadHtml(@"<html>
  <!-- comments here -->
  <body>
   <p>Some random text</p>
  </body>
 </html>");

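// Comments are HtmlCommentNode instances, so OfType<HtmlTextNode>() leaves them out.
// Whitespace-only text nodes (the indentation between tags) are skipped as well.
var textNodes =
    html
        .DocumentNode
        .Descendants()
        .OfType<HtmlAgilityPack.HtmlTextNode>()
        .Where(x => !String.IsNullOrWhiteSpace(x.Text))
        .ToArray();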

foreach (var textNode in textNodes)
{
    textNode.Text = anonymize(textNode.Text);
}

var output = html.DocumentNode.OuterHtml;

The output I get is:

<html>
  <!-- comments here -->
  <body>
   <p>Xxxx xxxxxx xxxx</p>
  </body>
 </html>
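
If this is needed in more than one place, the steps above could be wrapped in a helper along these lines. This is only a sketch: `AnonymizeHtml` is a hypothetical name, the code is written as top-level statements (C# 9 or later) and needs the HtmlAgilityPack NuGet package, and leaving the text inside script and style elements untouched is an assumption that the question does not mention.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;   // NuGet package "HtmlAgilityPack"

// Hypothetical helper wrapping the load / select / replace steps.
static string AnonymizeHtml(string source)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(source);

    var textNodes = doc.DocumentNode
        .Descendants()
        .OfType<HtmlTextNode>()   // comment nodes are a different type, so they are skipped
        .Where(n => !string.IsNullOrWhiteSpace(n.Text))
        // Assumption: script and style bodies should stay as they are.
        .Where(n => n.ParentNode == null
                 || (n.ParentNode.Name != "script" && n.ParentNode.Name != "style"))
        .ToArray();

    foreach (var node in textNodes)
    {
        // Keep whitespace, map uppercase to 'X' and everything else to 'x'.
        node.Text = new string(node.Text
            .Select(c => char.IsWhiteSpace(c) ? c : char.IsUpper(c) ? 'X' : 'x')
            .ToArray());
    }

    return doc.DocumentNode.OuterHtml;
}

Console.WriteLine(AnonymizeHtml("<body><!-- keep me --><p>Some random text</p></body>"));
```

Materializing the nodes with `ToArray()` before the loop, as the original code does, avoids changing the document while it is still being enumerated.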

score 0 · Answer 2 · answered Jul 05 '15 at 01:22

I believe your best bet is to use an HTML parser. Have a look at the Html Agility Pack.

If you want a regex to replace the content between tags, try something like this:

<([a-zA-Z]+).*?>(.*?)</\1>

The second group captures the content between the tags. (In a regular, non-verbatim C# string literal, write the backreference as \\1.)
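
As a rough illustration of how that second group could be replaced while the tags are kept, here is a sketch using `Regex.Replace` with a match evaluator. `Mask` and `MaskElementContent` are hypothetical names, the pattern is written as a verbatim string (so the backreference is a single `\1`), and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: 'X' for uppercase, 'x' for any other non-whitespace character.
static string Mask(string text) =>
    new string(text.Select(c => char.IsWhiteSpace(c) ? c : char.IsUpper(c) ? 'X' : 'x').ToArray());

// Rebuilds each match with only group 2 (the element content) masked.
static string MaskElementContent(string html) =>
    Regex.Replace(html, @"<([a-zA-Z]+).*?>(.*?)</\1>", m =>
    {
        Group content = m.Groups[2];
        int start = content.Index - m.Index;   // offset of the content inside the match
        return m.Value.Substring(0, start)
             + Mask(content.Value)
             + m.Value.Substring(start + content.Length);
    });

Console.WriteLine(MaskElementContent("<p>Some random text</p>"));
// prints: <p>Xxxx xxxxxx xxxx</p>
```

This only behaves for simple cases. Without `RegexOptions.Singleline` the pattern matches elements that fit on one line, and any nested markup inside the captured content (for example `<p>Hi <b>there</b></p>`) is masked along with the text.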

If you want to remove or replace comments, use the following regex:

<!--(.*?)-->
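
A small sketch of that comment pattern in use, written as top-level statements (C# 9 or later). The sample string and variable names are made up for illustration, and `RegexOptions.Singleline` is added here as an assumption so that comments spanning several lines are matched too.

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<div><!-- first\n comment --><p>text</p><!-- second --></div>";

// RegexOptions.Singleline lets '.' match newlines, so multi-line comments are found.
var commentPattern = new Regex("<!--(.*?)-->", RegexOptions.Singleline);

// Removing the comments; pass a different replacement string to replace them instead.
string withoutComments = commentPattern.Replace(html, "");

Console.WriteLine(withoutComments);
// prints: <div><p>text</p></div>
```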

score -1 · Answer 3 · Amr Ayman

Here is a regex that ensures the match is not a tag or a comment (enable the "dot matches all" option, RegexOptions.Singleline in .NET, and replace group 3, which holds the content):

(?!<!--.+)<([a-zA-Z]+([a-zA-Z]|-)+)>(.*)</\1>

edited Jul 05 '15 at 03:01

answered Jul 05 '15 at 02:19


This doesn't match at all against the OP's source HTML. – Enigmativity Jul 05 '15 at 02:26
@Enigmativity: It does now. – Amr Ayman Jul 05 '15 at 03:02
