0

I am trying to find a way to replace all the text content of an HTML source string with "x" characters while preserving case (uppercase letters become "X", lowercase letters become "x"). I can do it using Regex.Replace(), but that converts the tags as well. I would also like to exclude comments from the conversion. For instance, given an HTML string like:

 <html>
  <!-- comments here -->
  <body>
   <p>Some random text</p>
  </body>
 </html> 

The output should be:

 <html>
  <!-- comments here -->
  <body>
   <p>Xxxx xxxxxx xxxx</p>
  </body>
 </html>

Thank you for your help.

Avinash Raj
KedarP
  • No Regex : string input = "Some random text"; string output = string.Join("",input.Select(x => char.IsWhiteSpace(x) ? " " : char.IsUpper(x) ? "X" : "x")); – jdweng Jul 05 '15 at 00:45
  • Loop over content char per char and do this: if char is `>` start substitution next char, if char is `<` stop substitution. If substitution is on, see if char is lower or upper and replace for corresponding `x` or `X`, else do nothing. Now get the next char. – ShellFish Jul 05 '15 at 00:55
  • Does this include the contents of attributes? – Enigmativity Jul 05 '15 at 01:42
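
The character-by-character scan suggested in the comments above can be sketched roughly as follows. This is a minimal illustration, not a robust HTML processor: the class name `TagAwareMasker` and method name `Mask` are made up for the example, and it assumes that no `<` or `>` appears inside comments, attribute values, or script blocks.

```csharp
using System;
using System.Text;

public static class TagAwareMasker
{
    // Replaces letters outside of <...> with 'X' (uppercase) or 'x' (lowercase).
    // Everything between '<' and '>' (tags and simple comments) is copied as-is.
    public static string Mask(string html)
    {
        var result = new StringBuilder(html.Length);
        bool insideTag = false;

        foreach (char c in html)
        {
            if (c == '<')
            {
                insideTag = true;          // stop substituting
                result.Append(c);
            }
            else if (c == '>')
            {
                insideTag = false;         // resume substituting after this char
                result.Append(c);
            }
            else if (insideTag)
            {
                result.Append(c);          // tag or comment content: untouched
            }
            else if (char.IsUpper(c))
            {
                result.Append('X');
            }
            else if (char.IsLower(c))
            {
                result.Append('x');
            }
            else
            {
                result.Append(c);          // whitespace, digits, punctuation
            }
        }

        return result.ToString();
    }
}
```

Calling `TagAwareMasker.Mask(html)` on the sample from the question leaves the tags and the comment intact and turns the paragraph text into `Xxxx xxxxxx xxxx`. A comment that itself contains `>` would switch substitution back on too early, which is one reason a real parser is usually the safer choice.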

3 Answers

2

Almost always, when trying to parse HTML, regex is not the answer. Regular expressions, as the name suggests, are meant for matching regular languages, and HTML, with its arbitrarily nested tags, is not a regular language.

Have a read of this answer:

RegEx match open tags except XHTML self-contained tags

Instead you are so much better off using a tool that is designed for working with HTML. I'd suggest using "HtmlAgilityPack" (which you can NuGet by that name).

Here's how you could make it work.

First I'm going to create a simple function to anonymize text:

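// Requires: using System; using System.Linq;
// Keeps whitespace as-is and replaces every other character with
// 'X' (uppercase) or 'x' (anything else).
// Char.IsWhiteSpace is used instead of Char.IsSeparator so that line
// breaks inside a text node are preserved as well.
Func<string, string> anonymize =
    t => new String(
        t
            .Select(x =>
                Char.IsWhiteSpace(x)
                    ? x
                    : (Char.IsUpper(x) ? 'X' : 'x'))
            .ToArray());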

Now I can use HtmlAgilityPack to do all of the rest of the heavy lifting:

var html = new HtmlAgilityPack.HtmlDocument();

html.LoadHtml(@"<html>
  <!-- comments here -->
  <body>
   <p>Some random text</p>
  </body>
 </html>");

var textNodes =
    html
        .DocumentNode
        .Descendants()
        .OfType<HtmlAgilityPack.HtmlTextNode>()
        .Where(x => !String.IsNullOrWhiteSpace(x.Text))
        .ToArray();

foreach (var textNode in textNodes)
{
    textNode.Text = anonymize(textNode.Text);
}

var output = html.DocumentNode.OuterHtml;

The output I get is:

<html>
  <!-- comments here -->
  <body>
   <p>Xxxx xxxxxx xxxx</p>
  </body>
 </html>
Community
Enigmativity
0

I believe your best bet is to use an HTML parser. Have a look at Html Agility Pack.

If you want a regex to replace the content between tags, try something like this:

<([a-zA-Z]+).*?>(.*?)</\\1>

The second group is the content between the tag.

If you want to remove or replace comments, use the following regex:

<!--(.*?)-->
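
One way the first pattern might be applied from C# is with a `MatchEvaluator` that rewrites only the second group and leaves the rest of the match alone. This is a minimal sketch under some assumptions: the names `RegexMasker` and `Mask` are invented for the example, the pattern is written in a verbatim string (so the backreference is `\1`), and it only handles elements that open and close on the same line without nested child tags.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexMasker
{
    // Group 1 is the tag name, group 2 is the content between the tags.
    private static readonly Regex Element =
        new Regex(@"<([a-zA-Z]+).*?>(.*?)</\1>");

    public static string Mask(string html)
    {
        return Element.Replace(html, m =>
        {
            Group content = m.Groups[2];

            // Offset of the content relative to the start of the whole match.
            int start = content.Index - m.Index;

            // Keep whitespace, map uppercase to 'X' and everything else to 'x'.
            string masked = new string(content.Value
                .Select(c => char.IsWhiteSpace(c) ? c : (char.IsUpper(c) ? 'X' : 'x'))
                .ToArray());

            // Rebuild the match: opening tag + masked content + closing tag.
            return m.Value.Substring(0, start)
                 + masked
                 + m.Value.Substring(start + content.Length);
        });
    }
}
```

Because `.` does not match line breaks by default, the multi-line `<html>` and `<body>` elements in the question's sample are skipped and only the `<p>` element is rewritten. The comment is left alone because `<!--` does not start with a letter. With nested tags on a single line, the inner tags would be masked too, which is the usual limitation of this approach.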
ANewGuyInTown
-1

Here is a regex that ensures the match is not a tag or a comment (enable the "dot matches newline" option, which is RegexOptions.Singleline in .NET, and replace group 3, the content between the tags):

(?!<!--.+)<([a-zA-Z]+([a-zA-Z]|-)+)>(.*)</\1>
Amr Ayman