1

I have a pretty simple regex question. My HTML tag looks like the following:

<body lang=EN-US link=blue vlink=purple>

I want to clear all attributes and just return <body>.

There are a number of other HTML tags whose attributes I'd like to clear, so I hope to reuse the solution. How can I do this with a regular expression? Thanks, B.

mgnoonan
bearaman

5 Answers

6

Use HtmlAgilityPack like this:

    // Requires the HtmlAgilityPack NuGet package.
    public string RemoveAllAttributesFromEveryNode(string html)
    {
        var htmlDocument = new HtmlAgilityPack.HtmlDocument();
        htmlDocument.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = htmlDocument.DocumentNode.SelectNodes("//*");
        if (nodes != null)
        {
            foreach (var eachNode in nodes)
                eachNode.Attributes.RemoveAll();
        }

        return htmlDocument.DocumentNode.OuterHtml;
    }

Call this method, passing in the HTML from which you want to remove all attributes.

HtmlAgilityPack will help you a lot with this.

Don't use a regex for HTML files that may contain scripts: in JavaScript, the characters < and > are operators, not tag delimiters. A regex will probably match these operators as if they were tags, which will corrupt the document.
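
To make the script problem concrete, here is a minimal sketch. The pattern is a deliberately naive one chosen for illustration (it is not taken from any answer here), and the input string is made up:

```csharp
using System;
using System.Text.RegularExpressions;

class ScriptPitfallDemo
{
    static void Main()
    {
        // Hypothetical input: inline JavaScript that uses < and > as comparison operators.
        string html = "<script>if (a<b && c>d) run();</script>";

        // Naive "strip attributes" pattern: a tag name, then anything up to the next '>'.
        string result = Regex.Replace(html, @"<([a-z]+)\b[^>]*>", "<$1>", RegexOptions.IgnoreCase);

        Console.WriteLine(result);
        // prints: <script>if (a<b>d) run();</script>
    }
}
```

The regex treats `<b && c>` as a tag named b and rewrites it, so the script's condition is silently changed.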

carla
Miguel Angelo
  • I wasn't aware of HTML AgilityPack. Added it to my project and it worked as you described. Thanks. B. – bearaman Apr 26 '12 at 09:00
3

Don't use regex to parse HTML - it is not a good tool for this. This is particularly true if you do not have control over the incoming format of the HTML.

Use the HTML Agility Pack for this instead.

It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
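
Since the question asks about specific tags rather than every node, a sketch along these lines might be closer to what is needed. The class name, the helper name and the XPath expression are assumptions made for illustration; the HtmlAgilityPack calls are the same ones used in the answer above.

```csharp
using System;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

static class AttributeStripper
{
    // Removes attributes only from the elements selected by the XPath expression.
    public static string RemoveAttributes(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
                node.Attributes.RemoveAll();
        }
        return doc.DocumentNode.OuterHtml;
    }

    static void Main()
    {
        string html = "<body lang=EN-US link=blue vlink=purple><p class=intro>Hello</p></body>";

        // Only <body> and <span> elements lose their attributes; <p> keeps its class.
        Console.WriteLine(RemoveAttributes(html, "//body | //span"));
    }
}
```

Because the parser knows where tags begin and end, script contents and attribute values containing > are not a problem here.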

Community
Oded
0

If your HTML isn't hopelessly broken, and the attributes don't contain > symbols, then it's as easy as:

<body.+?>

... and if you're looking to prevent XSS or something, disregard this.
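
One caveat worth spelling out: .+? needs at least one character, so on a <body> tag that has no attributes it keeps going until the next >. The sketch below compares it with a [^>]* variant, which is my substitution and not part of the answer; the input strings are made up.

```csharp
using System;
using System.Text.RegularExpressions;

class LazyDotDemo
{
    static void Main()
    {
        string withAttrs = "<body lang=EN-US link=blue vlink=purple>";
        string bare = "<body><p>text</p></body>";

        // Both patterns handle the tag from the question.
        Console.WriteLine(Regex.Replace(withAttrs, "<body.+?>", "<body>"));  // prints: <body>
        Console.WriteLine(Regex.Replace(withAttrs, "<body[^>]*>", "<body>")); // prints: <body>

        // On a bare <body>, ".+?" consumes the ">" and runs on to the next one,
        // swallowing the <p> tag that follows.
        Console.WriteLine(Regex.Replace(bare, "<body.+?>", "<body>"));   // prints: <body>text</p></body>
        Console.WriteLine(Regex.Replace(bare, "<body[^>]*>", "<body>")); // prints: <body><p>text</p></body>
    }
}
```

The [^>]* form still breaks on attribute values that contain >, as the answer already warns.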


If your attributes may contain other symbols, then here's a full example:

    string data = @"<body lang=""EN-US>"" link=blue vlink=purple>";
    Regex re = new Regex(@"<(body).*?(""[^""]+""[^"">]+)*>");

    Console.WriteLine(re.Replace(data, "<$1>")); // <body>

Note that the HTML still needs to be well-formed, though.

Ry-
0

In general it's not recommended to use regex to parse HTML, but if you have to use it
for your problem, something like the regex below will work.

In this regex, 'body' is OR'd with 'span' as an example. Also note that comments are ignored because they could hide html. Script is taken into account for the same reason.

I would leave the comment section in. You must be aware that scripts can alter the document rendering and use language constructs that can hide html that you may want to process. Of course that shouldn't be done with regex.

If you want, you can remove the 'script' sub-expression in the hopes of modifying possible string constants containing what you want to alter. Not recommended though.

Raw regex (modifiers: expanded, 'dot includes newlines')
In C# the regex capture groups can be named so that each OR'd sub-expression uses the same names. Example: (?<begin> ..) .. (?<end> ..) | (?<begin> ..) .. (?<end> ..)
so that the replacement is just ["begin"] + ["end"]. This is buggy in Perl 5.10, so I just use the capture buffer numbers; .NET might handle it correctly.

Search

 # (1,2)
   ( <!--.*?--> ) ()
|
 # (3,4)
   (
     (?:
        <script
          (?>
             (?:\s+(?:".*?"|'.*?'|[^>]*?)+)?
             \s*
        >
          )(?<!/> )
        .*?
        </script\s*>
      |
        </?script (?:\s+(?:".*?"|'.*?'|[^>]*?)+)? \s*/?>
     )
   ) ()
|
 # (5,6)
   ( <(?:body|span) ) (?!\s*/?>)
    \s+ (?:".*?"|'.*?'|[^>]*?)+ 
   ( /?> )

Replace

$1$2$3$4$5$6
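
The named-group idea mentioned above can be sketched in .NET as follows. This is a simplified pattern, not the full one: the script branch is left out, and the attribute matching is reduced to [^>]*?. The input string is made up. .NET allows the same group name in several alternatives, and a named group that did not take part in the match substitutes as an empty string.

```csharp
using System;
using System.Text.RegularExpressions;

class NamedGroupDemo
{
    static void Main()
    {
        // Simplified sketch of the pattern above; the script branch is omitted.
        const string pattern = @"
            (?<begin> <!--.*?--> )                               # comments pass through untouched
          |
            (?<begin> <(?:body|span) ) \s+ [^>]*? (?<end> /?> )  # tags whose attributes are dropped
        ";
        var re = new Regex(pattern,
            RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase);

        string html = "<!-- <body id=x> --><body lang=EN-US link=blue><span class=a>hi</span></body>";

        Console.WriteLine(re.Replace(html, "${begin}${end}"));
        // prints: <!-- <body id=x> --><body><span>hi</span></body>
    }
}
```

The tag inside the comment is left alone because the comment alternative matches first and is copied back unchanged.
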
0

The following regular expression strips the attributes from all HTML/XML start tags in a given string, provided the tag names consist of plain letters and digits and no attribute value contains a > character.

<([a-z][a-z0-9]*)\b[^>]*?(\s?/?)>

As a C# function this would look like this:

    public string RemoveAttributes(string value){
       // Group 1 is the tag name, group 2 an optional self-closing " /".
       var attributeClean = new System.Text.RegularExpressions.Regex(@"<([a-z][a-z0-9]*)\b[^>]*?(\s?/?)>", System.Text.RegularExpressions.RegexOptions.IgnoreCase);

       // Rebuild each tag from its name and closing part, dropping everything in between.
       // (Removing the whole match, as an earlier version did, would delete the tag itself.)
       return attributeClean.Replace(value, "<$1$2>");
    }

If you want to clean specific elements only, you can use the following regular expression with the replacement <$1$2>

<(li|body)\b[^>]*?(\s?/?)>

and add as many element names as you need to the first group, separated by a |. The \b keeps li from also matching the start of <link>.
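
If the list of elements changes often, the alternation could be built from a list of names. The helper below is hypothetical (its name and signature are mine) and has the same limits as any regex approach on HTML:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class TagCleaner
{
    // Hypothetical helper: strips attributes from the named elements only.
    public static string RemoveAttributesFrom(string html, params string[] tagNames)
    {
        // Escape each name so it is matched literally, then join into an alternation.
        string names = string.Join("|", tagNames.Select(n => Regex.Escape(n)));

        // Group 1: tag name. Group 2: optional self-closing " /".
        var re = new Regex(@"<(" + names + @")\b[^>]*?(\s?/?)>", RegexOptions.IgnoreCase);
        return re.Replace(html, "<$1$2>");
    }

    static void Main()
    {
        string html = "<body lang=EN-US><li class=a>x</li><link rel=y></body>";

        Console.WriteLine(RemoveAttributesFrom(html, "li", "body"));
        // prints: <body><li>x</li><link rel=y></body>
    }
}
```

The <link> tag is untouched because \b requires the tag name to end right after li.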

krizzzn