C# - Processing html tag attributes

Question

I'm getting some HTML data from a remote server, and before displaying it in the application's UI I need to make some changes, e.g. delete counters, replace links, etc. Removing a tag together with its contents or changing a specific link is not a big deal, but when it comes to more advanced processing, I have some problems. I need to replace (delete) a few HTML tag attributes (not the tag itself; there are plenty of examples of that on the internet). For example: delete all onmouseover handlers from buttons. I know that XPath would be a perfect fit for such problem, but I don't know it at all, and although my data is XHTML-compliant, it's stored in a string variable and not queryable :(. So I'm trying to use Regular Expressions to solve this problem, with no success for now. I guess it's a mistake in pattern...

public string Processing (string Source, string Tag, string Attribute)
{    
return System.Text.RegularExpressions.Regex.Replace(Source, string.Format(@"<{0}(\s+({1}=""([^""]*)""|\w+=(""[^""]*""|\S+)))+>", Tag, Attribute), string.Empty);
}

...

string before = @"<input type=""text"" name=""Input"" id=""Input"" onMouseOver=""some js to be eliminated"">";
string after = Processing(before,"input","onMouseOver");
// expected : <input type="text" name="Input" id="Input">

Alan Moore · Accepted Answer · 2009-03-22T04:24:18.657

1

That's an interesting approach, but like bobince said, you can only process one attribute per match. This regex will match everything up to the attribute you're interested in:

@"(<{0}\b[^>]*?\b){1}=""(?:[^""]*)"""

Then you use "$1" as your replacement string to plug back in everything but the attribute.

This approach requires you to make a separate pass over the string for each of your target tag/attribute pairs, and at the beginning of each pass you have to create and compile the regex. Not very efficient, but if the string isn't too large it should be okay. A much bigger problem is that it won't catch duplicate attributes; if there are two "onmouseover" attributes on a button, you'll only catch the first one.

If I were doing this in C# I would probably use the regex to match the target tag, then use a MatchEvaluator to remove all of the target attributes at once. But seriously, if the string really is well-formed XML, there's no excuse for not using XML-specific tools to process it--this is what XML was invented for.
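The MatchEvaluator idea could look roughly like the sketch below. It is only an illustration under stated assumptions: attribute values are double-quoted, no '>' appears inside an attribute value, and matching is case-insensitive so that "onmouseover" and "onMouseOver" are both caught. The class and method names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeStripper
{
    // Hypothetical helper: removes every copy of `attribute` from each <tag ...> opening tag.
    public static string RemoveAttribute(string source, string tag, string attribute)
    {
        // Outer pattern: one whole opening tag, e.g. <input ...> or <input ... />.
        // Assumes no '>' inside attribute values.
        var tagPattern = new Regex(
            @"<" + Regex.Escape(tag) + @"\b[^>]*>",
            RegexOptions.IgnoreCase);

        // Inner pattern: the attribute with a double-quoted value,
        // together with the whitespace in front of it.
        var attrPattern = new Regex(
            @"\s+" + Regex.Escape(attribute) + @"\s*=\s*""[^""]*""",
            RegexOptions.IgnoreCase);

        // The lambda is the MatchEvaluator: it runs once per matched tag
        // and strips all occurrences of the attribute inside that tag.
        return tagPattern.Replace(
            source,
            m => attrPattern.Replace(m.Value, string.Empty));
    }
}
```

Called as AttributeStripper.RemoveAttribute(html, "input", "onMouseOver"), this handles duplicate attributes in a single pass, which the one-attribute-per-match regex above does not.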

edited Mar 22 '09 at 04:24

answered Mar 21 '09 at 05:25


It seems like the closing round bracket of the group is missing (regex doesn't compile). Fixed expression : @"(<{0}\b[^>]*?\b)({1}=""(?:[^""]*)"")" – Jaded Mar 21 '09 at 09:46
And, of course, thanks a lot, your hint is actually what I needed. – Jaded Mar 21 '09 at 10:37
Oops. Actually, the opening round bracket just before the {1} shouldn't be there. There's no point capturing the attribute, since all you're doing is deleting it. – Alan Moore Mar 21 '09 at 15:30
I keep forgetting we can edit our answers forever on SO. Regex fixed. – Alan Moore Mar 22 '09 at 04:26

score 0 · Answer 2 · answered Mar 20 '09 at 23:51

I know that XPath would be a perfect fit for such problem

Quite so. Or any other XML parser-based technique, such as DOM methods.

It's really not a hard thing to learn: stuff your string into the XmlDocument.LoadXml() method, then call SelectNodes() on it with something like '//tagname[@attrname]' to get a list of elements with the unwanted attribute, and remove the attribute from each. Peasy.
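As a rough sketch of that approach, assuming the input is well-formed XML and the elements are not in a default namespace: if the markup declares xmlns="http://www.w3.org/1999/xhtml", the XPath would need a prefix registered through an XmlNamespaceManager instead. The class and method names here are made up for the example.

```csharp
using System.Linq;
using System.Xml;

public static class XmlAttributeStripper
{
    // Hypothetical helper: loads the markup as XML and removes `attribute`
    // from every `tag` element that carries it.
    public static string RemoveAttribute(string xml, string tag, string attribute)
    {
        var doc = new XmlDocument();
        doc.PreserveWhitespace = true;   // keep the original whitespace between nodes
        doc.LoadXml(xml);                // throws XmlException if the input is not well-formed

        // XPath names are case-sensitive: "onMouseOver" will not match "onmouseover".
        string xpath = "//" + tag + "[@" + attribute + "]";

        // Copy the matches to a list first, so the document is not modified
        // while the node list is still being enumerated.
        var elements = doc.SelectNodes(xpath).Cast<XmlElement>().ToList();
        foreach (XmlElement element in elements)
        {
            element.RemoveAttribute(attribute);
        }

        return doc.OuterXml;
    }
}
```

For instance, XmlAttributeStripper.RemoveAttribute(fragment, "input", "onMouseOver") returns the serialized document with the attribute gone. The output is re-serialized XML, so details such as the spacing before "/>" may differ from the input.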

I'm trying to use Regular Expressions to solve this problem, with no success

What is it with regexes? People keep using them even when they know it's the wrong thing, even though they're frequently unreadable and difficult to get right (as the endless “why doesn't my regex work?” questions demonstrate).

So what's so attractive about the damned things? There are several questions on SO every day about parsing [X][HT]ML with regex, all answered “don't use regex, regex is not powerful enough to parse HTML”. But somehow it never gets through.

I guess it's a mistake in pattern...

Well, the pattern appears to be trying to match entire tags to replace with an empty string, which isn't what you want. Instead you'd want to be targeting just the attribute; then, to ensure only attributes inside a “<tag ...>” counted, you'd have to use a lookbehind assertion, “(?<=<tag )”. But you usually can't have a variable-length lookbehind assertion, which you would need to allow other attributes to come between the tag name and the targeted attribute.
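One caveat on the lookbehind point: .NET's regex engine is among the few that do accept variable-length lookbehind, so in C# specifically the idea can be written down directly. A minimal sketch, assuming double-quoted attribute values and no '>' inside them; the class and method names are made up for illustration, and this remains a regex workaround rather than real parsing.

```csharp
using System.Text.RegularExpressions;

public static class LookbehindStripper
{
    // Hypothetical helper: deletes `attribute` (with its leading whitespace)
    // wherever it appears inside an opening <tag ...>.
    public static string RemoveAttribute(string source, string tag, string attribute)
    {
        // (?<=<tag\b[^>]*?) : variable-length lookbehind, requires that an unclosed
        //                     "<tag" precedes the current position.
        // \s+attr="[^"]*"   : the attribute itself; this is the only text consumed.
        string pattern =
            @"(?<=<" + Regex.Escape(tag) + @"\b[^>]*?)\s+" +
            Regex.Escape(attribute) + @"=""[^""]*""";

        return Regex.Replace(source, pattern, string.Empty);
    }
}
```

Because only the attribute is consumed by each match, repeated attributes in the same tag are removed in one pass. Matching is case-sensitive here, so "onMouseOver" and "onmouseover" are treated as different names unless RegexOptions.IgnoreCase is added.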

Also your ‘\S+’ clause has the potential to gobble up large amounts of unintended content. As you've got well-formed XHTML, you're guaranteed properly quoted attributes, so you don't need that anyway.

But the mistake is not the pattern. It is regex.

Sure. Regex are useful for many problems. But if the questions on SO are anything to go by — and judging by the amount of real-world coding horror I've seen, they probably are — a majority of regex usage is totally inappropriate. — bobince, Mar 20 '09 at 23:57
Well... I thought Regular Expressions were better than something like Source.Substring(Source.IndexOf(Attribute), Attribute.Length + ParameterLength) or similar... Plus, the document I'm working with appears not to be fully XHTML-compliant. It has an XML namespace included, but fails validation. – Jaded, Mar 21 '09 at 09:40
“Validation” is not important for processing it as XML, it only has to be “well-formed”. Otherwise, there are HTML parsers such as the Agility Pack that are still much, much easier than trying to hack out a regex. — bobince, Mar 21 '09 at 13:06

score 0 · Answer 3 · answered Mar 21 '09 at 10:36

So, the rewritten code is :

public static string Process(string Source, string Tag, string Attribute)
{
        return Regex.Replace(Source, string.Format(@"(<{0}\b[^>]*?\b)({1}=""(?:[^""]*)"")", Tag, Attribute), "$1");                  
}

I've tested it, and it works fine.

string before = @"<input type=""text"" name=""Input"" id=""Input"" onMouseOver=""some js to be eliminated1""/>"
        + "\r\n" + @"<input type=""text"" name=""Input2"" id=""Input2"" onMouseOver=""some js to be eliminated2"">"
        + "\r\n" + @"<input type=""text"" name=""Input3"" id=""Input3"" onMouseOver=""some js to be eliminated3"">";            
string after = Process(before, "input", "onMouseOver");
//<input type="text" name="Input" id="Input" />
//<input type="text" name="Input2" id="Input2" >
//<input type="text" name="Input3" id="Input3" >

For now the problem is solved. I'd like to try an XML-based approach, but it seems that before creating an XmlDocument I need to rework the input HTML again, because according to the W3C validator it has errors. It starts as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <HTML xmlns="http://www.w3.org/1999/xhtml">
    <HEAD>
    <TITLE>page title</TITLE>

On LoadXml I get a System.Xml.XmlException saying that the '>' marker is not acceptable (line 1, position 63). Adding the document type definition below causes the same exception, but this time saying that the '--' marker is incorrect and '>' was expected.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/strict.dtd">

Any ideas? Or should I let it go? :)

If it says <HTML> in upper case, it's not XHTML. Probably the original legacy-HTML doctype is more appropriate and the ‘xmlns’ is just lies. – bobince, Mar 21 '09 at 13:08
(And we can't see it from the input posted, but the error about ‘--’ is usually a sign of a broken comment, i.e. one with ‘--’ inside it, which is invalid in both HTML and XHTML, but will be handled OK by browsers and the Agility Pack.) – bobince, Mar 21 '09 at 13:09
