Remove attributes with whitelist

Question

I need to remove attributes from a string with tags.

Here is the C# code:

strContent = Regex.Replace(strContent, @"<(\w+)[^>]*(?<=( ?/?))>", "<$1$2>",
    RegexOptions.IgnoreCase);

For example, this code will replace

This is some <div id="div1" class="cls1">content</div>. This is some more <span 
id="span1" class="cls1">content</span>. This is <input type="readonly" id="input1" 
value="further content"></input>.

with

This is some <div>content</div>. This is some more <span>content</span>. This is 
<input></input>.

But I need a "whitelist" of tags whose attributes are kept. In the above example, the attributes of the "input" tag must not be removed, so the output should be:

This is some <div>content</div>. This is some more <span>content</span>. This is 
<input type="readonly" id="input1" value="further content"></input>.

Appreciate your help on this.

Trying to parse HTML with regex is DOOMED. Have you considered the HTML Agility Pack (loads HTML into a DOM like `XmlDocument`) or similar? Obligatory reading: http://stackoverflow.com/a/1732454/23354 - Marc Gravell, Nov 27 '13 at 09:14
While I know regex is doomed for parsing HTML, this application of regex doesn't care that the input is HTML. You could replace the tag delimiters `<` and `>` with `"`s and then say "I want to cull each quoted string to only its first word unless the first word of the quote is `input`". - OGHaza, Nov 27 '13 at 09:28

OGHaza · Answer 1 · 2013-11-27T09:39:55.647

For your example you could use:

(<(?!input)[^\s>]+)[^>]*(>)

Replace with $1$2.

I'm not sure how you plan to specify the whitelist though. If you can hardcode it, you can simply add more (?!whitelistTag) lookaheads to the above, which could also be done programmatically pretty easily from an array.
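
As a rough sketch of building the pattern from an array, something like the following might work. The names AttributeStripper and StripAttributes are hypothetical, and the \b after each tag name is an addition not in the pattern above; it is there so that a whitelisted "input" does not also protect a tag such as "inputs". Like the original pattern, this assumes attribute values never contain ">".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AttributeStripper
{
    // Hypothetical helper: strips attributes from every tag except those
    // whose name appears in the whitelist.
    public static string StripAttributes(string content, params string[] whitelist)
    {
        // One negative lookahead per whitelisted tag, e.g. (?!input\b)(?!textarea\b).
        // Regex.Escape guards against metacharacters in the tag names.
        // The \b is an assumption added here so only whole tag names match.
        string lookaheads = string.Concat(
            whitelist.Select(tag => "(?!" + Regex.Escape(tag) + @"\b)"));

        // Group 1: "<" plus the tag name (or "/name" for closing tags).
        // Group 2: the closing ">". Everything in between is dropped.
        string pattern = "(<" + lookaheads + @"[^\s>]+)[^>]*(>)";

        return Regex.Replace(content, pattern, "$1$2", RegexOptions.IgnoreCase);
    }
}
```

Calling AttributeStripper.StripAttributes(strContent, "input") on the example from the question should leave the input tag untouched while reducing the div and span tags to their bare names. Passing more tag names simply adds more lookaheads.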

Working on RegExr

In response to the usual "You should not parse HTML with regex", you can rephrase the problem as:

This is a "quoted string", cull each "quoted string to its" first word unless the "string starts with" the word "string, like these last two".

Would you claim that regex shouldn't be used to solve that problem? Because it's exactly the same problem. Of course an HTML parser can be used for the job, but it hardly invalidates the idea of using regex for the same thing.
