I want a regular expression to remove the following:

<a class="a" href="a.com">string</a>

What I want: if the tag has a class attribute, the whole tag (<a class="a" href="a.com"></a>) should be removed and the string between the tags (string) kept; otherwise the tag should be left as it is.

joce
ykh

2 Answers

I suggest using an HTML parser such as the HTML Agility Pack instead of trying to do this with regex. Regex is not a good tool for parsing general HTML, as this answer explains.

The download comes with a bunch of Visual Studio projects as examples for usage.

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

Html Agility Pack now supports Linq to Objects (via a LINQ to Xml Like interface). Check out the new beta to play with this feature
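
To make the parser-based approach concrete, here is a minimal sketch. The Html Agility Pack is a .NET library, so this is not its API; it shows the same idea with Python's standard `html.parser` instead. The names `ClassTagUnwrapper` and `unwrap_class_tags` are made up for illustration, and the sketch assumes comments and doctype declarations can be dropped.

```python
from html.parser import HTMLParser


class ClassTagUnwrapper(HTMLParser):
    """Drop tags that carry a class attribute, but keep the text inside them."""

    def __init__(self):
        # Keep entities such as &amp; unconverted so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.out = []        # pieces of the rewritten markup
        self.open_tags = []  # stack of (tag name, was the start tag dropped?)

    def handle_starttag(self, tag, attrs):
        dropped = any(name == "class" for name, _ in attrs)
        self.open_tags.append((tag, dropped))
        if not dropped:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags (<br/>) have no separate end tag to track.
        if not any(name == "class" for name, _ in attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Find the nearest open tag with this name; this tolerates
        # unclosed tags such as <br> sitting above it on the stack.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                dropped = self.open_tags[i][1]
                del self.open_tags[i:]
                if dropped:
                    return
                break
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def unwrap_class_tags(html):
    """Return html with every class-bearing tag removed and its text kept."""
    parser = ClassTagUnwrapper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(unwrap_class_tags('<a class="a" href="a.com">string</a>'))
print(unwrap_class_tags('<a href="a.com">string</a>'))
```

Unlike a single regular expression, this also copes with nested tags, for example a `<span class="c">` inside a `<p>`.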

Oded
Given that you want to parse HTML, it's far better to use an HTML parser, as others have already recommended.

But since you want regex, I've come up with this: http://regexr.com?2vuqs

<([^ ]+)([ \t]+[a-zA-Z-]+=(["'])[^\3]+?\3)*[ \t]+class=(["'])[^\4]+?\4([ \t]+[a-zA-Z-]+=(["'])[^\6]+?\6)*>([^<]+)</(\1)>

It's not foolproof, but it should handle most situations. Check the link to see it working.

Mikulas Dite
  • The regex you wrote does the job, but it is missing one thing: the string between the tags is removed. Can you alter the regex to keep the string between the tags? – ykh Feb 08 '12 at 12:41
  • @user733659 Do you want to *retrieve* the string, or remove the tag around it and keep the string in the text? Either way, use a regex replace, and replace the match with group `$7` (the text inside the tag) rather than with an empty string. – Mikulas Dite Feb 08 '12 at 13:39
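
The replace-with-group-7 advice can be sketched as follows. This is an illustration in Python's `re` module, not the asker's environment, where the replacement token would be `$7`; Python writes it as `\7`. The pattern is the answer's expression with one assumption added: a `+?` quantifier on the class value, so that class names longer than one character also match. `TAG_WITH_CLASS` and `strip_class_tags` are hypothetical names.

```python
import re

# Groups: 1 = tag name, 2 and 5 = attributes before/after class,
# 3, 4 and 6 = the quote characters, 7 = text inside the tag, 8 = closing tag name.
# Note: in Python's re, \3 inside [...] is not a backreference but the
# control character \x03, so [^\3]+? effectively means "any characters, lazily".
TAG_WITH_CLASS = re.compile(
    r"""<([^ ]+)([ \t]+[a-zA-Z-]+=(["'])[^\3]+?\3)*[ \t]+class=(["'])[^\4]+?\4([ \t]+[a-zA-Z-]+=(["'])[^\6]+?\6)*>([^<]+)</(\1)>"""
)


def strip_class_tags(html):
    """Replace each matched tag with its inner text (group 7)."""
    return TAG_WITH_CLASS.sub(r"\7", html)


print(strip_class_tags('<a class="a" href="a.com">string</a>'))
print(strip_class_tags('<a href="a.com">string</a>'))
```

A tag without a class attribute does not match the pattern, so it is left untouched, which is the "else keep it as it is" half of the question.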