-1

I'm looking for a regular expression to isolate an HTML tag. This includes the TAG, the ATTRIBUTES, and the CONTENT inside.

Let's say I have this:

    <html>
    <body>
    aajsdfkjaskd
    <TAGNAME name="bla" context="non">hfdfhdj </TAGNAME>
    </body>
    </html>

I need a regular expression that would return:

    <TAGNAME name="bla" context="non">hfdfhdj </TAGNAME>

Thanks, Joe

orit cohen
  • [Don't use regexes for parsing HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – John Conde Jul 11 '12 at 15:11
  • In the **general** case, that's not possible. If there are specific constraints on the nature of the HTML surrounding and/or contained in the tag, you should describe those. – Pointy Jul 11 '12 at 15:11
  • @Pointy: I believe this is possible with C# regexes (which support balanced matching). And I believe that nobody really wants to do that =) – Jens Jul 11 '12 at 15:16

5 Answers

2

Don't use a regex; use an HTML parser instead. It is much more reliable and easier to work with.

If you're a PHP developer, I recommend this one (http://simplehtmldom.sourceforge.net/).
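For readers who are not on PHP, the same parser-based idea can be sketched with Python's standard-library `html.parser`. This is only an illustration under stated assumptions: `TagExtractor`, `extract`, and `page` are hypothetical names made up for this example, and the sketch returns the original source text of each outermost matching element by tracking offsets rather than rebuilding the markup.

```python
from html.parser import HTMLParser


class TagExtractor(HTMLParser):
    """Return the original source text of each outermost <tag>...</tag>."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()  # html.parser reports tag names in lowercase
        self.depth = 0          # how many matching elements are currently open
        self.start = None       # offset where the outermost open tag began
        self.spans = []         # (start, end) offsets of complete elements

    def extract(self, html):
        self.html = html
        # getpos() reports (line, column); keep line starts to get offsets.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(html) if ch == "\n"]
        self.feed(html)
        self.close()
        return [html[a:b] for a, b in self.spans]

    def _offset(self):
        line, column = self.getpos()
        return self.line_starts[line - 1] + column

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self.start = self._offset()
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # The handler fires at the start of the closing tag,
                # so the element ends at the next ">".
                end = self.html.index(">", self._offset()) + 1
                self.spans.append((self.start, end))


page = """<html>
<body>
aajsdfkjaskd
<TAGNAME name="bla" context="non">hfdfhdj </TAGNAME>
</body>
</html>"""

print(TagExtractor("TAGNAME").extract(page))
```

Because the parser counts opening and closing tags, nested elements with the same name come back as one outer element instead of being cut off at the first closing tag.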

1

Look into the HTML Agility Pack; it will make things a lot easier.

Stephen Gilboy
0

Use this regex: <TAGNAME.+?</TAGNAME>
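As a rough illustration of how this pattern behaves, here is a minimal sketch in Python (the question does not name a language, so Python is an assumption, as are the variable names). The `re.DOTALL` flag lets `.` match newlines, which matters when the tag's content spans several lines. The second case shows the main limitation: with nested tags of the same name, the lazy match stops at the first closing tag.

```python
import re

page = """<html>
<body>
aajsdfkjaskd
<TAGNAME name="bla" context="non">hfdfhdj </TAGNAME>
</body>
</html>"""

# Same pattern as above; DOTALL makes "." match newlines as well.
pattern = re.compile(r"<TAGNAME.+?</TAGNAME>", re.DOTALL)

match = pattern.search(page)
print(match.group(0))

# Limitation: the lazy quantifier ends at the FIRST closing tag,
# so a nested element is returned truncated.
nested = '<TAGNAME id="outer">a<TAGNAME id="inner">b</TAGNAME>c</TAGNAME>'
truncated = pattern.search(nested).group(0)
print(truncated)
```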

burning_LEGION
0

If this is the main thing you're trying to do, XSLT is a good tool for it. You can easily select just TAGNAME and copy over the attributes and text. See http://www.w3schools.com/xsl/ for an intro.

rene
WBT
0

First of all: don't do this. Parsing HTML with regex is a maintenance nightmare and will most probably fail on any real-world example of HTML. There are better options, such as an HTML parser like the HTML Agility Pack.

To answer your question though, the following regex will do what you want if the HTML code

  • is well formed (no missing closing tag, etc)
  • does not contain comments with "TAGNAME" in them
  • does not contain script blocks with "TAGNAME" in them
  • maybe more

It can be expanded to cover some of these cases, but you really don't want to =)

    <TAGNAME(<TAGNAME (?<tagcounter>)|</TAGNAME>(?<-tagcounter>)|.)*</TAGNAME>(?(tagcounter)(?!))

You'd need RegexOptions.Singleline, too, so that . also matches newlines. See it in action at Ideone.com
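The `(?<tagcounter>)` / `(?<-tagcounter>)` constructs are .NET balancing groups and are not available in most other regex engines. For comparison, here is a hedged sketch of the same counting idea done by hand in Python. It is a stand-in, not a translation of the regex above: `outermost_elements` is a hypothetical helper, and it relies on the same assumptions listed earlier (well-formed input, no comments or script blocks mentioning the tag).

```python
import re


def outermost_elements(html, tag="TAGNAME"):
    """Return each outermost <tag ...>...</tag> by counting nesting depth."""
    name = re.escape(tag)
    # Match either an opening tag prefix or a complete closing tag.
    token = re.compile(rf"<{name}\b|</{name}>")
    depth, start, found = 0, None, []
    for m in token.finditer(html):
        if m.group(0).startswith("</"):
            if depth == 0:
                continue  # stray closing tag: ignore it
            depth -= 1    # plays the role of (?<-tagcounter>)
            if depth == 0:
                found.append(html[start:m.end()])
        else:
            if depth == 0:
                start = m.start()
            depth += 1    # plays the role of (?<tagcounter>)
    return found


nested = 'pre <TAGNAME a="1">x<TAGNAME b="2">y</TAGNAME>z</TAGNAME> post'
print(outermost_elements(nested))
```

An element that is never closed is simply not returned, which mirrors the final `(?(tagcounter)(?!))` check rejecting an unbalanced match.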

Jens