2

Imagine the following HTML:

<div>
  <b></b>
  <div>
    <table>...</table>
  </div>
</div> <!-- this one -->
...

How can I find the matching closing tag for the first opening div tag? Is there a regex that could find it? I guess this is quite a common requirement, but I'm struggling to find anything straightforward, just full-blown HTML parsers.

Sevki
cusimar9
  • 2
    What do you mean by "find an end tag"? What do you want to do with it? – SLaks Apr 28 '11 at 13:35
  • 2
    A regex will be a world of pain. You must use an HTML parser. – Richard H Apr 28 '11 at 13:36
  • 2
    You can use Html Agility Pack for that: http://htmlagilitypack.codeplex.com/ – Govind Malviya Apr 28 '11 at 13:36
  • 1
    http://stackoverflow.com/questions/841310/html-parser, http://stackoverflow.com/questions/1282258/net-html-parser, http://stackoverflow.com/questions/857912/net-html-dom-parser, http://stackoverflow.com/questions/100358/looking-for-c-html-parser – Grant Thomas Apr 28 '11 at 13:37
  • ... and what about self-closing tags like [br /] [hr /] [input /] [image /] etc? – RichardW1001 Apr 28 '11 at 13:38
  • You can't use a regex: a regex is a finite automaton, and you face a tree of potentially infinite depth. A regex that only handled some reasonable maximum depth would be painfully complex and unmaintainable. – 9000 Apr 28 '11 at 13:46
  • "What do you want to do with it?" Well, one thing that this could be useful for is to delete the whole element. – Random832 Apr 28 '11 at 13:57

5 Answers

4

No.

Use a full-blown HTML parser. There's a reason they exist.

karlgrz
3

Use Html Agility Pack.

jgauffin
Govind Malviya
3

I'm assuming that you have already tokenized the HTML into tags. Create a stack: every time you see an opening tag, push it; every time you see a closing tag, pop, and check that the tag you popped matches the closing tag. The closing tag that empties the stack is the one that matches the first opening tag.

But there are already HTML parsers for this so search for one on codeplex.
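
The stack approach described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the tokenizer is a deliberately naive regex (`TAG_RE`, a hypothetical helper) that will be fooled by tags inside comments, scripts, or attribute values containing `>`, and the `VOID_TAGS` set is a partial list. The function name `find_matching_close` is made up for this example.

```python
import re

# Naive tag tokenizer (assumption: no '>' inside attribute values,
# no tags hidden inside comments or <script> blocks).
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>")

# Partial list of elements that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


def find_matching_close(html, start=0):
    """Return the (start, end) offsets of the closing tag that matches
    the first opening tag found at or after `start`, or None."""
    stack = []
    for m in TAG_RE.finditer(html, start):
        is_closing = bool(m.group(1))
        name = m.group(2).lower()
        is_self_closing = bool(m.group(3))

        if is_self_closing or name in VOID_TAGS:
            continue  # nothing to push or pop
        if not is_closing:
            stack.append(name)
            continue
        # Closing tag: it must match the most recently opened tag.
        if not stack or stack[-1] != name:
            return None  # malformed nesting; give up
        stack.pop()
        if not stack:
            return m.span()  # the stack is empty: this is the match
    return None
```

Called on the HTML from the question, `find_matching_close(html)` should return the offsets of the final `</div>`. Real-world HTML routinely breaks the assumptions above (unclosed `<p>` or `<li>`, mismatched tags), which is why the advice to use an existing parser still stands.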

Sevki
1

Well, you need to have a clear view of the syntax! However, regular expressions are very limited in scope, and I wouldn't recommend using them for syntax that spans multiple lines or tags.

Instead, you need to track each tag (open/close) and use a handler to deal with your request. You could use Lex/Yacc tools, but that may be overkill. Depending on the language you use, you may already have modules for this purpose (like HTMLParser in Python).
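
As a rough illustration of the handler idea with Python's built-in parser (in Python 3 the class lives in `html.parser`), the sketch below counts nesting depth and records where the depth returns to zero. The class name `CloseTagFinder`, its `close_pos` attribute, and the partial `VOID_TAGS` set are assumptions made for this example, not part of the library.

```python
from html.parser import HTMLParser

# Partial list of elements that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class CloseTagFinder(HTMLParser):
    """Records the position of the end tag that closes the first start tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.close_pos = None  # (line, column) once the match is found

    def handle_starttag(self, tag, attrs):
        if self.close_pos is None and tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.close_pos is not None or self.depth == 0 or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            # getpos() reports the 1-based line and 0-based column
            # the parser is currently at.
            self.close_pos = self.getpos()


finder = CloseTagFinder()
finder.feed("<div>\n  <b></b>\n  <div>\n  </div>\n</div> <!-- this one -->\n")
print(finder.close_pos)
```

This sketch only counts depth and does not check that tag names pair up correctly, so it is best treated as a starting point rather than a robust solution.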

dcexcal
-1

There's always LinqToXml if you want to parse HTML and don't need every little detail.

LueTm