I've had a look and there don't seem to be any old questions that directly address this. I also haven't found a clear solution anywhere else.

I need a way to match a tag, open to close, and return everything enclosed by the tag. The regexes I've tried have problems when tags are nested. For example, the regex `<tag\b[^>]*>(.*?)</tag>` will cause trouble with `<tag>Some text <tag>that is nested</tag> in tags</tag>`. It will match `<tag>Some text <tag>that is nested</tag>`.

I'm looking for a solution to this, ideally an efficient one. I've seen solutions that match start and end tags separately and keep track of their indexes in the content to work out which tags go together, but that seems wildly inefficient to me (if it's the only possible way then c'est la vie).
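For reference, the index-tracking approach described above can be done in a single pass with a depth counter. This is only a rough sketch under some assumptions: the function name `outermostTagContents` is made up for illustration, it needs PHP 7.1 or later, and it knows nothing about comments, attribute values containing `>`, or self-closing tags.

```php
<?php
// Hypothetical helper: returns the inner content of each outermost <$tag>
// element by scanning open/close tags once and tracking nesting depth.
function outermostTagContents(string $html, string $tag): array
{
    $name = preg_quote($tag, '~');
    // Group 1 is "/" for a closing tag and "" for an opening tag.
    $pattern = '~<(/?)' . $name . '\b[^>]*>~i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $results = [];
    $depth = 0;
    $start = 0;
    foreach ($matches as $m) {
        [$text, $offset] = $m[0];
        $isClose = $m[1][0] === '/';
        if (!$isClose) {
            if ($depth === 0) {
                // Content starts right after the outermost opening tag.
                $start = $offset + strlen($text);
            }
            $depth++;
        } elseif ($depth > 0) {
            $depth--;
            if ($depth === 0) {
                // Content ends right before the matching closing tag.
                $results[] = substr($html, $start, $offset - $start);
            }
        }
    }
    return $results;
}
```

The scan is linear in the number of tags found, so it may not be as inefficient as it first sounds, but it is still string matching rather than parsing.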

The solution must be PHP only, as this is the language I have to work with. I'm parsing HTML snippets (think body sections from a WordPress blog and you're not too far off). If there is a better-than-regex solution, I'm all ears!

UPDATE:

Just to make it clear, I'm aware regexes are a poor solution but I have to do it somehow which is why the title specifically mentions better solutions.

FURTHER UPDATE:

I'm parsing snippets. Solutions should take this into account. If the parser only works on a full document, or is going to add <head> etc. when I get the HTML back out, it's not an acceptable solution.

hakre
Endophage
  • 3
    How could you miss this famous question (and answer): http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags –  Mar 02 '11 at 00:38
  • @Tim Cooper Saw that, doesn't actually provide an answer. If there isn't a better than regex solution, I'm going to have to use a regex... – Endophage Mar 02 '11 at 00:39
  • There is almost always a better than regex solution, and the linked page does provide several of them in the answers. They are called parsers. – glomad Mar 02 '11 at 00:43
  • The famous answer is as famous as wrong. Regular expressions can match nested tags. It's just prohibitively more complex and requires recursive `(?R)` patterns. It's way easier to use [phpquery or querypath](http://stackoverflow.com/questions/3650125/how-to-parse-html-with-php). – mario Mar 02 '11 at 00:44
  • 3
    Calling the famous answer wrong is by now almost as famous as the famous answer. :) The fact is that the famous answer is as good as right for nearly every case. If you insist on building a huge complicated regex to parse complex HTML, you are creating more work for yourself and anyone who has to maintain your code later. And insisting "yes it can be done" does not help anyone. – glomad Mar 02 '11 at 00:53
  • @mario: But if you want to get even really really really more technical, the famous answer *is* correct because HTML is a Chomsky type 2 grammar and is therefore outside of the domain that is describable by regular expressions. A regular expression is not equivalent to a parser. At least that's what they told me at the liquor store. //edit... OK, your comment is gone. – glomad Mar 02 '11 at 01:37
  • @mario: From the OP: "I'm parsing html snippets". Anyway, I totally agree with you about keeping real-world requirements in mind. It's just that, except for the most simple, rigid, unchanging HTML structures, I think using a regex for this purpose is always the less practical choice. In this case, where the HTML is of indeterminate structure, I think the line has been crossed into "regexes are not the right tool" territory. – glomad Mar 02 '11 at 02:10
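For anyone curious what the recursive `(?R)` approach mentioned in the comments might look like, here is a minimal sketch. It assumes PCRE as shipped with PHP; the function name `firstBalancedTagContent` is invented for this example. It only balances one tag name and ignores comments, CDATA and attribute values containing `>`, which is part of why the comments above steer towards a parser.

```php
<?php
// Hypothetical helper: returns the inner content of the first balanced
// <$tag>...</$tag> element, or null if there is none.
function firstBalancedTagContent(string $html, string $tag): ?string
{
    $name = preg_quote($tag, '~');
    $pattern = '~<' . $name . '\b[^>]*>'      // opening tag
        . '('
        .   '(?:'
        .     '[^<]++'                        // text without "<"
        .     '|<(?!/?' . $name . '\b)'       // a "<" that starts some other tag
        .     '|(?R)'                         // a nested, balanced <$tag> element
        .   ')*+'
        . ')'
        . '</' . $name . '\s*>~i';            // matching closing tag

    // preg_match() returns 1 on a match, 0 on no match, false on error
    // (for example when a PCRE limit is hit on deeply nested input).
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}
```

Deeply nested or very large input can run into the `pcre.recursion_limit` and `pcre.backtrack_limit` settings, so treat this as a demonstration rather than something to maintain.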

2 Answers

2

As always, you simply cannot parse HTML with regex because it is not a regular language. You either need to write a real HTML parser, or use a real HTML parser (that someone's already written). For reasons that should be obvious, I recommend the latter option.

Relevant questions

Matt Ball
  • As I said, I'm aware of the limitations of regexes, but I have to do it somehow. I'll read your links. – Endophage Mar 02 '11 at 00:44
  • @Endophage: the "somehow" is _using an HTML parser._ – Matt Ball Mar 02 '11 at 00:45
  • @Matt Ball as I asked Andrew, do you know of a parser that plays nicely with snippets? – Endophage Mar 02 '11 at 00:47
  • 1
    @Endophage: I'm really not a PHP guy; if [`DOMDocument->loadHTML()`](http://docs.php.net/manual/en/domdocument.loadhtml.php) can't handle standalone snippets (I don't know if this is the case or not), then just wrap them in `` and problem's solved. – Matt Ball Mar 02 '11 at 00:50
  • 1
    @Matt Ball It's a workaround... in general I try to avoid that kind of thing so that somebody later down the road doesn't break it (I couldn't even try to count how many people I know ignore comments). As nobody can tell me if `DOMDocument->loadHTML()` can handle snippets I guess I'll have to go away and test it. – Endophage Mar 02 '11 at 00:55
  • 1
    I think the term you're looking for is "document fragments", not "snippets". Maybe this will help your search. – glomad Mar 02 '11 at 00:59
  • `DOMDocument->loadHTML()` adds a DTD and `<html>`, `<body>` tags to the returned string... Then I would have to use a regex to match on the `<body>` tags to get just the content out. Seems like I can't escape from them entirely. – Endophage Mar 02 '11 at 01:07
  • 1
    Not really. Since the added DTD and tags do not change, you can use a simple str_replace() to get rid of them. Also, it makes sense that these are added, because (X|HT)ML can only be parsed or validated against a schema of some kind. Otherwise there are no rules on which tags are allowed in which contexts. – glomad Mar 02 '11 at 01:12
  • @ithcy I understand why the DTD and tags are added, but stripping them with `str_replace()` isn't an option unless somebody can guarantee that the DTD added will never change. I can't write this and have it break in 6 months when the servers get a PHP upgrade. – Endophage Mar 02 '11 at 01:16
  • 1
    True. If you wanted to be tricky, you could first use loadHTML() to parse a bogus document: `loadHTML('')`. Then get the strpos() of your bogus tag in the output, and you know that everything before that position and after (that position + bogus tag length) is junk added by loadHTML() and so you don't have to hard-code it into your str_replace(). Or, since you're not parsing, you could just use a regex :) Good luck. – glomad Mar 02 '11 at 01:25
1

Why not just use DOMDocument::loadHTML? It uses libxml under the hood which is fast and robust.
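On the snippet concern raised in the comments: one way around the added DTD and wrapper tags is to serialise individual nodes with `DOMDocument::saveHTML($node)` (the node argument is available from PHP 5.3.6) instead of dumping the whole document. A rough sketch follows, with assumptions: the helper name `innerHtmlOfOutermost` is made up, `$tag` is taken to be a plain element name, and character encoding is not addressed here.

```php
<?php
// Hypothetical helper (not from the thread): returns the inner HTML of every
// outermost <$tag> element found in an HTML fragment.
function innerHtmlOfOutermost(string $fragment, string $tag): array
{
    if (trim($fragment) === '') {
        return []; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting PHP warnings
    // for sloppy or non-standard markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select elements named $tag that have no ancestor of the same name.
    // Assumes $tag is a plain element name such as "div".
    $name = strtolower($tag);
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query("//{$name}[not(ancestor::{$name})]");

    $results = [];
    foreach ($nodes as $node) {
        $inner = '';
        foreach ($node->childNodes as $child) {
            // saveHTML($child) serialises only that node, so the DOCTYPE and
            // the <html>/<body> wrappers never reach the output.
            $inner .= $doc->saveHTML($child);
        }
        $results[] = $inner;
    }
    return $results;
}
```

Because only the children of the matched elements are serialised, nothing depends on the exact DTD string that `loadHTML()` adds, so no `str_replace()` on the wrapper is needed. Note that the parser may normalise the markup, so the output is not guaranteed to be byte-identical to the input.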

Andrew White