How to "read" an HTML document in PHP?

Question

I've been struggling with a problem for quite a long time. Unfortunately I was not able to find a solution on my own, so I'm posting my question here.

I am writing a small PHP script that creates a PDF file from a dynamically generated HTML file.

Now I want to "parse" the HTML file and perform an action depending on which tag comes next in the HTML.

E.g.

<div><p>Test</p></div>

My script should recognize:

First tag is a div: call the function for div

Second tag is a p: call the function for p

I don't know what I should search for. Regular expressions? An HTML parser?

Thanks for a hint!

try [DOMDocument](http://php.net/manual/fr/class.domdocument.php) — mgraph, Mar 07 '12 at 15:59
Is this something you could use? http://stackoverflow.com/questions/292926/robust-mature-html-parser-for-php — Honnes, Mar 07 '12 at 15:59
@mgraph: Okay, I'll try this. Thanks a lot! Could you give me a small example? Given markup containing 'Heading' and 'Text', I want to do the following: if #content->h1 is available, call print_h1(). The example has no deeper meaning; I just want to understand the basics, because I tried and couldn't. :-) — user1255102, Mar 07 '12 at 18:08
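
For anyone looking for the kind of example asked for above, here is a minimal sketch of the DOMDocument approach, assuming PHP 7 or later. The handler names `handle_div` and `handle_p` are hypothetical stand-ins for whatever PDF-writing code each tag needs, and the `id="content"` markup in the second part is an assumption, since the markup in the comment did not survive.

```php
<?php
// Hypothetical handlers: replace the bodies with the real per-tag PDF code.
function handle_div(DOMElement $el): string { return 'div'; }
function handle_p(DOMElement $el): string { return 'p: ' . trim($el->textContent); }

// Visit every element in document order and call handle_<tag>() if it is defined.
function walk(DOMNode $node): array
{
    $calls = [];
    if ($node instanceof DOMElement) {
        $handler = 'handle_' . strtolower($node->tagName);
        if (function_exists($handler)) {
            $calls[] = $handler($node);
        }
        // Only elements are descended into; text nodes have nothing to visit.
        foreach ($node->childNodes as $child) {
            $calls = array_merge($calls, walk($child));
        }
    }
    return $calls;
}

libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$doc = new DOMDocument();
$doc->loadHTML('<div><p>Test</p></div>'); // libxml wraps the fragment in <html><body>
$calls = walk($doc->documentElement);
print_r($calls); // $calls is now ['div', 'p: Test']

// "if #content->h1 is available": an XPath query on assumed markup.
$page = new DOMDocument();
$page->loadHTML('<div id="content"><h1>Heading</h1><p>Text</p></div>');
$xpath = new DOMXPath($page);
$hasH1 = $xpath->query('//*[@id="content"]//h1')->length > 0;
if ($hasH1) {
    echo "h1 found inside #content\n"; // the place to call something like print_h1()
}

libxml_clear_errors();
```

Dispatching on the tag name keeps the walker generic: supporting another tag only means defining another `handle_<tag>()` function.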

score 0 · Accepted Answer · answered Mar 07 '12 at 15:58

Try an XML parser. In PHP, SimpleXML is probably what you are looking for, provided the generated HTML is well-formed XML.
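
A minimal sketch of what that could look like, assuming PHP 7 or later and input that parses as XML (SimpleXML rejects markup with unclosed tags). The helper `list_tags` is a hypothetical name used only for illustration:

```php
<?php
// Recursively collect tag names in document order, indented by nesting depth.
function list_tags(SimpleXMLElement $el, int $depth = 0): array
{
    $lines = [str_repeat('  ', $depth) . $el->getName()];
    foreach ($el->children() as $child) {
        $lines = array_merge($lines, list_tags($child, $depth + 1));
    }
    return $lines;
}

$xml = simplexml_load_string('<div><p>Test</p></div>');
if ($xml === false) {
    exit("Input is not well-formed XML\n");
}

$tags = list_tags($xml);
echo implode("\n", $tags), "\n";
```

For the sample from the question this lists `div` followed by an indented `p`; a per-tag function call could replace the name collection at the same spot.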

score 0 · Answer 2 · answered Mar 07 '12 at 16:59

I've used phpQuery several times. It's a nice solution, although it's quite big and seems to be no longer maintained (the last commit was more than 10 months ago).

radmen

score -1 · Answer 3 · answered Mar 07 '12 at 16:03

bPratik

@simplemotives although it isn't best practice, when doing *hacky* things like what the OP is after, I believe the combo fits the bill. If the OP has control over both ends of the system (HTML creation and PDF creation), then the actual HTML file can be written in a standards-compliant way and thus parsed using a good regex – bPratik Mar 07 '12 at 16:11
I still disagree. Even if the OP controls the HTML, he may still have multiple tag variations that complicate regex parsing. Even if this is a hacky script, there are dom libraries that make this a lot easier. http://simplehtmldom.sourceforge.net/ Include the lib, load the html, use selectors to get the elements you want and avoid the complications. – racerror Mar 07 '12 at 17:04
agree, but what do you propose? a parser as suggested by @xato? If so, which one, and am I not right in thinking that any third party parser would internally still use some form of expression matching? then again, this is not really my field of expertise! :) – bPratik Mar 07 '12 at 17:08
