
I have been stuck on a problem for quite a long time. Unfortunately I was not able to find a solution on my own, so I am posting my question here.

I am writing a small PHP script that creates a PDF file from a dynamically created HTML file.

Now I want to "parse" the HTML file and perform an action depending on which tag comes next in the HTML.

E.g.

<div><p>Test</p></div>

My script should recognize:

First tag is a div: call the function for div.
Second tag is a p: call the function for p.

I don't know what I should search for. Regular expressions? An HTML parser?

Thanks for a hint!

user1255102
  • try [DOMDocument](http://php.net/manual/fr/class.domdocument.php) – mgraph Mar 07 '12 at 15:59
  • Is this something you could use? http://stackoverflow.com/questions/292926/robust-mature-html-parser-for-php – Honnes Mar 07 '12 at 15:59
  • @mgraph: Okay, I'll try this. Thanks a lot! Maybe you can give me a little example? With markup like a heading ("Heading") followed by some text ("Text") inside #content, I want to do the following: if #content->h1 is available, call print_h1(). The example has no deeper meaning; I just want to understand the basics, because I tried to but couldn't. :-)
    – user1255102 Mar 07 '12 at 18:08
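
For the small example requested in the comment above, a minimal DOMDocument sketch might look like the following. It is only an illustration under stated assumptions: the markup `<div id="content"><h1>Heading</h1><p>Text</p></div>` is a guess at what the comment meant, and the handler table and the `walk()` helper are hypothetical names, not part of any library. It walks the tree in document order, calls a handler per tag, and separately checks whether `#content` contains an `h1`.

```php
<?php
// Assumed input; the original comment lost its markup, so this is a guess.
$html = '<div id="content"><h1>Heading</h1><p>Text</p></div>';

// Hypothetical handlers keyed by tag name.
// Replace the bodies with the real PDF-writing code for each tag.
$handlers = [
    'div' => function (DOMElement $el) { return 'div'; },
    'h1'  => function (DOMElement $el) { return 'h1'; },
    'p'   => function (DOMElement $el) { return 'p'; },
];

// Depth-first walk in document order; records which handlers were called.
function walk(DOMNode $node, array $handlers, array &$calls)
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $tag = strtolower($child->tagName);
            if (isset($handlers[$tag])) {
                $calls[] = $handlers[$tag]($child);
            }
            walk($child, $handlers, $calls);
        }
    }
}

$doc = new DOMDocument();
// loadHTML wraps fragments in <html><body>; those tags have no handler here,
// so they are simply skipped by the walk.
$doc->loadHTML($html);

$calls = [];
walk($doc, $handlers, $calls);

// "if #content->h1 available": check for an h1 directly inside #content.
$xpath = new DOMXPath($doc);
$hasH1 = $xpath->query('//*[@id="content"]/h1')->length > 0;
```

After this runs, `$calls` lists the handled tags in the order they were encountered and `$hasH1` says whether to call something like `print_h1()`. Keeping the handlers in an array means supporting a new tag is one more entry rather than another `if` branch.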

3 Answers


Try an XML parser. In PHP, SimpleXML is probably what you are looking for.
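
A minimal sketch of this suggestion, assuming the generated HTML is well-formed XML (SimpleXML will refuse markup with unclosed tags). The `visit()` helper is a hypothetical name used only for illustration; it records each tag name in document order, which is where a per-tag function could be called.

```php
<?php
// Hypothetical helper: records tag names depth-first, in document order.
function visit(SimpleXMLElement $el, array &$seen)
{
    $seen[] = $el->getName();   // dispatch on this name in real code
    foreach ($el->children() as $child) {
        visit($child, $seen);
    }
}

// The example from the question; must be well-formed for SimpleXML.
$xml = simplexml_load_string('<div><p>Test</p></div>');

$seen = [];
if ($xml !== false) {
    visit($xml, $seen);
}
// $seen is now ['div', 'p']
```

If the HTML cannot be guaranteed to be well-formed, an HTML-aware parser is likely the safer choice.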


I've used phpQuery several times. It's a nice solution, although it's quite big and seems to be no longer maintained (the last commit was more than 10 months ago).

radmen

What you need to do is read the HTML file into a PHP variable/object: http://www.php-mysql-tutorial.com/wikis/php-tutorial/read-html-files-using-php.aspx

Then use RegEx to parse the HTML tags and attributes: http://www.codeproject.com/Articles/297056/Most-Important-Regular-Expression-for-parsing-HTML
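
A rough sketch of the two steps above, under some loud assumptions: the pattern below is my own simplified one, not the one from the linked article, and it only lists opening tag names in order. It will misfire on comments, `<script>` contents, or attribute values containing `>`, so it is only plausible for HTML you generate yourself. The file name in the comment is hypothetical.

```php
<?php
// In a real script this would be something like:
//   $html = file_get_contents('page.html');   // hypothetical file name
$html = '<div><p>Test</p></div>';

// Match opening tags only: "<", a tag name, optional attributes, ">".
// Closing tags start with "</" and therefore do not match.
preg_match_all('/<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/', $html, $matches);

// Captured tag names in document order, normalized to lower case.
$tags = array_map('strtolower', $matches[1]);
// $tags is now ['div', 'p']
```

Note that this yields a flat list with no nesting information, which is one reason a DOM parser tends to be easier once the markup grows.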

bPratik
  • @simplemotives Although it isn't best practice, for the *hacky* kind of thing the OP is after I believe the combo fits the bill. If the OP controls both ends of the system (HTML creation and PDF creation), then the actual HTML file can be written in a standards-compliant way and thus parsed using a good RegEx. – bPratik Mar 07 '12 at 16:11
  • I still disagree. Even if the OP controls the HTML, he may still have multiple tag variations that complicate regex parsing. Even if this is a hacky script, there are DOM libraries that make this a lot easier: http://simplehtmldom.sourceforge.net/ Include the lib, load the HTML, use selectors to get the elements you want, and avoid the complications. – racerror Mar 07 '12 at 17:04
  • Agreed, but what do you propose? A parser as suggested by @xato? If so, which one? And am I not right in thinking that any third-party parser would internally still use some form of expression matching? Then again, this is not really my field of expertise! :) – bPratik Mar 07 '12 at 17:08