Are there any good PHP libraries that can convert HTML/PHP documents into objects?

Question

I see lots of PHP libraries that can parse HTML. A nice example is QueryPath, which mimics the jQuery API.

However, I am looking to analyse phtml. So the library would need to be good not only at analysing the DOM, but also at analysing the PHP processing instructions, e.g. a PHP Document Object Model, or PDOM.

A document like this:

<?php
require 'NameFinder.php';
$title = 'Wave Hello';
$name = getName();
?><html>
<head>
<title><?php echo $title ?></title>
</head>
<body>
<h1>Hello <?php echo $name ?></h1>
<p>Blah Blah Blah</p>
</body>

I'd like to be able to use this kind of PHP library to read things like:

the inner HTML of a DOM node, found by XPath or CSS selector

as well as possibly offering things like:

a list of PHP functions/methods invoked in the script
values of PHP variables
pages required by that page
a list of PHP variables used before line 5
a list of PHP variables used before the first paragraph of the body element

I could spend some time piecing something together, borrowing code from things like phpDocumentor and Zend Framework Reflection, using the built-in DOM API, introspection, string manipulation, etc.

But if there is some kind of "phtmlQuery" library out there that can do these kinds of things, it would be handy.

What do you mean by `analysing the php processing instructions`? Actually interpreting / executing the PHP code? — nickb, Feb 05 '12 at 20:27
I don't think there is such a thing, and I don't think there *should* be. You might just as well separate the PHP and the HTML code and analyze them separately (the HTML with a parser, and the PHP maybe with a tool like Reflection or the tokenizer). What is your real-world use case for this? — Pekka, Feb 05 '12 at 20:29
There is an XML parser included in PHP core that could do this, but you would only be able to use it on well-formed XHTML pages, not on ordinary HTML or broken XHTML. You would have to set up the parser to handle the processing instructions, and it could get very complicated. — dqhendricks, Feb 05 '12 at 20:30
@nickb - Thanks for your comment. I mean interpreting the [DOM processing instruction nodes](http://www.w3.org/TR/REC-xml/#sec-pi), without executing it. — JW., Feb 05 '12 at 20:32
@pekka Use case - I am trying to write some code to convert an old legacy web site that has PHP mixed in with HTML and reverse engineer it into an MVC framework. It can't do it all programmatically, but there's a lot of boring stuff that the computer can do for me. — JW., Feb 05 '12 at 20:34
@JW. What you are talking about sounds very error-prone and time-consuming. It sounds like you may have better luck doing a rewrite. — dqhendricks, Feb 05 '12 at 20:38
Having tried similar stuff in the past, that sounds like a hopeless enterprise to me. I agree with @dqhendricks - just rewrite the thing. Fancy tools will not help with bad code. — Pekka, Feb 05 '12 at 20:39
@dqhendricks - Yes. I edited my previous comment as you wrote your previous comment. "It can't do it all programmatically, but there's a lot of boring stuff that the computer can do for me". — JW., Feb 05 '12 at 20:41
But what boring stuff would that be exactly? You won't be able to get any variable values without actually *running* the thing (in which case, you may be better off with a debugger anyway). All you will be able to get is maybe a list of variables used... *if* the legacy code isn't using things like variable variables or `eval()` — Pekka, Feb 05 '12 at 21:03
@Pekka - Ok, ok, ok. You've convinced me. It's impossible. I won't try to do it. ......promise. ;o) — JW., Feb 05 '12 at 21:14
@JW well, if you find something in this direction that really helps you deal with legacy code, I'm happy to be proven wrong! But my experience really is that a rewrite is *always* the faster way to go. :) (That said, some of the things you said here sound like a PHP debugger could do them? Maybe it's worth trying some out.) — Pekka, Feb 05 '12 at 21:17
@Pekka - I was being a bit cheeky cos I'm in a cheeky mood at the moment. I understand what you mean. If I were trying to analyse 'bad' arbitrary code then it would be a nightmare. However, it is quite well-structured phtml and I understand the limits of what I can derive from it. I know for sure there are a few tasks that can be automated to save me time. The '[rewrite](http://programmers.stackexchange.com/questions/100680/when-should-you-rewrite) over [refactor](http://www.joelonsoftware.com/articles/fog0000000069.html)' debate is always going to be raging. — JW., Feb 05 '12 at 21:31

score 3 · Accepted Answer · edited May 23 '17 at 11:48

To get the processing instructions (and other nodes) from your files, you can use DOM and XPath:

$dom = new DOMDocument;
$dom->loadHTMLFile('/path/to/your/file/or/url');
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//processing-instruction()') as $pi) {
    echo $dom->saveHTML($pi), PHP_EOL;
}

This will output:

<?php require 'NameFinder.php';
$title = 'Wave Hello';
$name = getName();
?>
<?php echo $title ?>
<?php echo $name ?>

This approach also works with broken HTML. You can find additional libraries in "How do you parse and process HTML/XML in PHP?".
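The question also asks for the inner HTML of a node found by XPath, which the same DOM and XPath combination can provide. The following is only a minimal sketch: `innerHtml()` is a hypothetical helper name, not part of the DOM extension, and the closing `</html>` tag is added to the sample here for tidiness. How processing instructions inside the selected node are serialised may depend on the installed libxml2 version, so the example deliberately selects a node without any.

```php
<?php
// Hypothetical helper (not part of ext/dom): serialise all children of a node.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// The HTML part of the sample page; the nowdoc keeps $title and $name literal.
$phtml = <<<'PHTML'
<html>
<head>
<title><?php echo $title ?></title>
</head>
<body>
<h1>Hello <?php echo $name ?></h1>
<p>Blah Blah Blah</p>
</body>
</html>
PHTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($phtml);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//body/p') as $p) {
    echo innerHtml($p), PHP_EOL;
}
```

A CSS selector front end such as QueryPath could replace the XPath query; the serialisation step would stay the same.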

Once you have the processing instructions, you can either run them through the native Tokenizer or try some of these:

Those won't magically give you the information you seek out of the box, so you will likely need to write a few additional lines on your own.
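As a rough illustration of the Tokenizer route, the sketch below tokenizes the code of a single processing instruction and names each token. `describePi()` is a hypothetical helper, and it assumes the code string has already been stripped of its opening and closing PHP tags.

```php
<?php
// Hypothetical helper: list the tokens of one processing instruction.
// Assumes $code is bare PHP code without the opening and closing PHP tags.
function describePi(string $code): array
{
    $described = [];
    // token_get_all() only recognises PHP code after an opening tag.
    foreach (token_get_all('<?php ' . $code) as $token) {
        if (!is_array($token)) {
            // Single characters such as ';' or '(' come back as plain strings.
            $described[] = ['CHAR', $token];
            continue;
        }
        if ($token[0] === T_OPEN_TAG || $token[0] === T_WHITESPACE) {
            continue; // not interesting for analysis
        }
        $described[] = [token_name($token[0]), $token[1]];
    }
    return $described;
}

foreach (describePi('echo $title') as [$name, $text]) {
    echo $name, ' ', $text, PHP_EOL;
}
```

For `echo $title` this lists a `T_ECHO` token followed by a `T_VARIABLE` token, which is already enough to tell which variables a template fragment prints.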

Thanks. Those Reflection libraries are really good links. Just the kind of thing I need. I'm just starting to learn [QueryPath](http://api.querypath.org/docs/extensions.html). So, when I'm ready, I'll see if I can glue on PHP-Token-Reflection as an extension to QueryPath. If someone doesn't do it first. — JW., Feb 06 '12 at 03:56

score 0 · Answer 2 · answered Feb 05 '12 at 20:37

PHP core includes an XML parser that could do this, but it only works on well-formed XHTML pages, not on ordinary HTML or broken XHTML. You would have to set up the parser to handle the processing instructions, and it could get very complicated.

http://www.php.net/manual/en/book.xml.php

http://www.php.net/manual/en/function.xml-set-processing-instruction-handler.php
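A minimal sketch of that setup, assuming the page is well-formed XML: the sample from the question is given a closing `</html>` tag here, and the `$found` array is just an illustrative name. The handler receives the target (`php`) and the raw code of each processing instruction.

```php
<?php
// Well-formed version of the sample page; the nowdoc keeps the PHP code literal.
$xhtml = <<<'XHTML'
<?php
require 'NameFinder.php';
$title = 'Wave Hello';
$name = getName();
?><html>
<head>
<title><?php echo $title ?></title>
</head>
<body>
<h1>Hello <?php echo $name ?></h1>
<p>Blah Blah Blah</p>
</body>
</html>
XHTML;

$found = []; // code of every php processing instruction, in document order

$parser = xml_parser_create();
xml_set_processing_instruction_handler(
    $parser,
    function ($parser, string $target, string $data) use (&$found) {
        if ($target === 'php') {
            $found[] = trim($data);
        }
    }
);

// xml_parse() returns 1 on success and 0 if the input is not well-formed.
if (xml_parse($parser, $xhtml, true) !== 1) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

foreach ($found as $code) {
    echo $code, PHP_EOL, '--', PHP_EOL;
}
```

Any PHP code that itself contains the closing-tag sequence, for example inside a string literal, would end the processing instruction early, which is one of the ways this gets complicated.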

score 0 · Answer 3 · answered Feb 05 '12 at 20:38

You could use PHP's token_get_all to tokenize the PHP so you could then walk the result and check for function calls and PHP values.

E.g.:

<?php

// Nowdoc (quoted label) keeps $title and $name literal instead of interpolating them.
$src = <<<'EOD'
<?php
require 'NameFinder.php';
$title = 'Wave Hello';
$name = getName();
?><html>
<head>
<title><?php echo $title ?></title>
</head>
<body>
<h1>Hello <?php echo $name ?></h1>
<p>Blah Blah Blah</p>
</body>
EOD;

$tokens = token_get_all($src);

var_dump($tokens);

You still need to write a bit of code to walk over all the tokens, see what they are and then get the value based on the token type (function name, literal string, variable assignment etc), but this does a LOT of work for you as far as parsing the PHP.
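The token walk described above could look something like the sketch below. The helper names `summarisePhtml()` and `variablesBeforeLine()` are made up for illustration, and the heuristics are deliberately simple: only literal `require`/`include` paths written without parentheses are picked up, and namespaced calls, variable variables and `eval()` are not handled.

```php
<?php
// Hypothetical helper: collect variables (with line numbers), call names and
// literal include paths from a phtml source string without executing it.
function summarisePhtml(string $src): array
{
    // Drop whitespace and comments so "previous/next token" checks stay simple.
    $skip = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT];
    $tokens = array_values(array_filter(
        token_get_all($src),
        function ($t) use ($skip) {
            return !is_array($t) || !in_array($t[0], $skip, true);
        }
    ));

    $includes = [T_REQUIRE, T_REQUIRE_ONCE, T_INCLUDE, T_INCLUDE_ONCE];
    $summary = ['variables' => [], 'calls' => [], 'requires' => []];

    foreach ($tokens as $i => $token) {
        if (!is_array($token)) {
            continue; // single-character tokens such as ';' or '('
        }
        [$id, $text, $line] = $token;
        $prev = $tokens[$i - 1] ?? null;
        $next = $tokens[$i + 1] ?? null;

        if ($id === T_VARIABLE) {
            $summary['variables'][$text][] = $line;
        } elseif ($id === T_STRING && $next === '(') {
            // A name followed by "(" is treated as a call unless it is
            // being declared (function foo) or instantiated (new Foo).
            $prevId = is_array($prev) ? $prev[0] : null;
            if ($prevId !== T_FUNCTION && $prevId !== T_NEW) {
                $summary['calls'][] = $text;
            }
        } elseif (in_array($id, $includes, true)
            && is_array($next) && $next[0] === T_CONSTANT_ENCAPSED_STRING) {
            $summary['requires'][] = trim($next[1], '\'"');
        }
    }
    return $summary;
}

// Hypothetical helper: names of variables first seen before the given line.
function variablesBeforeLine(array $summary, int $line): array
{
    $names = [];
    foreach ($summary['variables'] as $name => $lines) {
        if (min($lines) < $line) {
            $names[] = $name;
        }
    }
    return $names;
}

$src = <<<'EOD'
<?php
require 'NameFinder.php';
$title = 'Wave Hello';
$name = getName();
?><html>
<head>
<title><?php echo $title ?></title>
</head>
<body>
<h1>Hello <?php echo $name ?></h1>
<p>Blah Blah Blah</p>
</body>
EOD;

$summary = summarisePhtml($src);
print_r($summary);
print_r(variablesBeforeLine($summary, 5));
```

Because the tokenizer records a line number for every multi-character token, questions such as "which variables are used before line 5" reduce to a filter over the collected line numbers. Variable values, on the other hand, cannot be recovered this way without running the code.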

Thanks for the tips. Yes I fear it would be a lot of work. I dream of some nice tool that is already out there that makes it nice n' easy. :o) — JW., Feb 05 '12 at 20:48
