Automatically parsing PHP to separate PHP code from HTML

Question

I'm working on a large PHP code base; I'd like to separate the PHP code from the HTML and JavaScript. (I need to do several automatic search-and-replaces on the PHP code, different ones on the HTML, and different ones again on the JS.) Is there a good parser engine that could separate out the PHP for me? I could do this using regular expressions, but they're not perfect. I could build something in ANTLR, perhaps, but a good existing solution would be best.

I should make clear: I don't want or need a full PHP parser. I just need to know whether a given token is:
- PHP code
- PHP single-quoted string
- PHP double-quoted string
- PHP comment
- Not PHP, but rather HTML/JavaScript

I don't think you can build regex-es for any of PHP, HTML or JavaScript. Maybe for particular subsets of them. — Alin Purcaru, Nov 07 '10 at 17:23

Paul Dixon · Answer 1 · 2010-11-08T10:37:15.130

How about the tokenizer built right into PHP itself?

The tokenizer functions provide an interface to the PHP tokenizer embedded in the Zend Engine. Using these functions you may write your own PHP source analyzing or modification tools without having to deal with the language specification at the lexical level.

You ask in the comments whether you can regenerate the code from the tokenized output - yes, you can: all whitespace is preserved as T_WHITESPACE tokens. Here's how you might turn the tokenized output back into code:

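$regenerated = '';

// $code holds the full source text of the file being processed
$tokens = token_get_all($code);
foreach ($tokens as $t)
{
    if (is_array($t))
    {
        // Array tokens are [token id, text, line number];
        // do something with strings and comments here?
        switch ($t[0])
        {
            case T_CONSTANT_ENCAPSED_STRING:
                // quoted string literal without interpolation
                break;
            case T_COMMENT:
            case T_DOC_COMMENT:
                break;
        }
        $regenerated .= $t[1];
    }
    else
    {
        // Single-character tokens such as ; or { are plain strings
        $regenerated .= $t;
    }
}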
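
The five categories from the question can be read off the same token stream. What follows is only a rough sketch: the function name classify_tokens and the category labels are made up for illustration, and it assumes PHP 7.1 or later. Interpolated double-quoted strings are not a single token, so the sketch toggles a flag on the bare " tokens that delimit them. Heredocs, binary-prefixed literals and double-quoted strings nested inside {$...} expressions are not handled.

```php
<?php
// Classify each token into one of the five categories from the question.
// The category names are illustrative labels, not part of PHP's API.
function classify_tokens(string $code): array
{
    $result = [];
    $inDoubleQuotes = false; // true while inside an interpolated "..." string

    foreach (token_get_all($code) as $t) {
        // Single-character tokens such as ; or " arrive as plain strings.
        $id   = is_array($t) ? $t[0] : null;
        $text = is_array($t) ? $t[1] : $t;

        if ($id === T_INLINE_HTML) {
            $kind = 'html';
        } elseif ($id === T_COMMENT || $id === T_DOC_COMMENT) {
            $kind = 'comment';
        } elseif ($id === T_CONSTANT_ENCAPSED_STRING) {
            // A literal without interpolation; its first character gives the quote style.
            $kind = $text[0] === "'" ? 'single_quoted' : 'double_quoted';
        } elseif ($id === null && $text === '"') {
            // Opening or closing quote of an interpolated string.
            $inDoubleQuotes = !$inDoubleQuotes;
            $kind = 'double_quoted';
        } elseif ($inDoubleQuotes) {
            $kind = 'double_quoted';
        } else {
            $kind = 'php';
        }
        $result[] = [$kind, $text];
    }
    return $result;
}
```

Concatenating the second element of every pair gives back the original source, so search-and-replace can be applied per category and the file rebuilt afterwards.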

If I have the tokens, how do I regenerate the source text? With my original indentation and line spacing? With comments? Will the original programmer accept the result? — Ira Baxter, Nov 08 '10 at 01:33
Yes, whitespace and comments are treated like any other token - see my extended answer for how you'd rebuild the original source from the tokenized data. — Paul Dixon, Nov 08 '10 at 10:43

score 3 · Answer 2 · edited May 23 '17 at 11:53

To separate the PHP from the rest, PHP's inbuilt tokenizer is your best choice: See token_get_all()

For the rest, you might be best off with a DOM parser. Isolating the <script> parts (and external scripts, and even onXXXX events) is easy that way.
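
As a minimal sketch of the DOM approach, assuming PHP's bundled dom extension is available: the function name extract_scripts and the keys of its result array are invented for this example. It expects markup with the PHP already removed (for instance the inline HTML chunks from the tokenizer); a chunk that PHP code cuts off mid-element may not parse the way you expect.

```php
<?php
// Collect inline script bodies, external script URLs and on* event handlers
// from an HTML string. The function name and result keys are illustrative.
function extract_scripts(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = ['inline' => [], 'external' => [], 'handlers' => []];

    foreach ($doc->getElementsByTagName('script') as $script) {
        if ($script->hasAttribute('src')) {
            $found['external'][] = $script->getAttribute('src');
        } else {
            $found['inline'][] = $script->textContent;
        }
    }

    // Event handler attributes (onclick, onload, ...) on any element.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//@*[starts-with(name(), "on")]') as $attr) {
        $found['handlers'][] = [$attr->nodeName, $attr->value];
    }
    return $found;
}
```

This only reads the scripts out; writing modified scripts back through saveHTML() will normalise the markup.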

It might be tough to re-build the identical document from a parsed DOM tree, though - I guess it depends on what you need to do with the results and how clean the original HTML is. A regular expression (yuck!) could work better for that part.

Ira Baxter · Answer 3 · 2010-11-08T01:31:36.997

If all you want to do is to inspect the tokens, then the PHP tokenizer, as others have suggested, might be a good choice.

If what you want to do is to automatically change the source code in a reliable way, I'm not sure that will help you. How will you regenerate the modified source text?

Another way to do this is to use a program transformation engine. Such an engine can parse the source text to abstract syntax trees, capturing the structure of the program (as well as the effective content of all the tokens), and allow searching and transforming of those ASTs using reliable pattern matches/transformations. To do this well, you need an engine that parses PHP reliably, and can reproduce compilable source text from the changed AST.

Our DMS Software Reengineering Toolkit is such a program transformation system, and it has a robust PHP Front End that can process PHP5 accurately in terms of parsing, transforming and prettyprinting the result back to text. (Getting the PHP parser right is hard because the language is poorly documented.) Because the front end can pick up the HTML and the PHP code accurately, you don't need to separate out the text; they will be parked in clearly distinguishable places in unique tree nodes.

To change all echoed strings from lowercase to uppercase, you'd use DMS to parse the PHP, and then apply the following transformation rule:

 rule uppercase_echoed_string(s: STRING): statement -> statement
 =   "echo \s;" ->  "echo \uppercase\(\s\);".

This rule is written in DMS's Rule Specification Language (RSL), which is clearly not PHP. The stuff inside quote marks is PHP code; those are meta quotes wrapped around the text of the programming language being manipulated. The \ character is a meta-escape: \s indicates a metavariable that must match a string literal, \uppercase is the name of a DMS function external to the RSL language and the ( ) are meta parentheses around the meta-function call to uppercase, applied to the matched string \s. Because the rule operates on the ASTs, it cannot be confused; it won't change the text of /* echo 'def' */ because that isn't a statement.

You likely need several rules to handle the variety of syntax combinations: STRING in this case refers to just singly-quoted literal strings; doubly-quoted strings aren't monolithic entities but are composed of a series of QUOTED_STRING_FRAGMENTS that correspond to the text in a doubly quoted string between the PHP expressions inside that doubly-quoted string.
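
For comparison only, here is a rough token-level approximation of that rule using PHP's own tokenizer. This is not DMS and not how DMS works internally; the function name uppercase_echoed_strings is hypothetical. It has no notion of statements: it uppercases any single-quoted literal that directly follows echo, including the first operand of echo 'a' . $b;, which the AST rule above would not match. It does leave /* echo 'def' */ alone, since the whole comment is a single token.

```php
<?php
// Uppercase single-quoted literals that directly follow `echo`.
// Token-level approximation only; there is no AST here.
function uppercase_echoed_strings(string $code): string
{
    $out = '';
    $afterEcho = false; // true when the last significant token was `echo`

    foreach (token_get_all($code) as $t) {
        if (!is_array($t)) {
            // Single-character token such as ; or (
            $out .= $t;
            $afterEcho = false;
            continue;
        }
        [$id, $text] = $t;

        if ($id === T_ECHO) {
            $afterEcho = true;
        } elseif ($id === T_WHITESPACE || $id === T_COMMENT || $id === T_DOC_COMMENT) {
            // Whitespace and comments neither set nor clear the flag.
        } else {
            if ($afterEcho && $id === T_CONSTANT_ENCAPSED_STRING && $text[0] === "'") {
                $text = strtoupper($text);
            }
            $afterEcho = false;
        }
        $out .= $text;
    }
    return $out;
}
```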

At the end of the transformation process, the changed AST is emitted complete with the original indentation and comments except where the transformations have been applied.

There's also a fully language accurate JavaScript parser for DMS, too, which you'd need if you wanted to process the content of SCRIPT tags accurately.

If you want to make reliable changes to source code, this IMHO is the only good way to do it. You can try string hacking and regular expressions, but parsing PHP requires a context-free parser and REs don't do that, so any result you get won't be trustworthy.

Very interesting, Ira. That's certainly a robust approach. Ironically, one difficulty it might have is when PHP code is commented out, but should still be modified (a regex hack would still find it...). Also - the "program transformation" link you wrote is 404ing. — SRobertJames, Nov 07 '10 at 23:46
@SRobertJames: If you have code, sometimes in comments, sometimes not, you aren't going to be able to make reliable changes. How can you tell which comments are just comments, and which contain real code? If you insist on mixing these, you might consider editing the comments containing real code with a marker, indicating they are real code; then using DMS you could process the regular code, and for comments, you could check for marker and apply the DMS parser to the comment body, make the changes, prettyprint the mods and put them back into comments. (Link fixed, sorry). — Ira Baxter, Nov 07 '10 at 23:58
@SRobertJames: That raises the question as to why you have commented out code. If the point of commenting it out is to disable it, you'd be better off putting that code inside a conditional that is controlled by a configuration switch, e.g., *if feature7 { code }* . That gives you the advantage of disabling the code, of enabling it by setting a feature switch, and not having to deal with the ambiguity of "does this comment contain real code?" — Ira Baxter, Nov 08 '10 at 00:00
