
Using PHP, I am trying to find an easy way of parsing HTML files that also contain non-HTML content such as custom tags & inline PHP code segments. An example of elements I need to cater for without it choking would be as follows:

<!DOCTYPE html>
<html [[angular tag 1]]>
<head <?php echo 'php snippet 1'; ?>>
    <title {{curly tag 1}}></title>
    <link [[angular tag 2]]="{{curly tag 2}}.css" />
    <script src="<?php echo 'php snippet 2'; ?>.js"></script>
</head>

<body>
    <?php echo 'php snippet 3'; ?>
    <!-- comment 1 -->
    [[angular tag 3]]
</body>
</html>

This is just a simple example and another need might be to process partial HTML snippets that don't necessarily include the html, head & body tags. As you can see tags & PHP snippets can occur anywhere throughout the document as long as they are properly nested within that relevant entity:

  • as HTML tags (top level or nested);
  • as attributes (with or without a value);
  • inside attribute values.

I need the PHP code snippets, curly "tags" and angular "tags" to be parsed into tokens. They do not need to be processed themselves; I will do that after parsing. At this stage I also don't see the need to cater for tags nested within themselves or within the PHP code snippets.

Ideally I would like to find a library or at the very least a set of files that already implement something that can do this; and not have to do it myself.

As far as I know DOMDocument & SimpleXML don't support malformed XML syntax or foreign elements so they cannot be used to process this unless I strip out the custom tags & php code and then re-insert it afterwards; but that would probably require just as much work as rolling my own parser.

Caveat: Please reserve comments about not including php code in view logic, etc. I am aware of these sorts of design principles.

Precastic
  • So what's your question? And before you say, "how do I do it?", remember that questions like that are too broad and will be closed as such. So if you've written any code thus far I highly recommend adding it to your question. You should read "[how do I ask a good question?](http://stackoverflow.com/help/how-to-ask)". – John Conde May 30 '15 at 12:17
  • Please see answers to this question: http://stackoverflow.com/questions/2093228/lex-and-yacc-in-php – Kuba Wyrostek May 30 '15 at 12:18
  • @JohnConde I am looking for ANY suggestions of how to accomplish this - from a library to PHP classes to how to implement it myself. If I had written code I would not have said "not have to do it myself". Not all questions are simple 1 liners that fit into a neat little box! Stackoverflow is heading towards those sorts of questions these days - so sad! – Precastic May 30 '15 at 12:30
  • @KubaWyrostek Thanks for the link, but none of the answers conclude anything other than "do it yourself", since the PEAR package is no longer maintained & that was the only real suggestion – Precastic May 30 '15 at 12:33

3 Answers

It's important to understand that the mere presence of code snippets of the format <?php ?> doesn't make your code invalid.

Both SGML and XML support processing instructions, which XML writes in the form <?PITarget PIContent?>. Any parser that doesn't know how to handle a particular processing instruction is expected to ignore it. For example, browsers typically ignore any PHP code they find.

Processing instructions are exposed in the Document Object Model as nodes whose node type is Node.PROCESSING_INSTRUCTION_NODE. If you parse your document in PHP as a DOMDocument, such nodes have the node type XML_PI_NODE. You can also find them in your DOMDocument using the processing-instruction() XPath node test.
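As a rough illustration of the paragraph above, the following sketch loads a small, well-formed XML fragment and lists its processing instructions via XPath. The helper name findProcessingInstructions and the sample markup are made up for this example. It assumes input that is well-formed XML, so a PHP block inside a tag or an attribute value would still fail to load.

```php
<?php
// Hypothetical helper: returns the target and data of every processing
// instruction found in a well-formed XML string.
function findProcessingInstructions(string $xml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    $found = [];

    // processing-instruction() is an XPath node test that matches PI nodes.
    foreach ($xpath->query('//processing-instruction()') as $pi) {
        if ($pi->nodeType === XML_PI_NODE) {
            $found[] = ['target' => $pi->target, 'data' => $pi->data];
        }
    }

    return $found;
}

// Sample input (an assumption for this sketch): one PHP block between elements.
$xml = "<body><?php echo 'php snippet 3'; ?><p>text</p></body>";

foreach (findProcessingInstructions($xml) as $pi) {
    echo $pi['target'], ': ', trim($pi['data']), PHP_EOL;
}
```

The PHP code itself is never executed here; it is only exposed as the data of a node, which is what makes it usable as a token.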

If you have code that is valid HTML5 but not valid XML, you might want to try Masterminds/html5-php. I use it myself under the hood of PHPPowertools/DOM-Query. I'm not sure how well it works with invalid HTML5, though, nor what it does with processing instructions.

John Slegers
  • You just made my day! Using the `Masterminds/html5-php` library doesn't produce any errors during parsing; unlike the DOMDocument & SimpleXML classes. Although it doesn't get it 100% right it should be a good enough starting point for me since it is open source! Thanks!!! – Precastic May 30 '15 at 13:03
  • I also didn't know about the validity of HTML with regards to PHP style tags so that info also helped! – Precastic May 30 '15 at 13:06
  • See answer below (http://stackoverflow.com/a/30547027/799588) for a further explanation of findings – Precastic May 30 '15 at 14:18

Based on the insight given in John's answer & some deductions made from the output given by Masterminds/html5-php, I have found that the only real problem I was having with DOMDocument was that I was using PHP tags inside HTML opening or closing tags, i.e. between the < & > characters. In hindsight this all makes perfect sense.

So the only parts of the offending HTML template that actually stop it from parsing properly are <head <?php echo 'php snippet 1'; ?>> and <script src="<?php echo 'php snippet 2'; ?>.js">, since they contain nested angle brackets, which are fundamentally invalid HTML.

This means that simply updating the HTML template to use custom tags in those instances does away with the malformed output & critical parsing errors. This is satisfactory for my needs & it actually feels more elegant, because it doesn't result in nested angle brackets in the HTML template, even though the PHP parser handles them when processing a PHP file.

The updated workable template would look something like this instead:

<!DOCTYPE html>
<html [[angular tag 1]]>
<head [[replaced PHP code snippet 1]]>
    <title {{curly tag 1}}></title>
    <link [[angular tag 2]]="{{curly tag 2}}.css" />
    <script src="[[replaced PHP code snippet 2]].js"></script>
</head>

<body>
    <?php echo 'php snippet 3'; ?>
    <!-- comment 1 -->
    [[angular tag 3]]
</body>
</html>
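A template like the one above could also be produced mechanically instead of by hand. The sketch below is one possible approach and not necessarily how the template above was made: it uses PHP's built-in tokenizer (token_get_all) to swap every PHP block for a placeholder and keeps the original source so it can be re-inserted after parsing. The function name replacePhpBlocks and the [[php:N]] placeholder format are assumptions for illustration. Unlike the template above, it replaces all PHP blocks, including those that sit between elements.

```php
<?php
// Hypothetical helper: replaces each PHP block in a template with a
// placeholder such as [[php:0]] and returns [html, placeholder => source].
function replacePhpBlocks(string $template): array
{
    $html = '';
    $snippets = [];
    $current = null; // source of the PHP block being collected, if any

    foreach (token_get_all($template) as $token) {
        // Tokens are either [id, text, line] arrays or single-character strings.
        $id   = is_array($token) ? $token[0] : null;
        $text = is_array($token) ? $token[1] : $token;

        if ($current === null) {
            if ($id === T_OPEN_TAG || $id === T_OPEN_TAG_WITH_ECHO) {
                $current = $text;   // a PHP block starts here
            } else {
                $html .= $text;     // plain template text (T_INLINE_HTML)
            }
            continue;
        }

        $current .= $text;
        if ($id === T_CLOSE_TAG) {
            // Note: a newline directly after ?> belongs to the close tag token.
            $placeholder = '[[php:' . count($snippets) . ']]';
            $snippets[$placeholder] = $current;
            $html .= $placeholder;
            $current = null;
        }
    }

    if ($current !== null) {
        // Unterminated PHP block at the end of the template.
        $placeholder = '[[php:' . count($snippets) . ']]';
        $snippets[$placeholder] = $current;
        $html .= $placeholder;
    }

    return [$html, $snippets];
}

$template = "<head <?php echo 'php snippet 1'; ?>><script src=\"<?php echo 'php snippet 2'; ?>.js\"></script></head>";
[$html, $snippets] = replacePhpBlocks($template);
echo $html, PHP_EOL;
```

After the DOM has been processed, strtr($html, $snippets) puts the original PHP source back in place of the placeholders.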

The code I used to test this was:

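use Masterminds\HTML5; // installed via Composer: masterminds/html5

// $szTemplate holds the template source; $log is any logger with an info() method.
// Change the value passed to switch to choose which parser to test.
switch (1) {
    case 1: {
        $log->info('Masterminds/html5-php');
        $html5 = new HTML5();
        $dom = $html5->loadHTML($szTemplate);
        echo $html5->saveHTML($dom);
        exit;
    }
    case 2: {
        $log->info('DOMDocument');
        $doc = new \DOMDocument();
        $doc->loadHTML($szTemplate);
        echo $doc->saveHTML();
        exit;
    }
}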
Precastic

If you want to do this after the PHP has already been executed, include the file that contains the tokens, capture the resulting HTML using output buffering, and then parse the remaining tags.
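A minimal sketch of that capture step, assuming the template is a file that can simply be included. The function name renderTemplate and the temporary demo template are made up for this example.

```php
<?php
// Hypothetical helper: runs the PHP in a template file and returns the
// resulting output as a string instead of sending it to the browser.
function renderTemplate(string $path): string
{
    ob_start();                  // start capturing output
    try {
        include $path;           // PHP blocks in the template are executed here
    } catch (Throwable $e) {
        ob_end_clean();          // discard partial output before re-throwing
        throw $e;
    }
    return (string) ob_get_clean();
}

// Demo with a throwaway template file (its content is an assumption for this sketch).
$path = tempnam(sys_get_temp_dir(), 'tpl');
file_put_contents($path, "<p><?php echo 'php snippet 3'; ?> {{curly_tag_1}}</p>");

$html = renderTemplate($path);
unlink($path);

echo $html, PHP_EOL;
```

At this point $html contains the output of the PHP blocks, while tokens such as {{curly_tag_1}} are still present and ready to be replaced.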

When you have the parsed HTML captured in a variable, you'd either:

  1. preg_match_all('#\{\{([[:alnum:]_]+)\}\}#', $HTML, $curlies_found); to capture the tokens, and then replace the matched tokens with the corresponding values, e.g. by looping over the matched tokens and replacing them with the values stored under the same keys in your $curly_tokens array.

  2. str_replace over the HTML with all of your token variables, e.g. str_replace('{{token_foo}}', $curly_tokens['token_foo'], $HTML);.

Repeat the process for each type of token. The first approach may be more economical if you have a lot of tokens to search for, many of which may not be present in a given template. If you have a small number of tokens that are mostly present in every template, the second is likely faster.
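The two approaches could look roughly like this for the curly tokens. This is a sketch under assumptions: the function names are invented for illustration, token names are limited to letters, digits and underscores, and all values are plain strings.

```php
<?php
// Approach 1, step 1: find the distinct {{token}} names used in the HTML.
function findCurlyTokens(string $html): array
{
    preg_match_all('#\{\{([[:alnum:]_]+)\}\}#', $html, $matches);
    return array_values(array_unique($matches[1]));
}

// Approach 1, step 2: replace only the tokens that were actually found.
// Tokens without a matching value are left untouched.
function replaceFoundTokens(string $html, array $values): string
{
    foreach (findCurlyTokens($html) as $name) {
        if (array_key_exists($name, $values)) {
            $html = str_replace('{{' . $name . '}}', $values[$name], $html);
        }
    }
    return $html;
}

// Approach 2: run str_replace once with every known token, found or not.
function replaceAllTokens(string $html, array $values): string
{
    $search = array_map(function ($name) {
        return '{{' . $name . '}}';
    }, array_keys($values));

    return str_replace($search, array_values($values), $html);
}

$curly_tokens = ['token_foo' => 'bar'];
$HTML = '<p>{{token_foo}} {{unknown_token}}</p>';

echo replaceFoundTokens($HTML, $curly_tokens), PHP_EOL;
echo replaceAllTokens($HTML, $curly_tokens), PHP_EOL;
```

The same pattern works for the [[...]] tokens by changing the delimiters in the regular expression and in the search strings.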

I don't think you need a library for this; a few dozen lines of code at most are sufficient for a basic implementation of token parsing. Please see my answer here on simple token parsing.

If you converted the PHP snippets you have in your HTML into tokens, you could then simply use file_get_contents to fetch your HTML templates and parse for the tokens, rather than fiddling with include and output buffering. But whichever way works best for you, your call.

Markus AO
  • Thanks Markus, but one should never use regular expressions to parse HTML (see http://stackoverflow.com/a/6751339/799588). I should probably have stated that I also need to process the HTML tags & attributes; hence I am not looking for only the tags as tokens but also the HTML elements & their attributes. – Precastic May 30 '15 at 12:39
  • Parsing HTML is one thing. Here we aren't parsing HTML, since we have no need to understand the DOM structure. We're just replacing tokens using regular expressions. – Markus AO May 30 '15 at 12:44
  • It would be useful if you posted the data you intend to use for the tokens that need to be replaced. If you actually need to have HTML tags and attributes generated, that's an issue different from simple token replacement. – Markus AO May 30 '15 at 12:45
  • I am needing to understand the DOM structure - that is what I meant by "I am not looking for only the tags as tokens but also the HTML elements & their attributes" - I need the HTML tags & their attributes in tokens as well so that I can process them – Precastic May 30 '15 at 12:47
  • What I'm inserting into the tags is irrelevant since that happens after the parsing stage. Otherwise it becomes a dependency mess. – Precastic May 30 '15 at 12:48
  • So are you saying that the processing of each token also depends on the tag it's surrounded by? The example you've provided seems to be a straight-forward job that wouldn't call for a full DOM parsing. Simply put your tags and their properties into an array and generate HTML tags and attributes for insertion. In your example, I don't see how you could break the DOM by inserting ready-parsed HTML. Perhaps you can clarify the question by spelling out a bit more explicitly what you hope to do with the tokens, what source data you have, and what should be done to it before it's good to insert. – Markus AO May 30 '15 at 12:51
  • I now have a good starting point from John's answer so no need to discuss this further, but thanks for your help! – Precastic May 30 '15 at 13:05