Using PHP, I am trying to find an easy way to parse HTML files that also contain non-HTML content, such as custom tags and inline PHP code segments. An example of the elements the parser needs to handle without choking:
<!DOCTYPE html>
<html [[angular tag 1]]>
<head <?php echo 'php snippet 1'; ?>>
<title {{curly tag 1}}></title>
<link [[angular tag 2]]="{{curly tag 2}}.css" />
<script src="<?php echo 'php snippet 2'; ?>.js"></script>
</head>
<body>
<?php echo 'php snippet 3'; ?>
<!-- comment 1 -->
[[angular tag 3]]
</body>
</html>
This is just a simple example; another requirement might be to process partial HTML snippets that don't necessarily include the html, head and body tags. As you can see, custom tags and PHP snippets can occur anywhere in the document, as long as they are properly nested within the entity that contains them. They can appear:
- as HTML tags (top level or nested);
- as attributes (with or without a value);
- inside attribute values.
I need the PHP code snippets, curly "tags" and angular "tags" to be parsed into tokens. They do not need to be processed during parsing; I will do that afterwards. At this stage I also don't see a need to support nesting, either of tags within themselves or of tags within the PHP code snippets.
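To illustrate the kind of token output I mean, here is a minimal sketch using preg_split. The function name tokenizeForeign and the 'type'/'text' array shape are made up for this example, and it only separates the foreign constructs from the surrounding markup; it does not parse the HTML itself. It assumes no nesting, and it would break on a PHP snippet whose string literals contain ?>.

```php
<?php
// Hypothetical helper (not from any library): split a template into
// literal markup chunks and foreign tokens, in document order.
function tokenizeForeign(string $source): array
{
    // One alternative per foreign construct. The matches are non-greedy,
    // so nested constructs are NOT supported (an assumption of this sketch).
    $pattern = '/(<\?php.*?\?>|\{\{.*?\}\}|\[\[.*?\]\])/s';

    // With a single capture group and DELIM_CAPTURE, the result alternates:
    // even indexes hold plain markup, odd indexes hold the matched delimiters.
    $parts = preg_split($pattern, $source, -1, PREG_SPLIT_DELIM_CAPTURE);

    $tokens = [];
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Plain markup; skip the empty strings between adjacent tokens.
            if ($part !== '') {
                $tokens[] = ['type' => 'html', 'text' => $part];
            }
            continue;
        }
        // Classify the delimiter by its first two characters.
        $prefix = substr($part, 0, 2);
        if ($prefix === '{{') {
            $type = 'curly';
        } elseif ($prefix === '[[') {
            $type = 'angular';
        } else {
            $type = 'php';
        }
        $tokens[] = ['type' => $type, 'text' => $part];
    }
    return $tokens;
}
```

Calling tokenizeForeign('<link [[a]]="{{b}}.css" />') would give five tokens: html, angular, html, curly, html. What I am missing is the part that also understands the HTML structure around those tokens.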
Ideally, I would like to find a library, or at the very least a set of files, that already does this, so that I don't have to write it myself.
As far as I know, DOMDocument and SimpleXML don't support malformed XML syntax or foreign elements, so they cannot be used for this unless I strip out the custom tags and PHP code and re-insert them afterwards. That would probably require just as much work as rolling my own parser.
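For reference, the strip-and-re-insert idea could be sketched roughly as follows. The names maskForeign and unmaskForeign and the x-foreign-N-x placeholder format are hypothetical, chosen only for this example. The sketch covers just the masking and restoring steps; whether DOMDocument would then accept the masked markup in every position (for example a placeholder used as an attribute name) is an assumption I have not verified.

```php
<?php
// Hypothetical helper: replace each foreign construct with a unique
// placeholder and record the original text in $map (placeholder => original).
function maskForeign(string $source, array &$map): string
{
    // Same non-nesting assumption as before: non-greedy matches only.
    $pattern = '/<\?php.*?\?>|\{\{.*?\}\}|\[\[.*?\]\]/s';

    return preg_replace_callback(
        $pattern,
        function (array $m) use (&$map): string {
            // The trailing "-x" keeps "x-foreign-1-x" from being a prefix
            // of "x-foreign-10-x".
            $key = 'x-foreign-' . count($map) . '-x';
            $map[$key] = $m[0];
            return $key;
        },
        $source
    );
}

// Hypothetical helper: put the original constructs back.
function unmaskForeign(string $masked, array $map): string
{
    // strtr with an array replaces every key in a single pass and never
    // rescans text it has already substituted.
    return strtr($masked, $map);
}

// Usage: $map must be initialised by the caller.
$map = [];
$masked = maskForeign('<link [[a]]="{{b}}.css" />', $map);
// $masked is now '<link x-foreign-0-x="x-foreign-1-x.css" />'
$restored = unmaskForeign($masked, $map);
```

Even with this in place, I would still have to map the placeholders found in the parsed tree back to tokens, which is the part that looks like as much work as a custom parser.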
Caveat: please refrain from comments about not including PHP code in view logic, etc. I am aware of these design principles.