4

Context: my HTML5 documents do not need JavaScript, animations, forms, and so on. They are "content only", so those kinds of features can be filtered out; I need only a constrained subset of the "full HTML5 representation". A good way to express this situation (and other, broader ones!) is to say "my documents can be expressed within the Polyglot Markup constraints".

Question: Is there a tool that transforms "any HTML5" into Polyglot XHTML5 (or filters it, discarding spurious information)?
Preferably a tool based on DOM extensions (or XSLT or XQuery).

Peter Krauss

2 Answers

2

I don't have a complete solution. As I see it, such a conversion has two or even three stages:

Stage 1: get the HTML5 well formed

There is something of a black art to this first phase: HTML5 does not require well-formed markup, and that has to be accommodated.

You need this before you have a DOM, and before tools that expect something remotely resembling XML have any chance of working.

So who has implemented such a conversion? (Almost?) every browser, and quite a few have their source code available. You can also get this out of a running browser: feed it tag soup, inspect the resulting document, and you will see well-structured markup instead.

Another place to find such source code is in editors that let you edit XHTML in a web page (FCKeditor and the like).

For example, `<p>para<ul><li>bullet</ul><p>para` gets changed into `<p>para</p><ul><li>bullet</li></ul><p>para</p>`.
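To make the idea concrete, here is a toy tag balancer in Python (my choice of language, built on the standard library's `html.parser`). It is only a sketch: it knows about implied end tags for `p` and `li` and nothing else, whereas a real HTML5 parser implements a far larger set of rules. The `Balancer` class, the `balance` helper, and the `CLOSES_P` subset are all hypothetical names invented for this illustration.

```python
from html import escape
from html.parser import HTMLParser

# Void elements are written as <br/>; the list is an assumption, check the spec.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}
# Start tags that implicitly close an open <p> (a small illustrative subset).
CLOSES_P = {"p", "ul", "ol", "div", "table", "blockquote", "pre",
            "h1", "h2", "h3", "h4", "h5", "h6"}


class Balancer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # serialized output fragments
        self.stack = []  # names of currently open elements

    def _close_until(self, tag):
        # Emit end tags for open elements up to and including `tag`.
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_P and "p" in self.stack:
            self._close_until("p")
        if tag == "li" and self.stack and self.stack[-1] == "li":
            self._close_until("li")
        # Valueless attributes (e.g. disabled) are expanded to name="name".
        attr_text = "".join(
            f' {name}="{escape(value if value is not None else name)}"'
            for name, value in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # HTML ignores a trailing slash on non-void elements: treat as a start tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Stray end tags with no matching open element are dropped.
        if tag in self.stack:
            self._close_until(tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def result(self):
        self.close()  # flush any buffered trailing text
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)


def balance(markup):
    parser = Balancer()
    parser.feed(markup)
    return parser.result()


print(balance("<p>para<ul><li>bullet</ul><p>para"))
# prints <p>para</p><ul><li>bullet</li></ul><p>para</p>
```

For real input you would still want a proper HTML5 parser (a browser, or a tool such as tidy-html5); the sketch ignores comments, doctypes, and the special handling that `script` and `style` content needs.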

Stage 2: filter out what's not allowed in Polyglot

Once the HTML tags are well structured, the next step is to remove whatever is not allowed in polyglot markup because an HTML parser and an XML parser would interpret it differently.

Here you might have a chance with XSLT and a hand-built filter, but you cannot validate all of it, as there is no DTD or anything equivalent to validate polyglot (X)HTML against. Even the few validators for XHTML5 that existed are being (or have been) scrapped, which makes your quest a difficult one.

Anyway, trying to locate the source of one of those validators is your best option for finding source code that comes close to this.
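As a rough illustration of what such a filter has to do, here is a minimal Python sketch (Python and `xml.etree.ElementTree` are my choice, not something this answer depends on). It assumes the input is already well-formed XML without namespaces, drops `script` elements and `on*` event attributes, writes void elements as `<br/>`, and writes every other element with an explicit end tag. The names `to_polyglot`, `VOID`, and `DROPPED` are hypothetical, and the void-element list should be checked against the polyglot document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Assumed list of void elements; verify it against the polyglot markup spec.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}
# Elements removed entirely (illustrative filter list).
DROPPED = {"script"}


def to_polyglot(elem):
    """Serialize an element tree using the polyglot empty-element rules."""
    if elem.tag in DROPPED:
        return ""
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in elem.attrib.items()
        if not name.lower().startswith("on"))  # drop event handlers
    if elem.tag in VOID:
        # Void elements use the empty-element syntax; any content is ignored.
        return f"<{elem.tag}{attrs}/>"
    inner = escape(elem.text or "")
    for child in elem:
        # Text following a child (its tail) is kept even if the child is dropped.
        inner += to_polyglot(child) + escape(child.tail or "")
    # Non-void elements always get an explicit end tag, even when empty.
    return f"<{elem.tag}{attrs}>{inner}</{elem.tag}>"


root = ET.fromstring(
    '<div><p/><br></br><script>x()</script>text<hr class="a"/></div>')
print(to_polyglot(root))
# prints <div><p></p><br/>text<hr class="a"/></div>
```

This covers only the empty-element rules and a trivial content filter; the polyglot document lists further constraints (namespaces, character encoding, attribute casing, and so on) that a real filter would also need to enforce.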

Stage 3: fix the external entities

Say what? Well, you can have beautiful polyglot (X)HTML, include a single JavaScript file that does a single document.write, and it all still fails. So you'll need to hunt down all of that too before it works.

  • Thanks (!), let's see. *Stage-1*: I am using something like `DOMDocument::loadHTML($HTML, LIBXML_NOCDATA | LIBXML_NOWARNING |...etc.. )` and enabling [**recover** mode](http://nl1.php.net/manual/en/class.domdocument.php#domdocument.props.recover)... *Stage-2*: At that time (~10 months ago) I was using [proper XSLT as here](https://github.com/ppKrauss/HTML5-onlyContent), as you described; *Stage-3*: no problem, the internal representation of a standard DOM is always UTF-8, so `saveXML()` or `C14N()` returns good XHTML. – Peter Krauss Sep 14 '15 at 22:49
  • ... Another approach to *Stage-1* is to use [tidy-html5](http://www.htacg.org/tidy-html5/) and its filters to start with secure XHTML. – Peter Krauss Sep 14 '15 at 22:55
  • Looking again at this, I notice one also needs to account for the limitation in polyglot xhtml to not use e.g. certain self closing entities that are valid xml, but not valid html5 (e.g. '''''''' isn't what you want in polyglot (x)html. –  Sep 15 '15 at 15:48
  • Stage 3 is not just what content the javascript might insert, but also HOW it inserts it: in XML mode document.write is forbidden (script will fail), even if it inserts perfectly valid polyglot markup. –  Sep 15 '15 at 15:50
  • Hello(!)... About "self closing", see the [`C14N()`](http://www.w3.org/TR/xml-c14n) output: it is a standard where `hr` is `<hr></hr>` (not `<hr/>` or `<hr>`-only)... About "javascript", I do not understand... I am [stripping](http://php.net/manual/en/function.strip-tags.php) scripts (!). – Peter Krauss Sep 15 '15 at 17:36
  • 1
    One must not use the `<hr></hr>` syntax, but must use `<hr/>` in polyglot HTML. These are called "void HTML elements"; see section 4.6.1 in the polyglot document over at w3.org that you already linked in the question. Other elements must use the explicit start-tag/end-tag syntax even when empty.
    –  Sep 16 '15 at 11:04
  • You are right: we need to test what exactly C14N does... But an XSLT, closed with the input syntax, handles all the "Polyglot Rules" in a few lines, in particular the [4.6.1 Void elements](http://www.w3.org/TR/html-polyglot/#empty-elements) "empty-element tag syntax" (area, br, hr, etc.) and the "non-empty" ones (p, script, etc.). – Peter Krauss Sep 17 '15 at 18:21
  • .... See [new perspective at **tidy-html5**, offering a filter to HTML5-polyglot](https://github.com/htacg/tidy-html5/issues/265)! – Peter Krauss Sep 17 '15 at 18:24
0

I don't know of such a tool, but I think it should be possible to write your own converter with regular expressions in your preferred programming language. Here is an example using Java regex, which should be transferable to PHP too. You can test it on regexplanet.com.

Given: any self-closing HTML tag, e.g. `<textarea class="placeholder"/>`

Target: the tag shall be converted to `<textarea class="placeholder"></textarea>`

This can be achieved using a Java regular expression like `<\s*([^\s>]+)([^>]*)/\s*>` with a replacement string like `<$1$2></$1>`. The expression finds the first word in the tag (here `textarea`), assigns it to capture group 1, and assigns all attributes of the tag to capture group 2. This lets you concatenate groups 1 and 2 in the opening tag and reuse group 1 in the closing tag.
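The same pattern carries over to other regex engines. Below is a minimal Python transliteration (Python writes group references as `\1` rather than `$1`). One addition that is not part of the expression above: the sketch leaves void elements such as `<br/>` untouched, since polyglot markup wants those kept in the self-closing form. The `VOID` list and the function name `expand_self_closing` are assumptions made for this example.

```python
import re

# Pattern from the answer: tag name in group 1, attributes in group 2.
SELF_CLOSING = re.compile(r"<\s*([^\s>]+)([^>]*)/\s*>")

# Assumed list of void elements that must stay self-closing in polyglot markup.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


def expand_self_closing(markup):
    """Rewrite <name attrs/> as <name attrs></name>, except for void elements."""
    def replace(match):
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID:
            return match.group(0)  # leave <br/>, <hr/>, ... unchanged
        return f"<{name}{attrs}></{name}>"
    return SELF_CLOSING.sub(replace, markup)


print(expand_self_closing('<textarea class="placeholder"/>'))
# prints <textarea class="placeholder"></textarea>
```

Like any regex over markup, this is fragile: an attribute value containing `>` or an unquoted value ending in `/` will confuse it, so it only makes sense on input that is already well formed.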

Hope this helps.

mika
  • thanks, but you must see [w3.org/TR/html-polyglot](http://www.w3.org/TR/html-polyglot) to understand the question, is not only a question about "balancing tags", and, for more complex parse task you *must show here* the parser (the question is not "can I write a parser?" but "where the parser?"). PS: sorry the down-voting, but you are now in a *"open bounty +50"* (come back in 3 days to *edit* your question and try to return your vote). – Peter Krauss Mar 14 '15 at 12:57
  • In three days I'll probably still not know such a tool. If there is none, you'll have to go for a workaround, even if this means work for you. Can you limit the input, or is it mandatory to convert any HTML5? – mika Mar 14 '15 at 13:36
  • not "any HTML5", as I say "(...) or filters losing spurious information", so, the parser can be also a filter. Example: the filter can delete javascript and other non-content elements. – Peter Krauss Mar 14 '15 at 13:42
  • I read that. I ask because I wonder about the complex parse task you are talking about. What makes your content so special that parsing/transforming could be that hard? – mika Mar 14 '15 at 13:50
  • I agree with the general direction of this answer. But personally, I would start by inspecting `xmllint`'s results from parsing, transforming, and re-parsing (as XML), to gauge the extent of the changes necessary and to get a basic test strategy. – lossleader Mar 15 '15 at 23:37
  • @lossleader, [xmllint](http://xmlsoft.org/xmllint.html)? It has "the power of" XSLT v1. Again, the question is about "where is the parser??", and xmllint is not a parser for this task, it is only a tool... perhaps an XSLT v1 (??) can be used, but, translating the question again, *"where is the source code of this XSLT?"* – Peter Krauss Mar 16 '15 at 14:39
  • @PeterKrauss why do you assume something you have never seen exists? You can use existing tools to figure out the extent of what you need to make or you can give up.. – lossleader Mar 16 '15 at 17:17