
I receive an input file containing elements like

<item>
<Description>
    Intro 1
    &lt;b&gt;Title&lt;/b&gt;
    Intro 2
    &lt;ul&gt;
    &lt;li&gt;item 1&lt;/li&gt;
    &lt;li&gt;&lt;b&gt;item 2&lt;/b&gt;&lt;/li&gt;
    &lt;/ul&gt;
    Finish
</Description>
</item>

I would like to create an XSLT 2 template or function that converts this to a node() like

<item>
<Description>
    Intro 1
    <b>Title</b>
    Intro 2
    <ul>
    <li>item 1</li>
    <li><b>item 2</b></li>
    </ul>
    Finish
</Description>
</item>

so that I can process it further.

Any recommendations on how to achieve this?

ngong
  • Does this answer your question? [How to unescape XML characters with help of XSLT?](https://stackoverflow.com/questions/2463155/how-to-unescape-xml-characters-with-help-of-xslt) – Progman Dec 22 '20 at 14:48
  • I played around with disable-output-escaping, without much success so far. The second solution looks tough, but it may do what I am looking for. I will try it. – ngong Dec 22 '20 at 15:19

1 Answer

David Carlisle implemented an HTML parser in XSLT 2. You can find it at https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl and use it, for example, as follows:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    exclude-result-prefixes="#all"
    version="3.0">
    
  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:mode on-no-match="shallow-copy"/>

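  <!-- Parse the escaped markup in Description with d:htmlparse and
       process the resulting nodes instead of the original text. -->
  <xsl:template match="Description">
      <xsl:copy>
          <xsl:apply-templates select="d:htmlparse(., '', true())/node()"/>
      </xsl:copy>
  </xsl:template>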
  
</xsl:stylesheet>

to get a result like

<item>
<Description>
    Intro 1
    <b>Title</b>
    Intro 2
    <ul>
    <li>item 1</li>
    <li><b>item 2</b></li>
    Finish
</ul></Description>
</item>

If the escaped content were well-formed XML, you could also use the XSLT 3/XPath 3 function parse-xml-fragment, but without the closing </ul> your sample can't be parsed as XML.
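
The parse-xml-fragment alternative can be sketched as follows. This is a minimal sketch, assuming an XSLT 3 processor (for example Saxon 9.8 or later) and assuming the content of Description is well-formed once unescaped, i.e. the closing </ul> is present. The template structure mirrors the htmlparse example above.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="#all"
    version="3.0">

  <!-- Copy every node unchanged unless a more specific template matches. -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Re-parse the escaped markup held in Description as an XML fragment
       and process the resulting nodes instead of the original text.
       Assumption: the unescaped string is a well-formed fragment;
       otherwise parse-xml-fragment raises a dynamic error. -->
  <xsl:template match="Description">
      <xsl:copy>
          <xsl:apply-templates select="parse-xml-fragment(.)/node()"/>
      </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because parse-xml-fragment returns a document node, selecting its child nodes with /node() places the parsed elements and text directly inside the copied Description element.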

Martin Honnen
  • Thank you - the closing </ul> was missing in my text; corrected. – ngong Dec 22 '20 at 16:52
  • I used htmlparse.xsl - it interpreted to give me separate lines, but stripped off any other tag. However, it is good enough for now - I will dig into it later. – ngong Dec 22 '20 at 17:18
  • https://xsltfiddle.liberty-development.net/bEJbVrt/1 is your edited sample with the closing tag, parsed in XSLT 3 with `parse-xml-fragment`; https://xsltfiddle.liberty-development.net/bEJbVrt/0 is your original sample parsed with `d:htmlparse`. I am not sure why your attempt stripped off the other tags. If the problem persists, raise it as a separate question with a minimal but complete sample of your code and the XSLT processor you used. – Martin Honnen Dec 22 '20 at 17:49
  • Thank you, Martin, for pointing me to parse-xml-fragment. That was the best solution. It works with Saxon 10.3, even though I selected version="2.0" in the xsl:stylesheet. So I did not look further into my problem with htmlparse. – ngong Dec 23 '20 at 13:28
  • Saxon 9.8 and later are XSLT 3 processors; the function library they support is not in any way restricted by selecting `version="2.0"` in the code you process with them. There is a backwards compatibility mode for `version="1.0"`, but that does not restrict the function library either. – Martin Honnen Dec 23 '20 at 14:18