I have a Java program that processes XML files. The files are in S1000D format, which is used for technical documentation. I need to update some metadata in these XML files, and I am using Saxon to do so.

But Saxon makes more changes than the ones in my XSL:

  • it collapses empty elements into self-closing tags
  • it expands the HTML entities contained in the file.

Here is an extract of one of my input files:

<dmodule xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.s1000d.org/S1000D_4-1/xml_schema_flat/schedul.xsd">
...
    <reqSpares>
        <noSpares></noSpares>
    </reqSpares>
    <reqSafety>
        <noSafety></noSafety>
    </reqSafety>
...
    <timeLimit>
        <remarks>
            <simplePara>Lorem ipsum</simplePara>
            <simplePara>Lorem ipsum dolor sit amet, consectetur adipiscing elit.&#xA;Vestibulum pulvinar sapien at lacus lacinia,&#xA;eu maximus arcu vestibulum.</simplePara>
        </remarks>
    </timeLimit>
...

And here is the result of my transformation:

<dmodule xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.s1000d.org/S1000D_4-1/xml_schema_flat/schedul.xsd">
...
    <reqSpares>
        <noSpares/>
    </reqSpares>
    <reqSafety>
        <noSafety/>
    </reqSafety>
...
    <timeLimit>
        <remarks>
            <simplePara>Lorem ipsum</simplePara>
            <simplePara>Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Vestibulum pulvinar sapien at lacus lacinia,
eu maximus arcu vestibulum.</simplePara>
        </remarks>
    </timeLimit>
...

Even though my XSL does not touch these lines, they are changed as shown.

My constraint is that I am not permitted to alter, for any reason, the structure or content of the XML I am transforming, as happens in this example. The service that provides the input does not want to edit it to add an entity declaration at the start of the XML file or to wrap the HTML entities in a CDATA section.

In Saxon, we have tried:

  • changing the output encoding to US-ASCII
  • the replace and translate functions, but since the affected text is not in the nodes we transform, this does not work
  • disable-output-escaping, but as above, the changes do not occur in the nodes our XSL transforms.

I have also looked into BaseX, but the problem is the same, and I am not expert enough in that library to know whether this behavior can be achieved.

Any help would be appreciated!

Damounet
  • XSLT uses an XML parser to parse your input XML markup into an XDM tree representation that does not preserve lexical details, such as whether an empty element was written as a start-tag/end-tag pair or as a self-closing tag, or whether a character was represented in a certain encoding, as a literal character, or as an entity reference. It is not clear what you use XSLT for, but given the way it works, no XSLT processor will preserve the markup in the input: the input is parsed into a tree, that tree is transformed into a result tree, and the result tree is optionally serialized back to markup. Along that path there is no way to preserve the markup details you say you want to preserve. – Martin Honnen Dec 14 '20 at 16:39
  • Depending on the API you use, there might be ways to influence the serialization of the result tree, e.g. to ensure that empty elements are serialized as a start-tag/end-tag pair instead of a self-closing tag, but that is a choice made at output serialization, not a preservation of input markup details. Such settings often require you to add Java or .NET code to set up the serialization. – Martin Honnen Dec 14 '20 at 16:43
  • I see, there is no easy way to achieve that. Maybe we have to achieve our transformation using regex? haha – Damounet Dec 14 '20 at 16:50
  • https://stackoverflow.com/a/1732454 – michael.hor257k Dec 14 '20 at 18:37
  • I wonder what editor software does to represent XML or other formats and allow you to manipulate parts of it. I would guess editors have tree representations as well, but probably with more detail than the standard DOM or XDM tree model. So it might make sense to look at the APIs and data structures an editor uses to represent and manipulate XML if you need to preserve such details. – Martin Honnen Dec 14 '20 at 20:23
  • I work for an aircraft manufacturer and we edit the technical documentation of the aircraft. This is done following a standard called S1000D. Our software components are part of the integration of this documentation and receive it from various other entities and services. We have to update some metadata inside the data structure, but not the actual technical content, because this document is certified by aeronautics authorities. – Damounet Dec 15 '20 at 07:48
  • I was already aware that this was nearly impossible to achieve, but I wanted to be sure. With your input and @michael-kay's answer, I have better arguments to change the minds of our stakeholders and find a more suitable solution for their needs. – Damounet Dec 15 '20 at 07:49

1 Answer

Distinctions like the difference between <foo/> and <foo></foo> are lost by the time the data has been parsed (similarly, the use of single vs double quotation marks around attributes, whitespace within start and end tags, etc), and XML parsers don't provide any way of disabling expansion of entity references. Since XSLT operates on the output of an XML parser, if an XSLT processor doesn't see such distinctions then it can't preserve them.
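
To make the point concrete, here is a minimal sketch using only the JDK's built-in DOM parser (not Saxon itself). The class and method names (`LexicalLossDemo`, `sameTree`, `textOf`) are illustrative and not part of any library. It checks that the two spellings of an empty element produce equal trees, and that a character reference has already been replaced by the character it denotes by the time application code sees the text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class LexicalLossDemo {

    /** Parses an XML string into a DOM tree with the JDK's default parser. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
    }

    /** Returns true if both strings parse to structurally equal element trees. */
    static boolean sameTree(String xmlA, String xmlB) {
        return parse(xmlA).getDocumentElement()
                .isEqualNode(parse(xmlB).getDocumentElement());
    }

    /** Returns the text content of the root element as the application sees it. */
    static String textOf(String xml) {
        return parse(xml).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) {
        // Start-tag/end-tag pair versus empty-element tag: same tree after parsing.
        System.out.println(sameTree(
                "<reqSpares><noSpares></noSpares></reqSpares>",
                "<reqSpares><noSpares/></reqSpares>"));

        // The reference &#xA; arrives as a plain line feed character.
        System.out.println(textOf("<simplePara>a&#xA;b</simplePara>").equals("a\nb"));
    }
}
```

Both checks should report `true`, which is why a processor working on the parsed tree has nothing left to preserve.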

Keeping entity references intact is a perfectly reasonable requirement, and my usual workaround is to use a text editor to globally replace & with § (after first checking that § doesn't appear in the file, of course) and then reverse the process on completion.
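
The same workaround can be automated around the transformation instead of being done by hand in a text editor. The following is a rough sketch under some assumptions: the class and method names (`AmpersandMasking`, `mask`, `unmask`, `identityTransform`) are hypothetical, and the JDK's identity transform stands in for the real stylesheet, which would be run through Saxon in practice.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class AmpersandMasking {

    // Placeholder suggested in the answer: the section sign (U+00A7).
    private static final String PLACEHOLDER = "\u00A7";

    /** Replaces every '&' with the placeholder, refusing input that already contains it. */
    static String mask(String xml) {
        if (xml.contains(PLACEHOLDER)) {
            throw new IllegalArgumentException("Input already contains the placeholder character");
        }
        return xml.replace("&", PLACEHOLDER);
    }

    /** Reverses mask() on the serialized result. */
    static String unmask(String xml) {
        return xml.replace(PLACEHOLDER, "&");
    }

    /** Stand-in for the real stylesheet (assumption): the JDK identity transform. */
    static String identityTransform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    /** Masks the references, transforms, then restores them. */
    static String transformPreservingReferences(String xml) {
        return unmask(identityTransform(mask(xml)));
    }

    public static void main(String[] args) {
        String input = "<simplePara>elit.&#xA;Vestibulum</simplePara>";
        System.out.println(transformPreservingReferences(input));
    }
}
```

While masked, a reference such as `&#xA;` is ordinary text to the stylesheet, so this only suits transformations that do not need to interpret those references. The placeholder also has to survive serialization unchanged: with an output encoding such as US-ASCII it would itself be written as a character reference and the final replacement would miss it.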

Keeping the exact lexical form of start and end tags is a much more questionable requirement. If you're being asked to do this, then the requirement is coming from someone who doesn't understand XML. Saxon gives you a lot of control over how the output is serialized (for example the serialization option saxon:canonical="yes" prevents use of empty element tags in the result), but it doesn't allow you to preserve whatever was in the input. If you're being told that's the requirement, then you need to ask "why" and "how much are you prepared to pay for this" - it will add greatly to your costs because you can forget all off-the-shelf XML processing libraries.

Michael Kay