XML Canonical form in Java

Question

This question got me pretty close and actually works. Now I'm trying to understand it better and make it more robust.

Have the following test code:

// Just build a test xml
String xml;
xml = "<aaa Batt = \"That\" Aatt=\"this\" >\n";
xml += "<!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/>\n";
xml += "         <ccc/></aaa>";

// do the necessary bureaucracy
DocumentBuilder docBuilder;
docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc;
doc = docBuilder.parse(new ByteArrayInputStream(xml.getBytes()));

// Normalize document
// Do I realy need to do this?
doc.normalize();

// Canonize using Apache's Xml security
org.apache.xml.security.Init.init(); // Doesnt work if I don't do this.
byte[] c14nOutputbytes = Canonicalizer.getInstance(
        Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS)
        .canonicalizeSubtree(doc.getDocumentElement());
// This was a reparse reccomended to get attributes in alpha order
Document canon = docBuilder.parse(new ByteArrayInputStream(c14nOutputbytes));

// Input and output for the transformer
DOMSource xmlInput = new DOMSource(canon);
StreamResult xmlOutput = new StreamResult(new StringWriter());

// Configure transformer and format code
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(
    "{http://xml.apache.org/xslt}indent-amount", "4");
transformer.transform(xmlInput, xmlOutput);

// And print it
System.out.println(xmlOutput.getWriter().toString());

Executing this code, would output:

<aaa Aatt="this" Batt="That">
<!-- Document comment --><bbb lolol="dsf" moarttt="fasf"/>
         <ccc/>
</aaa>

Which might be canonized, but doesn't seem to respect the indentation I asked the transformer to do.

Having such an example, I have a few questions:

For my intent, is there any difference between .normalize() and Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS? Removing either of them seems to yield the same result (again within my intent of have a canonical and pretty printed xml).
Why do the blank spaces within the xml seem to screw the formatting? Would I have to trim the text of each xml node to make it work? It just sounds wrong, nonetheless if the input xml is <aaa Batt = \"That\" Aatt=\"this\" ><bbb moarttt=\"fasf\" lolol=\"dsf\"/><ccc/></aaa> the xml is perfectly formatted.
Why after asking for the canonical form, tags such as <ccc/> weren't expanded to <ccc></ccc>? Wikipedia says "empty elements are encoded as start/end pairs, not using the special empty-element syntax".

Sorry if these are too many questions at once, but I have the feeling the answers for all of these should be somewhat the same.

XML Canonical form in Java

0 Answers0