This question got me pretty close and actually works. Now I'm trying to understand it better and make it more robust.

I have the following test code:

// Just build a test xml
String xml;
xml = "<aaa Batt = \"That\" Aatt=\"this\" >\n";
xml += "<!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/>\n";
xml += "         <ccc/></aaa>";

// do the necessary bureaucracy
DocumentBuilder docBuilder;
docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc;
doc = docBuilder.parse(new ByteArrayInputStream(xml.getBytes()));

// Normalize document
// Do I really need to do this?
doc.normalize();

// Canonicalize using Apache XML Security
org.apache.xml.security.Init.init(); // Doesn't work if I don't do this.
byte[] c14nOutputbytes = Canonicalizer.getInstance(
        Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS)
        .canonicalizeSubtree(doc.getDocumentElement());
// Reparse, as recommended, to get the attributes in alphabetical order
Document canon = docBuilder.parse(new ByteArrayInputStream(c14nOutputbytes));

// Input and output for the transformer
DOMSource xmlInput = new DOMSource(canon);
StreamResult xmlOutput = new StreamResult(new StringWriter());

// Configure transformer and format code
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(
    "{http://xml.apache.org/xslt}indent-amount", "4");
transformer.transform(xmlInput, xmlOutput);

// And print it
System.out.println(xmlOutput.getWriter().toString());

Executing this code outputs:

<aaa Aatt="this" Batt="That">
<!-- Document comment --><bbb lolol="dsf" moarttt="fasf"/>
         <ccc/>
</aaa>

This may be canonical, but it doesn't respect the indentation I asked the transformer to apply.

Having such an example, I have a few questions:

  • For my purposes, is there any difference between doc.normalize() and canonicalizing with Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS? Removing either of them seems to yield the same result (again, given my goal of a canonical, pretty-printed XML).
  • Why does the whitespace inside the XML seem to break the formatting? Would I have to trim the text of each XML node to make it work? That sounds wrong, yet if the input XML is <aaa Batt = "That" Aatt="this" ><!-- Document comment --><bbb moarttt="fasf" lolol="dsf"/><ccc/></aaa>, the output is perfectly formatted.
  • Why, after asking for the canonical form, weren't tags such as <ccc/> expanded to <ccc></ccc>? Wikipedia says "empty elements are encoded as start/end pairs, not using the special empty-element syntax".
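
To make the first two questions concrete, here is a minimal sketch using only JDK classes (no Apache XML Security). It works on the assumption that the whitespace-only text nodes left in the DOM are what keep the transformer from indenting; that is my guess from the output above, not something I have confirmed. The class name DomWhitespaceSketch and its helper methods are made up for illustration. It counts the whitespace-only text nodes before and after normalize(), which per the DOM specification only merges adjacent text nodes and drops empty ones, then removes them and pretty prints.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class for illustration; JDK only.
public class DomWhitespaceSketch {

    // Parse an XML string into a DOM document.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Select every text node whose content is only whitespace.
    private static NodeList blankTextNodes(Document doc) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//text()[normalize-space(.) = '']", doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static int countBlankTextNodes(Document doc) {
        return blankTextNodes(doc).getLength();
    }

    // Detach the whitespace-only text nodes; returns how many were removed.
    public static int stripBlankTextNodes(Document doc) {
        NodeList blanks = blankTextNodes(doc);
        int removed = blanks.getLength();
        for (int i = 0; i < removed; i++) {
            Node blank = blanks.item(i);
            blank.getParentNode().removeChild(blank);
        }
        return removed;
    }

    // Serialize with the same transformer settings as in the question.
    public static String prettyPrint(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.INDENT, "yes");
            t.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<aaa Batt = \"That\" Aatt=\"this\" >\n"
                + "<!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/>\n"
                + "         <ccc/></aaa>");
        System.out.println("blank text nodes after parse: " + countBlankTextNodes(doc));
        doc.normalize(); // merges adjacent text nodes, does not trim whitespace
        System.out.println("blank text nodes after normalize(): " + countBlankTextNodes(doc));
        stripBlankTextNodes(doc);
        System.out.println(prettyPrint(doc));
    }
}
```

If the assumption holds, the two counts printed before stripping should match, which would suggest normalize() is not what affects the layout, and the final print should be indented the same way as the single-line input. This leaves the canonicalization step and the <ccc/> question untouched.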

Sorry if these are too many questions at once, but I have the feeling the answers to all of them are closely related.

filippo
