This question got me pretty close and actually works. Now I'm trying to understand it better and make it more robust.
I have the following test code:
// Just build a test xml
String xml;
xml = "<aaa Batt = \"That\" Aatt=\"this\" >\n";
xml += "<!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/>\n";
xml += " <ccc/></aaa>";
// do the necessary bureaucracy
DocumentBuilder docBuilder;
docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc;
doc = docBuilder.parse(new ByteArrayInputStream(xml.getBytes()));
// Normalize document
// Do I really need to do this?
doc.normalize();
// Canonicalize using Apache XML Security
org.apache.xml.security.Init.init(); // Doesn't work if I don't do this.
byte[] c14nOutputbytes = Canonicalizer.getInstance(
        Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS)
        .canonicalizeSubtree(doc.getDocumentElement());
// This reparse was recommended to get the attributes in alphabetical order
Document canon = docBuilder.parse(new ByteArrayInputStream(c14nOutputbytes));
// Input and output for the transformer
DOMSource xmlInput = new DOMSource(canon);
StreamResult xmlOutput = new StreamResult(new StringWriter());
// Configure transformer and format code
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(
        "{http://xml.apache.org/xslt}indent-amount", "4");
transformer.transform(xmlInput, xmlOutput);
// And print it
System.out.println(xmlOutput.getWriter().toString());
Executing this code outputs:
<aaa Aatt="this" Batt="That">
<!-- Document comment --><bbb lolol="dsf" moarttt="fasf"/>
<ccc/>
</aaa>
This might be canonicalized, but it doesn't seem to respect the indentation I asked the transformer to apply.
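To check whether whitespace-only text nodes are involved, one option is to count the children of the root element before and after removing those nodes. The following is only a sketch using the JDK's DOM API; the class name `WhitespaceStripper` and its methods are hypothetical helpers, not part of any library, and it assumes that whitespace-only text carries no meaning in the document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper: removes whitespace-only text nodes from a DOM tree.
class WhitespaceStripper {

    // Parse an XML string with the default (non-validating) JDK parser.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Recursively drop text nodes that contain nothing but whitespace.
    // Assumption: such nodes are insignificant for this document.
    static void stripWhitespaceNodes(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripWhitespaceNodes(child);
            }
            child = next;
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<aaa Batt = \"That\" Aatt=\"this\" >\n"
                + "<!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/>\n"
                + " <ccc/></aaa>";
        Document doc = parse(xml);
        Node root = doc.getDocumentElement();
        System.out.println("children before: " + root.getChildNodes().getLength());
        stripWhitespaceNodes(root);
        System.out.println("children after:  " + root.getChildNodes().getLength());
    }
}
```

If the count drops after stripping, the parsed tree did contain whitespace text nodes next to the elements, which is what the single-line input avoids.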
Given this example, I have a few questions:
- For my purposes, is there any difference between .normalize() and Canonicalizer.ALGO_ID_C14N_EXCL_WITH_COMMENTS? Removing either of them seems to yield the same result (again, given my goal of a canonical, pretty-printed XML).
- Why does the whitespace within the XML seem to break the formatting? Would I have to trim the text of each XML node to make it work? That sounds wrong. Nonetheless, if the input XML is
<aaa Batt = \"That\" Aatt=\"this\" ><!-- Document comment --><bbb moarttt=\"fasf\" lolol=\"dsf\"/><ccc/></aaa>
the output is perfectly formatted.
- Why, after asking for the canonical form, weren't tags such as <ccc/> expanded to <ccc></ccc>? Wikipedia says "empty elements are encoded as start/end pairs, not using the special empty-element syntax".
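On the first question, it may help to isolate what normalize() does by itself. As far as the DOM API documents it, normalize() merges adjacent text nodes and removes empty ones; it says nothing about attribute order or whitespace. A minimal sketch to observe that, using only the JDK (the class name `NormalizeDemo` and the method `buildFragmentedElement` are made up for illustration):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo: shows the effect of normalize() on fragmented text nodes.
class NormalizeDemo {

    // Build <aaa> with three separate text children: "foo", "" and "bar".
    // A parser would not normally produce this; it is built by hand here.
    static Element buildFragmentedElement() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("aaa");
        doc.appendChild(root);
        root.appendChild(doc.createTextNode("foo"));
        root.appendChild(doc.createTextNode(""));
        root.appendChild(doc.createTextNode("bar"));
        return root;
    }

    public static void main(String[] args) throws Exception {
        Element root = buildFragmentedElement();
        System.out.println("text nodes before: " + root.getChildNodes().getLength());
        root.getOwnerDocument().normalize(); // same call as doc.normalize() above
        System.out.println("text nodes after:  " + root.getChildNodes().getLength());
        System.out.println("text content:      " + root.getTextContent());
    }
}
```

A freshly parsed document usually has no adjacent or empty text nodes to begin with, which might explain why removing the normalize() call changes nothing in my test.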
Sorry if these are too many questions at once, but I have the feeling the answers for all of these should be somewhat the same.