
I'm writing a Java program that reads an XML file, makes some modifications, and then writes the file back in the same format.

The following is the code block that reads and writes the XML file:

        final Document fileDocument = parseFileAsDocument(file);

        final OutputFormat format = new OutputFormat(fileDocument);

        // try-with-resources closes (and flushes) the writer even if serialization fails
        try (final FileWriter out = new FileWriter(file)) {
            final XMLSerializer serializer = new XMLSerializer(out, format);
            serializer.serialize(fileDocument);
        } catch (final IOException e) {
            System.out.println(e.getMessage());
        }

This is the method used to parse the file:

private Document parseFileAsDocument(final File file) {
    Document inputDocument = null;
    try {
        inputDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
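    } catch (final ParserConfigurationException | SAXException | IOException e) {
        // exception handling omitted in the original post
    }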

    return inputDocument;
}

I notice two changes after the file is written. Before, I had a node similar to this:

<instance ref='filter'>
 <value></value>
</instance>

After reading and writing, the node looks like this:

<instance ref="filter">
 <value/>
</instance>

As you can see above, 'filter' has been changed to "filter", with double quotes. The second change is that <value></value> has been changed to <value/>. This happens throughout the XML file wherever there is an empty node like <tag></tag>; a node with content, such as <tag>somevalue</tag>, is unaffected. Any thoughts on how to keep the XML node format the same after writing? I'd appreciate it!

Zak
  • Take a look at this: http://stackoverflow.com/questions/3884876/how-to-create-an-xml-text-node-with-an-empty-string-value-in-java – Bruno Ribeiro Jan 15 '15 at 20:20

1 Answer


You can't, and you shouldn't try. It's a bit like complaining that when you add 0123 and 0234, you get 357 without the leading zeroes. Leading zeroes in integers aren't considered significant, so arithmetic operations don't preserve them. The same happens to insignificant details of your XML, like the distinction between double quotes and single quotes, and the distinction between a self-closing tag and a start/end tag pair for an empty element. If any consumer of the XML is depending on these details, they need to be sent for retraining.
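To make this concrete, here is a minimal sketch using only the JDK's DOM parser. It parses the "before" and "after" snippets from the question and prints what a consumer of the parsed document can actually observe. The class name `LexicalDetailsDemo` and the helper `describe` are made up for illustration; they are not part of any library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class LexicalDetailsDemo {

    // Parses an XML string into a DOM Document.
    // Checked exceptions are wrapped to keep the sketch short.
    public static Document parse(final String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (final Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Hypothetical helper: summarizes what the DOM exposes for the <instance> node,
    // namely the value of its ref attribute and whether <value> has any children.
    public static String describe(final String xml) {
        final Element instance = parse(xml).getDocumentElement();
        final Element value = (Element) instance.getElementsByTagName("value").item(0);
        return "ref=" + instance.getAttribute("ref")
                + ", valueHasChildren=" + value.hasChildNodes();
    }

    public static void main(final String[] args) {
        final String before = "<instance ref='filter'>\n <value></value>\n</instance>";
        final String after = "<instance ref=\"filter\">\n <value/>\n</instance>";
        System.out.println(describe(before));
        System.out.println(describe(after));
    }
}
```

Both lines print `ref=filter, valueHasChildren=false`: once parsed, the quote style and the empty-element form are simply gone, which is why a serializer has nothing to preserve.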

The most usual reason for asking for lexical details to be preserved is that you want to detect changes. But this means you are doing your comparisons the wrong way: you should be comparing at the logical level, not the physical level. One way to do comparisons is to canonicalize the XML, so whenever there is an arbitrary choice to be made between equivalent representations, it is made the same way.
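As a rough sketch of comparing at the logical level, the following uses the JDK's DOM `Node.isEqualNode` on two parsed documents. Note this is DOM equality, not W3C Canonical XML (C14N); it is just the simplest thing that works with the standard library alone. The class name `LogicalXmlCompare` and the method `sameXml` are made up for this example. It also assumes whitespace between elements matters, so two files that differ only in indentation would still compare as different here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class LogicalXmlCompare {

    // Parses an XML string into a DOM Document.
    // Checked exceptions are wrapped to keep the sketch short.
    public static Document parse(final String xml) {
        try {
            final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Merge CDATA sections into plain text so <![CDATA[x]]> and x compare equal.
            factory.setCoalescing(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (final Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Hypothetical helper: true if both strings parse to equal DOM trees.
    // Quote style and <tag></tag> vs <tag/> do not survive parsing,
    // so they cannot affect the result.
    public static boolean sameXml(final String left, final String right) {
        return parse(left).isEqualNode(parse(right));
    }

    public static void main(final String[] args) {
        final String before = "<instance ref='filter'>\n <value></value>\n</instance>";
        final String after = "<instance ref=\"filter\">\n <value/>\n</instance>";
        final String edited = "<instance ref=\"filter\">\n <value>x</value>\n</instance>";

        System.out.println(sameXml(before, after));  // lexical differences only
        System.out.println(sameXml(before, edited)); // a real content change
    }
}
```

The first comparison reports the documents as equal and the second reports a difference, so only genuine content changes show up. If you need a byte-for-byte comparable form (for hashing or signing, say), a real C14N implementation is the better tool.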

Michael Kay
  • Thank you so much for your great answer, and I totally agree with you! The reason I was worried about these changes is that I'm modifying some XML files generated by an IDE, and I read and write them in order to make those modifications. So I was worried that the IDE would not be able to interpret these files correctly after things changed from single quotes to double quotes and from <tag></tag> to <tag/>. But after some initial testing, everything seems to be working fine. – Zak Jan 15 '15 at 23:27
  • Also, these files don't have an .xml extension. They don't have any file extension at all, but when I open them I can see an XML declaration at the top, so I was thinking that they are some specific XML files only for that IDE. – Zak Jan 15 '15 at 23:33
  • Don't worry about it. It's perfectly OK for XML files to have a file extension other than .xml, and it's perfectly OK (indeed, it's recommended) for them to start with an XML declaration on the first line. – Michael Kay Jan 16 '15 at 09:29