When parsing XML data with Java's built-in XML processing engine (tested with JDK 8u151 and 8u161), I get strange results. If I use parameter entity references in a DTD, all following comments from the DTD end up in the output document.

This is the (minimal) code I am running:

import java.io.*;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.InputSource;

public class FormatBug {

    public static void main( String[] args ) throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        Reader in = new FileReader( args[0] );
        Writer out = new FileWriter( args[1] );
        t.transform( new SAXSource( new InputSource(in) ), new StreamResult(out) );
        out.flush();
        out.close();
    }
}

The source document looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE doc SYSTEM "doc.dtd">
<doc><p>This is a <b>bold</b> line.</p></doc>

The DTD (doc.dtd) looks as follows:

<!ELEMENT doc (p+)>
<!ENTITY % floats "b" >
<!-- comment before -->
<!ELEMENT p ( #PCDATA | %floats; )*>
<!-- comment after -->
<!ELEMENT b (#PCDATA)>

The result looks like this:

<!-- comment after --><!DOCTYPE doc SYSTEM "doc.dtd">
<doc><p>This is a <b>bold</b> line.</p></doc>

When I replace the declaration for p with

<!ELEMENT p ( #PCDATA | b )*>

the spurious comment disappears.

Can someone explain what is going on here?

I also checked against JDK 9.0.4, where all comments are copied, so I assume I might be doing something entirely wrong.

Meriadox

1 Answer

I can confirm this behaviour on JDK 1.8.0_151. I think the problem comes from using a SAXSource as the input for the transformation, because Java's javax.xml.parsers.SAXParser ignores comments.
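To show what I mean by the SAX side of this: as far as I know, SAX never delivers comments through ContentHandler; they only reach an org.xml.sax.ext.LexicalHandler, and only if one is registered on the XMLReader. The sketch below is not a reproduction of the bug. It uses a small inline DTD instead of your doc.dtd, and the names CommentProbe and collectComments are made up for illustration. It only records which comments the parser reports and whether they arrive between startDTD and endDTD, which is the information a consumer would need in order to keep DTD comments out of the output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class CommentProbe {

    /** Collects every comment the SAX parser reports, tagged "dtd:" or "doc:". */
    static List<String> collectComments(String xml) throws Exception {
        List<String> seen = new ArrayList<>();

        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            private boolean inDtd = false;

            @Override
            public void startDTD(String name, String publicId, String systemId) {
                inDtd = true;
            }

            @Override
            public void endDTD() {
                inDtd = false;
            }

            @Override
            public void comment(char[] ch, int start, int length) {
                String text = new String(ch, start, length).trim();
                seen.add((inDtd ? "dtd:" : "doc:") + text);
            }
        };

        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        // Without this property no comment events are delivered at all.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // Inline DTD used only for this illustration (assumption, not your doc.dtd).
        String xml = "<!DOCTYPE doc [\n"
                + "<!ELEMENT doc (#PCDATA)>\n"
                + "<!-- in dtd -->\n"
                + "]>\n"
                + "<doc>text<!-- in body --></doc>";
        for (String c : collectComments(xml)) {
            System.out.println(c);
        }
    }
}
```

If the comments from your doc.dtd show up here tagged as "dtd:", that would suggest the parser reports them correctly and the mix-up happens further along, where the transformer consumes the events.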

The following variant using StAX doesn't print spurious comments on JDK 1.8, so it might help you get the same Java source to behave consistently on both JDK 1.8 and JDK 9:

import java.io.*;
import javax.xml.stream.*;
import javax.xml.transform.*;
import javax.xml.transform.stax.*;
import javax.xml.transform.stream.*;

public class FormatBugUsingStaX {

    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();

        // try-with-resources closes both files, which also flushes the output
        try (Reader in = new InputStreamReader(new FileInputStream(args[0]));
             Writer out = new FileWriter(args[1])) {
            XMLStreamReader streamReader = factory.createXMLStreamReader(in);
            t.transform(new StAXSource(streamReader), new StreamResult(out));
        }
    }
}

Edit: If your intention is to keep the comments, you might have luck with another StAX implementation; cf. "Transforming a StAX Source in Java".

imhotap