When parsing XML data with the built-in Java XML processing engine (tested with JDK 8u151 and 8u161), I get strange results. If I use a parameter entity reference in a DTD, all comments that follow it in the DTD end up in the output document.
This is the (minimal) code I am running:
import java.io.*;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;

public class FormatBug {
    public static void main( String[] args ) throws Exception {
        // Identity transformation: copy the parsed input to the output unchanged
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        Reader in = new FileReader( args[0] );
        Writer out = new FileWriter( args[1] );
        t.transform( new SAXSource( new InputSource(in) ), new StreamResult(out) );
        out.flush();
        out.close();
    }
}
The source document looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE doc SYSTEM "doc.dtd">
<doc><p>This is a <b>bold</b> line.</p></doc>
The DTD (doc.dtd) looks as follows:
<!ELEMENT doc (p+)>
<!ENTITY % floats "b" >
<!-- comment before -->
<!ELEMENT p ( #PCDATA | %floats; )*>
<!-- comment after -->
<!ELEMENT b (#PCDATA)>
The result looks like this:
<!-- comment after --><!DOCTYPE doc SYSTEM "doc.dtd">
<doc><p>This is a <b>bold</b> line.</p></doc>
When I replace the declaration for p with
<!ELEMENT p ( #PCDATA | b )*>
the spurious comment disappears.
Can someone explain what is going on here?
I also checked against JDK 9.0.4, where all comments are copied, so I assume I might be doing something entirely wrong.
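To narrow down whether the parser or the serializer is responsible, the lexical events the parser reports can be logged directly. The following is only a diagnostic sketch, not part of the original test: the class name DtdEventLogger and the inlined copies of the document and DTD are assumptions made so that it runs without any files. It registers a SAX LexicalHandler and records the DTD boundaries, entity boundaries, and comments in the order they arrive.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper class: logs the lexical events the SAX parser reports.
public class DtdEventLogger {
    // Copies of the document and DTD from the question, inlined for self-containment.
    static final String SAMPLE_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
      + "<!DOCTYPE doc SYSTEM \"doc.dtd\">\n"
      + "<doc><p>This is a <b>bold</b> line.</p></doc>\n";
    static final String SAMPLE_DTD =
        "<!ELEMENT doc (p+)>\n"
      + "<!ENTITY % floats \"b\" >\n"
      + "<!-- comment before -->\n"
      + "<!ELEMENT p ( #PCDATA | %floats; )*>\n"
      + "<!-- comment after -->\n"
      + "<!ELEMENT b (#PCDATA)>\n";

    static List<String> collectEvents(final String xml, final String dtd) throws Exception {
        final List<String> events = new ArrayList<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override public void startDTD(String name, String publicId, String systemId) {
                events.add("startDTD " + name);
            }
            @Override public void endDTD() {
                events.add("endDTD");
            }
            @Override public void startEntity(String name) {
                events.add("startEntity " + name);
            }
            @Override public void endEntity(String name) {
                events.add("endEntity " + name);
            }
            @Override public void comment(char[] ch, int start, int length) {
                events.add("comment [" + new String(ch, start, length) + "]");
            }
            // Serve the DTD from memory instead of reading doc.dtd from disk.
            @Override public InputSource resolveEntity(String name, String publicId,
                                                       String baseURI, String systemId) {
                return new InputSource(new StringReader(dtd));
            }
        };

        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.setEntityResolver(handler);
        // Standard SAX2 property for receiving comment and DTD boundary events.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : collectEvents(SAMPLE_XML, SAMPLE_DTD)) {
            System.out.println(event);
        }
    }
}
```

If both comments show up between startDTD and endDTD on every JDK, the parser is probably reporting them consistently, and the difference would have to come from how the identity transformer handles comments inside the DTD. That is a hypothesis to test with the log, not a confirmed diagnosis.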