I have an XML file that declares UTF-8 encoding. When I open the file in Vim, I see something like:

<?xml version="1.0" encoding="UTF-8"?> 
<r>
  <first-tag>foo</first-tag>
  <second-tag>
     &lt;a-tag-nested-in-second-tag&gt;some data&lt;/a-tag-nested-in-second-tag&gt;
  </second-tag>
  ...
</r>

I'm using the SAXParser from Java 1.6.0_41. While consuming this data, the parser doesn't recognize the escaped markup as elements; it either skips over it or treats the escaped characters as "content" data for second-tag.

Here's how I'm consuming the data:

File f = ...
SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
InputStream stream = new FileInputStream(f);
AbstractHandler handler = ...
parser.parse(new InputSource(stream), handler);

Is there a way for SAX to treat the nested escaped XML data as truly XML markup and not merely data as-is for second-tag?

jeemar
  • The file is either malformed or is deliberately including data as-is that looks like (and probably is) XML, but is not XML *for the purposes of the outer file*. SAX appears to be doing things exactly right. – Nathan Tuggy Jun 13 '15 at 01:54
  • Thanks Nathan, I've edited my post in hopes of getting closer to what I really want to ask. – jeemar Jun 13 '15 at 14:48

1 Answer

UTF-8 is a character encoding. It wouldn't make sense to have multiple character encodings in a single file, nor do you show any evidence of having multiple character encodings.

What you do show are multiple character entity references such as &lt; and &gt;. These are not a problem, although they may indicate (intentional or accidental) escaped output of XML markup.

What is a problem is that your "XML" lacks a single root element and is therefore not well-formed.

If you give your markup a single root element,

<?xml version="1.0" encoding="UTF-8"?>
<r>
  <first-tag>foo</first-tag>
  <second-tag>
    &lt;a-tag-nested-in-second-tag&gt;some data&lt;/a-tag-nested-in-second-tag&gt;
  </second-tag>
</r>

an XML parser will be able to parse it just fine.
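To see what the parser actually reports for a rooted document like the one above, here is a minimal sketch. The class name EscapedMarkupDemo and its RecordingHandler are hypothetical names introduced only for illustration; the sketch parses from a String rather than a file to stay self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EscapedMarkupDemo {

    /** Hypothetical handler: records element names and all character data seen. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> elements = new ArrayList<String>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements.add(qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser has already replaced &lt; and &gt; with < and > here.
            text.append(ch, start, length);
        }
    }

    static RecordingHandler parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        RecordingHandler handler = new RecordingHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<r><first-tag>foo</first-tag><second-tag>"
            + "&lt;a-tag-nested-in-second-tag&gt;some data&lt;/a-tag-nested-in-second-tag&gt;"
            + "</second-tag></r>";
        RecordingHandler h = parse(xml);
        System.out.println(h.elements); // [r, first-tag, second-tag]
        System.out.println(h.text);     // foo<a-tag-nested-in-second-tag>some data</a-tag-nested-in-second-tag>
    }
}
```

Only three elements are reported; the escaped tag arrives as ordinary text inside second-tag, with the entity references already resolved to < and >.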


Update per comments and updated question

"Is there a way for SAX to treat the nested escaped xml data as truly xml markup and not merely data as-is for 'second-tag'?"

No, there's no simple configuration flag that directs SAX to treat escaped XML as regular XML. SAX will rightly see the escaped XML data as the characters and entity references that it is. Your options include

  1. fixing the problem upstream by eliminating the escaping of the XML you wish to preserve, or
  2. post-processing the escaped XML data to re-establish the original XML.

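Note that option #2 might itself involve a SAX-based parser: a handler can buffer the character data of second-tag, which the parser has already unescaped, and then parse that buffered text as XML in a second pass, provided the text is itself well-formed.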
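A rough sketch of option #2 follows, under two assumptions: the escaped text inside the wrapper element is a well-formed XML document with a single root once unescaped, and the wrapper element does not nest inside itself. ReparseEscapedXml, ReparsingHandler, NameCollector, and innerElementNames are hypothetical names made up for this illustration, not part of any library.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ReparseEscapedXml {

    /** Hypothetical inner handler: collects the names of all elements it sees. */
    static class NameCollector extends DefaultHandler {
        final List<String> names = new ArrayList<String>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            names.add(qName);
        }
    }

    /** Buffers the text of one wrapper element, then parses that text as XML. */
    static class ReparsingHandler extends DefaultHandler {
        private final String wrapper;
        private final DefaultHandler inner;
        private StringBuilder buffer; // non-null only while inside the wrapper element

        ReparsingHandler(String wrapper, DefaultHandler inner) {
            this.wrapper = wrapper;
            this.inner = inner;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals(wrapper)) {
                buffer = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (buffer != null) {
                buffer.append(ch, start, length); // entities are already unescaped here
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (buffer == null || !qName.equals(wrapper)) {
                return;
            }
            String fragment = buffer.toString().trim();
            buffer = null;
            if (fragment.isEmpty()) {
                return;
            }
            try {
                // Second pass: the buffered text is assumed to be well-formed XML.
                SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(fragment)), inner);
            } catch (IOException e) {
                throw new SAXException(e);
            } catch (ParserConfigurationException e) {
                throw new SAXException(e);
            }
        }
    }

    static List<String> innerElementNames(String xml, String wrapper) throws Exception {
        NameCollector collector = new NameCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), new ReparsingHandler(wrapper, collector));
        return collector.names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<r><first-tag>foo</first-tag><second-tag>\n"
            + "  &lt;a-tag-nested-in-second-tag&gt;some data&lt;/a-tag-nested-in-second-tag&gt;\n"
            + "</second-tag></r>";
        System.out.println(innerElementNames(xml, "second-tag")); // [a-tag-nested-in-second-tag]
    }
}
```

If the escaped content could hold several sibling elements rather than one root, the second parse would fail as written; wrapping the buffered text in a synthetic root element before re-parsing is one possible workaround.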

See also how to unescape XML in java.

kjhughes
  • Thanks @kjhughes, I forgot to mention that the file does have the proper single root element. So in this case, the parser is correctly parsing the "a-tag-nested-in-second-tag" as content data for "second-tag" and not seeing it as part of the structure as was intended. In both our examples, not counting the initial XML declaration and root element, there should really be 3 XML tags, but the parser is only consuming it as 2 tags with escaped XML data as content data for the second-tag. SAX is parsing it "fine", but the intent of the file is lost because of the escaped XML structure. – jeemar Jun 13 '15 at 14:35
  • I've edited my post in hopes of getting closer to what I really want to ask. Your answer was very helpful. – jeemar Jun 13 '15 at 14:47
  • Thanks! That was very helpful and I can now close this post. – jeemar Jun 13 '15 at 15:47