Say I have the following bits of XML:

<string>&#38;</string>
<string>&#x26;</string>

Is there a Java XML parsing API that will preserve them as is when reading them?

I've tried SAX and DOM, and am currently on StAX. They all resolve the character references before handing me the character data.

Thanks!

Juan C Nuno
  • AFAIK, no. But why would you need to do that? – Stephen C Mar 10 '16 at 02:20
  • I'm working on the translations editor in Android Studio. People use it to manage strings in multiple languages for their Android UIs. The strings eventually get persisted in XML string resource files. I'm using XML APIs to parse the strings (which need to be valid XML) to do some escaping that our tools need. But I otherwise don't want to munge what our users enter. Maybe a particular font can't render a particular character, so they use a numeric escape for it. But if the XML API converts it back, we're back to where we started, no? – Juan C Nuno Mar 10 '16 at 02:36
  • *"Maybe a particular font can't render a particular character, so they use a numeric escape for it."* - Erm ... that wouldn't work. The codepoint will be the same whether you represent it using a character reference or a plain character (e.g. using UTF-8). You will get the same "missing font" problems in either case. If the issue is that people can't read the XML itself because of charset issues, then the solution is to control the charset that DOM / SAX / whatever uses when encoding the modified XML. – Stephen C Mar 10 '16 at 04:14
  • See also http://stackoverflow.com/questions/1777878/is-there-a-java-xml-api-that-can-parse-a-document-without-resolving-character-en?rq=1 (However, that question relates to entity references like `&aumlaut;`, wrongly referred to as character entities, rather than to numeric character references.) – Michael Kay Mar 10 '16 at 09:13
  • @StephenC Are you sure? Let's pick U+00A4 CURRENCY SIGN for the sake of argument. My font can't render that, so I replace that in my XML with "&#xA4;". I'm replacing however many bytes it takes to properly encode that character with the six bytes for the numeric reference (which my font can render). To XML, they mean the same thing. But not to the user. – Juan C Nuno Mar 10 '16 at 20:08
  • Yes, I am sure. The case you are worried about is *properly* dealt with by choosing ASCII as the character encoding when you (re-)generate the XML after it has been edited. – Stephen C Mar 10 '16 at 21:20
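
The ASCII-encoding suggestion in the last comment could look roughly like the sketch below. It runs an identity transform with the JAXP `Transformer` and sets the output encoding to `US-ASCII`, so any character that ASCII cannot encode has to be written as a numeric character reference. The class and method names (`AsciiXmlWriter`, `reserialize`) are hypothetical. Note the limitation: this normalizes every non-ASCII character on output and does not remember which characters the user originally typed as references, nor whether they used decimal or hex.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: re-serializes XML so the output contains only ASCII bytes.
final class AsciiXmlWriter {

    static String reserialize(String xml) {
        try {
            // A Transformer created without a stylesheet performs an identity transform.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Ask the serializer to encode the result as ASCII; characters outside
            // ASCII must then be emitted as numeric character references.
            transformer.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not re-serialize XML", e);
        }
    }

    public static void main(String[] args) {
        // U+00A4 CURRENCY SIGN goes in as a plain character.
        System.out.println(reserialize("<string>\u00A4</string>"));
    }
}
```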

1 Answer

To the best of my knowledge the answer is no (though proving the non-existence of a piece of software is difficult).

If this is really a requirement (and I'm sceptical), then I would suggest preprocessing the input to replace &# by, say, §#, perhaps choosing § from the Unicode private use area if you want to be ultra-cautious.
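
A minimal sketch of that preprocessing idea, assuming StAX and assuming U+E000 (a private use area code point) never appears in the real input. The class `CharRefPreserver` and its methods are hypothetical names for illustration. The `&#` prefix is swapped for the placeholder before parsing, so the parser sees ordinary text, and swapped back afterwards.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: hides numeric character references from the parser.
final class CharRefPreserver {

    // Assumption: this private use area character does not occur in the input.
    private static final String PLACEHOLDER = "\uE000";

    // Replace "&#" so the parser no longer sees a character reference.
    static String protect(String xml) {
        if (xml.contains(PLACEHOLDER)) {
            throw new IllegalArgumentException("Input already contains the placeholder");
        }
        return xml.replace("&#", PLACEHOLDER + "#");
    }

    // Put the original "&#" prefix back into text returned by the parser.
    static String restore(String text) {
        return text.replace(PLACEHOLDER + "#", "&#");
    }

    // Returns the concatenated character data of the document,
    // with numeric character references left exactly as written.
    static String readText(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Coalescing merges adjacent text and CDATA into CHARACTERS events.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(protect(xml)));
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
            return restore(text.toString());
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readText("<string>&#38;</string>"));
        System.out.println(readText("<string>&#x26;</string>"));
    }
}
```

Named entity references such as `&amp;` are untouched by the substitution, so the parser still resolves them as usual. The same trick applies to attribute values, provided `restore` is called on those as well.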

Michael Kay