I'm trying to parse XML files from different sources (over which I have little control). Most of them are encoded in UTF-8 and don't cause any problems using the following snippet:

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
InputSource is = new InputSource(getInputStream());
parser.parse(is, handler);

Since SAX defaults to UTF-8, this is fine. However, some of the documents declare:

<?xml version="1.0" encoding="ISO-8859-1"?>

Even though ISO-8859-1 is declared, SAX still defaults to UTF-8. Only if I add:

is.setEncoding("ISO-8859-1");

will SAX use the correct encoding.

How can I let SAX automatically detect the correct encoding from the XML declaration without setting it explicitly? I need this because I don't know beforehand what the encoding of the file will be.

Thanks in advance, Allan

Allan

2 Answers

18

Use an InputStream as the argument to InputSource when you want SAX to autodetect the encoding.

If you want to set a specific encoding, use a Reader with a specified encoding or the setEncoding method.

Why? Because encoding autodetection algorithms require raw bytes, not data that has already been converted to characters.
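
As a rough illustration of the byte-stream approach, the sketch below feeds the same text, encoded once as ISO-8859-1 and once as UTF-8, to the parser through `new InputSource(InputStream)` without ever calling `setEncoding`. The class name `AutoDetectDemo`, the `TextCollector` handler (a stand-in for the question's `FeedHandler`), and the sample documents are invented for the example, and it assumes the SAX parser bundled with the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class AutoDetectDemo {

    /** Hypothetical handler that collects all character data in the document. */
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses raw bytes through a byte-stream InputSource and returns the text content. */
    static String parseBytes(byte[] xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            TextCollector handler = new TextCollector();
            // Byte stream and no setEncoding(): the parser works out the encoding itself.
            parser.parse(new InputSource(new ByteArrayInputStream(xml)), handler);
            return handler.text.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // \u00e9 is "e with acute accent": one byte in ISO-8859-1, two bytes in UTF-8.
        String latin1Doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>caf\u00e9</t>";
        String utf8Doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><t>caf\u00e9</t>";

        String fromLatin1 = parseBytes(latin1Doc.getBytes(StandardCharsets.ISO_8859_1));
        String fromUtf8 = parseBytes(utf8Doc.getBytes(StandardCharsets.UTF_8));

        System.out.println("ISO-8859-1 decoded correctly: " + fromLatin1.equals("caf\u00e9"));
        System.out.println("UTF-8 decoded correctly: " + fromUtf8.equals("caf\u00e9"));
    }
}
```

If both checks report `true`, the parser has honored each document's own declaration, which is the behavior the question was asking for.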

The question in the subject is: How to let the SAX parser determine the encoding from the xml declaration? I found Allan's answer to that question misleading, so I am providing this alternative, based on Jörn Horstmann's comment and my own later experience.

Jarekczek
  • 3
    Yes: the _key point_ is that SAX will detect the encoding from the `<?xml ...?>` PI _only_ if the `InputSource` is constructed from an `InputStream` instance; it won't work if constructed from a `Reader` (because the point of a `Reader` is that its output is 'post-decoding'). That is: `new InputSource(getInputStream())` is correct. – Norman Gray Jul 03 '14 at 11:10
  • On a side note, is there any library that parses just the XML declaration using the algorithms above? I am asking because I can't use SAX directly, but I would like to extract the encoding info from my XML files. – Andrea Richiardi Oct 02 '14 at 17:02
  • This should be the accepted solution. InputStream has no encoding information, so SAX determines the encoding itself by trying to read the encoding attribute from the XML file. This also works when working with the XsltTransformer. – phobic Aug 25 '16 at 10:01
  • Is there any way to get the exact content of the "encoding" attribute of the XML prolog? Xerces locator doesn't work. – Kuronashi Dec 20 '19 at 09:48
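
On the two comments asking how to read just the declared encoding: one possible route, not taken from the answer above, is the StAX API in the JDK, whose `XMLStreamReader.getCharacterEncodingScheme()` is documented to return the encoding declared on the XML declaration, or `null` if none was declared. The sketch below is an assumption-laden illustration; the class name `DeclaredEncodingDemo` and the sample document are made up, and behavior may vary between StAX implementations.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class DeclaredEncodingDemo {

    /** Returns the encoding named in the XML declaration, or null if there is none. */
    static String declaredEncoding(byte[] xml) {
        try {
            // A byte stream is used so that the reader sees the raw declaration.
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(xml));
            try {
                return reader.getCharacterEncodingScheme();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not read XML declaration", e);
        }
    }

    public static void main(String[] args) {
        String doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>text</t>";
        byte[] bytes = doc.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("declared encoding: " + declaredEncoding(bytes));
    }
}
```

This only reports what the document claims; as the comments on the other answer note, the declaration itself can be wrong.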
8

I found the answer myself.

The SAX parser uses InputSource internally and from the InputSource docs:

The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.

So basically you need to pass a character stream to the parser for it to pick up the correct encoding. See solution below:

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
Reader isr = new InputStreamReader(getInputStream());
InputSource is = new InputSource();
is.setCharacterStream(isr);
parser.parse(is, handler);
Allan
  • 15
    Constructing an InputStreamReader without specifying a charset will use the default charset of your machine, which is probably ISO-8859-1. As you quoted, the encoding declaration in the XML will be ignored when using a character stream, so this code will only work with ISO-8859-1 documents. Your original code should actually have worked; maybe you could add the exception or the exact problem you are seeing to your question. When using a byte stream and not setting the encoding on the InputSource, the XML parser should autodetect the encoding as described in http://www.w3.org/TR/REC-xml/#sec-guessing. – Jörn Horstmann Aug 14 '10 at 10:38
  • Basically I get an invalid token exception if I don't use "is.setCharacterStream()". – Allan Aug 15 '10 at 20:57
  • 4
    This may have worked for you, but Jörn is right. The documentation you referenced is relevant and correct. And it tells you that the original code with InputStream was correct. The bug is in the document itself. If you use a workaround like overriding the encoding or autodetecting it some other way than the XML spec, as you are doing with InputStreamReader, you should document that fact. – John Watts Jun 21 '12 at 11:11
  • This almost tricked me due to the upvotes. I'm glad I decided to write additional tests. As for him, it just so happened that the specific item I was parsing needed to be in Windows-1252 (and I am on Windows)... but some files are also UTF-8. – shaddow Apr 15 '23 at 00:04
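
The precedence rules quoted from the InputSource docs, and the objections raised in the comments above, can be checked with a small experiment. The sketch below builds a deliberately mislabeled document (it declares ISO-8859-1 but is actually encoded as UTF-8) and parses it twice: once through a `Reader`, where the declaration should be disregarded, and once through a byte stream, where the declaration should be trusted. The class `CharacterStreamDemo` and its helper methods are hypothetical names, and the sketch assumes the JDK's bundled parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class CharacterStreamDemo {

    /** Runs the parser over the source and returns all character data it reports. */
    static String textOf(InputSource source) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    /** A mislabeled document: declares ISO-8859-1 but is encoded as UTF-8. */
    static byte[] mislabeledDocument() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>caf\u00e9</t>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    /** Character stream: the caller's charset decides, the declaration is ignored. */
    static String viaReader(byte[] xml) {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), StandardCharsets.UTF_8);
        return textOf(new InputSource(reader));
    }

    /** Byte stream with no setEncoding(): the parser follows the declaration. */
    static String viaByteStream(byte[] xml) {
        return textOf(new InputSource(new ByteArrayInputStream(xml)));
    }

    public static void main(String[] args) {
        byte[] doc = mislabeledDocument();
        // The Reader decodes as UTF-8 regardless of the declaration, so the text survives.
        System.out.println("reader result intact: " + viaReader(doc).equals("caf\u00e9"));
        // The byte stream trusts the (wrong) declaration, turning the 2 UTF-8 bytes into 2 chars.
        System.out.println("byte stream followed declaration: "
                + viaByteStream(doc).equals("caf\u00c3\u00a9"));
    }
}
```

Passing an explicit charset to `InputStreamReader` keeps the experiment independent of the machine's default charset, which is the variable that made the `new InputStreamReader(getInputStream())` workaround appear to work on some systems and not others.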