I want to parse an XML file from a URL using JDOM. But when I try this:

SAXBuilder builder = new SAXBuilder();
builder.build(aUrl);

I get this exception:

Invalid byte 1 of 1-byte UTF-8 sequence.

I thought this might be a BOM issue, so I checked the source and saw a BOM at the beginning of the file. I tried reading from the URL using aUrl.openStream() and removing the BOM with Commons IO's BOMInputStream, but to my surprise it didn't detect any BOM. I then tried reading from the stream, writing it to a local file, and parsing the local file. I set the encoding of both the InputStreamReader and the OutputStreamWriter to UTF-8, but when I opened the file it was full of garbage characters.

I then suspected the encoding of the source URL. But when I open the URL in a browser, save the XML to a file, and run that file through the process described above, everything works fine.

I appreciate any help on the possible cause of this issue.

doctrey
  • Can you upload the offending file somewhere? – millimoose Dec 12 '11 at 21:32
  • Does SAXBuilder have a known bug with BOMs in UTF-8? XML parsers should handle them without error. Either way, from that description I'd be more inclined to suspect it's not UTF-8 at all. – Jon Hanna Dec 12 '11 at 21:38
  • @JonHanna I don't know about SAXBuilder; I couldn't find anything pointing to a problem with it. As for the second point, the file states in its prolog that it's UTF-8. Also, when I try to view it in any other encoding, the BOM at the beginning shows up. – doctrey Dec 12 '11 at 21:55
  • For the BOM problem, you can take a look [here](http://stackoverflow.com/questions/5353783/why-org-apache-xerces-parsers-saxparser-does-not-skip-bom-in-utf8-encoded-xml/5354030#5354030) – javanna Dec 12 '11 at 22:35

2 Answers

That HTTP server is sending the content in GZIPped form (Content-Encoding: gzip; see http://en.wikipedia.org/wiki/HTTP_compression if you don't know what that means), so you need to wrap aUrl.openStream() in a GZIPInputStream that will decompress it for you. For example:

builder.build(new GZIPInputStream(aUrl.openStream()));
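If you want to confirm that a response really is gzipped (rather than UTF-8 with a BOM), one option is to peek at its first two bytes: gzip data starts with 0x1f 0x8b, whereas a UTF-8 BOM is 0xef 0xbb 0xbf. The following is only a rough sketch; the class name GzipSniffer and the method looksGzipped are made up for illustration, and the caller is assumed to pass a PushbackInputStream created with a pushback buffer of at least 2 bytes.

```java
import java.io.IOException;
import java.io.PushbackInputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of JDOM or Commons IO.
class GzipSniffer {
    // Returns true if the stream starts with the gzip magic number (0x1f, 0x8b).
    // The bytes that were read are pushed back, so the stream can still be
    // consumed from the beginning afterwards.
    static boolean looksGzipped(PushbackInputStream in) throws IOException {
        int b1 = in.read();
        if (b1 == -1) {
            return false; // empty stream
        }
        int b2 = in.read();
        if (b2 != -1) {
            in.unread(b2);
        }
        in.unread(b1); // pushed back last, so it is read first again
        // GZIP_MAGIC is 0x8b1f: the second byte in the high bits, the first in the low bits.
        return b2 != -1 && ((b2 << 8) | b1) == GZIPInputStream.GZIP_MAGIC;
    }
}
```

It could be used as looksGzipped(new PushbackInputStream(aUrl.openStream(), 2)); a result of true would explain why BOMInputStream found no BOM.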

Edited to add, based on the follow-up comment: If you don't know in advance whether the URL will be GZIPped, you can write something like this:

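import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;

// Opens the URL and decompresses the body if the server sent it gzipped.
private InputStream openStream(final URL url) throws IOException
{
    final URLConnection cxn = url.openConnection();
    final String contentEncoding = cxn.getContentEncoding();
    if(contentEncoding == null)
        return cxn.getInputStream();  // no Content-Encoding header: use the body as-is
    else if(contentEncoding.equalsIgnoreCase("gzip")
               || contentEncoding.equalsIgnoreCase("x-gzip"))
        return new GZIPInputStream(cxn.getInputStream());
    else
        throw new IOException("Unexpected content-encoding: " + contentEncoding);
}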

(warning: not tested) and then use:

builder.build(openStream(aUrl));

This is basically equivalent to the above (aUrl.openStream() is explicitly documented to be shorthand for aUrl.openConnection().getInputStream()), except that it examines the Content-Encoding header before deciding whether to wrap the stream in a GZIPInputStream.

See the documentation for java.net.URLConnection.
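Since the method above needs a live URL, it is awkward to try out. One way to check the header-based decision without a network connection is to separate it from the connection handling, roughly as follows. This is only a sketch: the class name ContentDecoder and the method decode are hypothetical, and the caller is assumed to pass in the value of cxn.getContentEncoding().

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper that isolates the Content-Encoding decision.
class ContentDecoder {
    // Wraps `raw` according to the Content-Encoding header value.
    // `contentEncoding` is null when the server sent no such header.
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding == null) {
            return raw; // body was sent as-is
        }
        if (contentEncoding.equalsIgnoreCase("gzip")
                || contentEncoding.equalsIgnoreCase("x-gzip")) {
            return new GZIPInputStream(raw); // decompress on the fly
        }
        throw new IOException("Unexpected content-encoding: " + contentEncoding);
    }
}
```

With a URLConnection cxn, this would be called as decode(cxn.getInputStream(), cxn.getContentEncoding()).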

ruakh
  • Dude, thanks, that solved the problem. You have no idea how much you helped me. One question though: if I use GZIPInputStream to wrap any input stream, will that cause problems with the ones that are not gzipped? – doctrey Dec 12 '11 at 22:38
  • I tested it, and yes, it does cause problems. It throws an IOException if the stream is not in GZIP format. – doctrey Dec 12 '11 at 22:53
  • @doctrey: You're welcome! Re: non-GZIPped streams: Yeah, that would be a problem, since `GZIPInputStream` requires that its input be GZIPped. I've edited my answer to give (untested) code to handle both cases. – ruakh Dec 12 '11 at 22:54

You may be able to avoid handling encoded responses altogether by sending a blank Accept-Encoding header. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html: "If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding." That seems to be what is happening here.
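With java.net.URLConnection, a request header can be set through setRequestProperty before the connection is used. Here is a minimal sketch, assuming the server honors the header; it sends the explicit identity coding (which RFC 2616 defines as "no transformation") rather than a literally blank value, and the class name IdentityRequest and the method openUncompressed are made up for illustration.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper, shown only to illustrate setting the header.
class IdentityRequest {
    // Prepares a connection that asks the server not to apply any content coding.
    // openConnection() does not contact the server yet; the request is sent
    // once getInputStream() (or connect()) is called on the result.
    static URLConnection openUncompressed(URL url) throws IOException {
        URLConnection cxn = url.openConnection();
        cxn.setRequestProperty("Accept-Encoding", "identity");
        return cxn;
    }
}
```

It would then be used as builder.build(IdentityRequest.openUncompressed(aUrl).getInputStream()). A server is not obliged to respect the header, so checking Content-Encoding on the response is still the safer approach.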

Danny Thomas