I have a problem with the SAX parser and encoded text. I am trying to parse an RSS feed encoded in ISO-8859-2 (http://www.sbazar.cz/rss.xml?keyword=pes) this way:

InputStream responseStream = connection.getInputStream();
Response response = mRequest.createResponse();

Reader reader = new InputStreamReader(responseStream);
InputSource is = new InputSource(reader);
is.setEncoding("ISO-8859-2");

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
saxParser.parse(is, response);

However, the parser returns strings containing strange symbols. I have tried a lot of things, but nothing helped. Can somebody help me, please?

petrnohejl
  • Try it with UTF-8; at least that is the encoding my browser reports. Or you might need to read the encoding from the response header, if it is present. You can also set the encoding in the InputStreamReader; maybe it needs to be set in both places – zapl Mar 26 '12 at 21:43
  • I tried UTF-8, but it still returns strange symbols. I also tried to set encoding in the InputStreamReader with no effect. Response header is: HTTP/1.1 200 OK Date: Mon, 26 Mar 2012 20:19:21 GMT Server: Apache Vary: Accept-Encoding Content-Type: application/rss+xml Transfer-Encoding: chunked – petrnohejl Mar 26 '12 at 21:54

3 Answers

Have you tried setting the charset of the InputStreamReader?

Reader reader = new InputStreamReader(responseStream, Charset.forName("ISO-8859-2"));
InputSource is = new InputSource(reader);

If you don't specify a charset, the InputStreamReader(InputStream) constructor uses the platform's default charset (which on my machine is windows-1252).

So in your current setup, the bytes are (probably) being decoded as windows-1252 characters before the parser ever sees them, and after that I don't think you can re-interpret them as ISO-8859-2. Note also that InputSource.setEncoding() has no effect when the InputSource wraps a character stream (a Reader).
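To illustrate the point above, here is a minimal sketch that decodes the same ISO-8859-2 bytes through an InputStreamReader twice, once with the right charset and once with windows-1252. The class name, the helper `decode`, and the sample text are made up for the illustration; windows-1252 stands in for whatever the platform default happens to be.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class CharsetMismatchDemo {

    // Hypothetical helper: reads every character from the bytes
    // through an InputStreamReader that uses the given charset.
    public static String decode(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset latin2 = Charset.forName("ISO-8859-2");
        Charset cp1252 = Charset.forName("windows-1252");

        // Sample Czech text written with Unicode escapes: "zlutoucky pes" with diacritics.
        String original = "\u017Elu\u0165ou\u010Dk\u00FD pes";

        // Simulate the bytes the server would send for an ISO-8859-2 feed.
        byte[] wire = original.getBytes(latin2);

        // Decoding with the matching charset restores the text.
        System.out.println(decode(wire, latin2).equals(original)); // true

        // Decoding with a different single-byte charset yields other characters.
        System.out.println(decode(wire, cp1252).equals(original)); // false
    }
}
```

Once the wrong characters have been produced by the Reader, the parser has no way of knowing what the original bytes were, which is why the charset has to be given to the InputStreamReader itself.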

Chris White

SAX can autodetect the encoding if it is given an input stream rather than a reader.

InputSource is = new InputSource(responseStream);

In your case you probably wanted a hardcoded encoding, and you already got an answer on how to do that. But I was looking for a general solution and found one here: Howto let the SAX parser determine the encoding from the xml declaration?

Documentation: InputSource in Java 5 (note that the Java 1.4 documentation lacks the crucial sentence): "autodetecting the character encoding using an algorithm such as the one in the XML specification". That sentence refers to a byte stream, not to a character stream (Reader).

Digging further into the XML specification (Autodetection of Character Encodings), I found an explanation of why a Reader and a stream are treated differently. To apply the whole detection algorithm, SAX must have access to the raw byte stream rather than one already converted to characters, because the conversion could corrupt the byte markers.
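As a rough sketch of the autodetection described above: the snippet below builds a small ISO-8859-2 document in memory, hands the raw bytes to the parser through InputSource(InputStream), and lets the parser pick the encoding from the XML declaration. The class name, the helper `parseText`, and the sample document are invented for the example, and it assumes the JDK's bundled SAX parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxAutodetectDemo {

    // Hypothetical helper: parses raw XML bytes and returns all character data.
    // No charset is passed in; the parser has to work it out from the bytes.
    public static String parseText(byte[] xmlBytes) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            // Byte stream, not a Reader, so the encoding declaration is honoured.
            InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Sample Czech text written with Unicode escapes.
        String title = "\u017Elu\u0165ou\u010Dk\u00FD pes";
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-2\"?>"
                + "<title>" + title + "</title>";

        // Encode the document the way the declaration says it is encoded.
        byte[] bytes = xml.getBytes(Charset.forName("ISO-8859-2"));

        System.out.println(parseText(bytes).equals(title)); // true
    }
}
```

This only helps when the document declares its encoding correctly (or the default UTF-8 applies); if the feed's declaration is wrong or missing, an explicit charset is still needed.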

Jarekczek

In the end, I solved my problem by using the Rome library, which also works well with ISO-8859-2. Here is the source code showing how to use Rome:

String urlstring = "http://www.sbazar.cz/rss.xml?keyword=pes";
InputStream is = new URL(urlstring).openConnection().getInputStream();
SyndFeedInput input = new SyndFeedInput();
SyndFeed feed = (SyndFeed)input.build(new InputStreamReader(is, Charset.forName("ISO-8859-2")));

Iterator entries = feed.getEntries().iterator();
while (entries.hasNext())
{
    SyndEntry entry = (SyndEntry)entries.next();
    Log.d("RSS", "-------------");
    Log.d("RSS", "Title: " + entry.getTitle());
    Log.d("RSS", "Published: " + entry.getPublishedDate());

    if (entry.getDescription() != null) 
    {
        Log.d("RSS", "Description: " + entry.getDescription().getValue());
    }
    if (entry.getContents().size() > 0) 
    {
        SyndContent content = (SyndContent)entry.getContents().get(0);
        Log.d("RSS", "Content type=" + content.getType());
        Log.d("RSS", "Content value=" + content.getValue());
    }
} 
petrnohejl