
Here's my problem; I have an InputStream that I've converted to a byte array, but I don't know the character set of the InputStream at runtime. My original thought was to do everything in UTF-8, but I see strange issues with streams that are encoded as ISO-8859-1 and have foreign characters. (Those crazy Swedes)

Here's the code in question:

IOUtils.toString(inputstream, "utf-8")
// Fails on iso8859-1 foreign characters

To simulate this, I have:

new String("\u00F6")
// Returns ö as expected — this is just a String copy, no byte decoding involved

new String("\u00F6".getBytes("utf-8"), "utf-8")
// Also returns ö as expected.

new String("\u00F6".getBytes("iso-8859-1"), "utf-8")
// Returns \uFFFD, the replacement character

What am I missing?
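As a side note, the silent substitution in the last snippet can be made to fail loudly instead: the convenience `String` constructor replaces malformed input, while a `CharsetDecoder` (which defaults to reporting errors) throws. A standard-library sketch:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) throws Exception {
        byte[] latin1Bytes = "\u00F6".getBytes("ISO-8859-1");

        // new String(bytes, "utf-8") silently replaces bad input with U+FFFD;
        // a CharsetDecoder set to REPORT throws instead
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(latin1Bytes));
            System.out.println("decoded fine");
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8");
        }
    }
}
```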

user2045359
  • If you don't know the encoding of the (ostensible) characters encoded within the `InputStream`, you cannot turn it into characters. It's just that simple. And... Why would you expect that encoding to ISO-8859-1 and then decoding from UTF-8 would work for arbitrary characters? – Randall Schulz Feb 06 '13 at 03:36
  • Nit: `new String("\u00F6")` having a value as expected has *nothing* to do with encoding .. –  Feb 06 '13 at 04:35
  • Determining the encoding at runtime is the reason `Content-Type` headers and their respective `charset` parameters exist – Kristian Domagala Feb 06 '13 at 06:09
  • This is not just a Swedish letter, but also a German umlaut. :) – Madoc Feb 06 '13 at 09:11
  • To be extra clear it is the `"utf-8"` arg (in `new String("\u00F6".getBytes("iso-8859-1"), "utf-8")`) that causes the problem - the call `System.out.println(new String("\u00F6".getBytes("iso-8859-1")));` would very nicely print `ö` – Mr_and_Mrs_D Apr 11 '13 at 11:46

2 Answers


Not every sequence of bytes is valid UTF-8. By converting \u00F6 into its Latin-1 equivalent, you produced a single byte (0xF6) that is not a valid UTF-8 sequence on its own.
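To see this concretely, here is a small standard-library sketch: \u00F6 encodes to the single byte 0xF6 in Latin-1, which is an incomplete UTF-8 sequence, so the lenient String constructor substitutes U+FFFD.

```java
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) throws Exception {
        // "ö" encoded as Latin-1 is the single byte 0xF6
        byte[] latin1 = "\u00F6".getBytes("ISO-8859-1");
        System.out.println(latin1.length);                    // 1
        System.out.println(String.format("%02X", latin1[0])); // F6

        // 0xF6 is not a complete (or even legal) UTF-8 sequence,
        // so the lenient String constructor substitutes U+FFFD
        String decoded = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(decoded.charAt(0))); // fffd
    }
}
```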

Daniel C. Sobral

Ideally the source of the data tells you its encoding; if it cannot, you have to either reject the input or guess the encoding.

For western languages, guessing ISO-8859-1 if it's not UTF-8 is probably going to work most of the time:

ByteBuffer bytes = ByteBuffer.wrap(IOUtils.toByteArray(inputstream));
CharBuffer chars;

try {
    chars = Charset.forName("UTF-8").newDecoder().decode(bytes);
} catch (CharacterCodingException e) {
    // The failed attempt advances the buffer position,
    // so rewind before decoding the same bytes again
    bytes.rewind();
    try {
        chars = Charset.forName("ISO-8859-1").newDecoder().decode(bytes);
    } catch (CharacterCodingException cannotHappen) {
        // ISO-8859-1 maps every byte value, so this is unreachable
        throw new AssertionError(cannotHappen);
    }
}
System.out.println(chars.toString());

All this boilerplate exists because the decoder throws checked exceptions and because we need to be able to decode the same data more than once.

You can also use Mozilla Chardet, which uses more sophisticated heuristics to determine the encoding if it's not UTF-8. But it's not perfect: for instance, I recall it detecting Finnish text in Windows-1252 as Hebrew Windows-1255.

Also note that any sequence of bytes is valid ISO-8859-1, which is why you try UTF-8 first (if the data decodes as UTF-8 without errors, it is extremely likely to actually be UTF-8) and why you cannot usefully try to detect anything else after ISO-8859-1.
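A quick way to convince yourself of this, sketched with the standard java.nio charset API: a strict ISO-8859-1 decoder accepts every possible byte value (each byte maps 1:1 to U+0000..U+00FF), while a strict UTF-8 decoder rejects the same input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class Main {
    public static void main(String[] args) throws Exception {
        // Every possible byte value 0x00-0xFF
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) all[i] = (byte) i;

        // A strict ISO-8859-1 decoder accepts all of them, so it never throws
        CharsetDecoder latin1 = Charset.forName("ISO-8859-1").newDecoder();
        String s = latin1.decode(ByteBuffer.wrap(all)).toString();
        System.out.println(s.length()); // 256 -- nothing was rejected

        // The same bytes are rejected by a strict UTF-8 decoder
        CharsetDecoder utf8 = Charset.forName("UTF-8").newDecoder();
        try {
            utf8.decode(ByteBuffer.wrap(all));
            System.out.println("decoded fine");
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8");
        }
    }
}
```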

Esailija
  • 138,174
  • 23
  • 272
  • 326