Encoding difficulties

Question

I'm having some encoding problems with some code I'm working on. An encrypted string is received and decoded with ISO-8859-1. This string is then stored in a DB that uses UTF-8 encoding. When the string is retrieved it's still ISO-8859-1, and there are no problems. The issue is that I also need to be able to retrieve this string as UTF-8, but I haven't been successful at this.

I've tried to convert the string from ISO to UTF-8 when retrieved from the DB using this method:

private String convertIsoToUtf8(String isoLatin) {
    try {
        return new String(isoLatin.getBytes("ISO_8859_1"), "UTF_8");
    } catch (UnsupportedEncodingException e) {
        return isoLatin;
    }
}

Unfortunately, the special characters are just displayed as question-marks in this case.

Original string: Test æøå

Example output after retrieving from DB and converting to UTF-8: Test ???

Update: After reading the link provided in the comment, I managed to get it right. Since the DB is already UTF-8 encoded, all I needed to do was this:

return new String(isoLatin.getBytes("UTF-8"));

search for balusC or http://balusc.omnifaces.org/2009/05/unicode-how-to-get-characters-right.html — Scary Wombat, Aug 18 '16 at 05:37
Think of a `String` as a sequence of characters independent of any encoding. Encodings only play a role if you convert a String to bytes or vice versa. Code like `new String(isoLatin.getBytes("ISO_8859_1"), "UTF_8")` does not make sense. — Henry, Aug 18 '16 at 05:57
I'm trying to find out the origins of that broken piece of code. It had [another victim](http://stackoverflow.com/questions/38890321/recover-wrongly-encoded-character-java/38890501#38890501) a few weeks ago. — Kayaman, Aug 18 '16 at 06:44
@Kayaman and it keeps popping up: http://stackoverflow.com/questions/39262555/interpret-a-string-from-one-encoding-to-another-in-java/39264158#39264158 — piet.t, Sep 01 '16 at 07:39
@piet.t I'd really love to know where that comes from. There has to be some misinformed tutorial that promotes this "solution". — Kayaman, Sep 01 '16 at 07:47

piet.t · Answer 1 · 2016-08-18T06:55:16.993

3

When you already have a String object it is usually too late to correct any encoding issues, since some information may already have been lost. Think of characters that can't be mapped one-to-one onto Java's internal UTF-16 representation.
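To illustrate the kind of loss meant here, a minimal sketch (the class and method names are made up for this example, and the sample text is the string from the question): bytes that were written as ISO-8859-1 are decoded once as UTF-8 and once as ISO-8859-1, then encoded back with the same charset. Only the second round trip returns the original bytes, because the UTF-8 decoder replaces the invalid byte sequences with U+FFFD.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of the asker's code.
final class LossyDecodeDemo {

    /** Decodes the bytes as UTF-8 and encodes the resulting String back to UTF-8. */
    static byte[] roundTripAsUtf8(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    /** Same round trip with ISO-8859-1, which maps every byte to exactly one char. */
    static byte[] roundTripAsLatin1(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.ISO_8859_1);
        return decoded.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The sample string from the question, written with Unicode escapes
        // so that the encoding of this source file does not matter.
        byte[] latin1Bytes = "Test \u00e6\u00f8\u00e5".getBytes(StandardCharsets.ISO_8859_1);

        // Wrong charset: the three non-ASCII bytes are not valid UTF-8, so the
        // decoder substitutes U+FFFD and the original bytes cannot be recovered.
        String wrong = new String(latin1Bytes, StandardCharsets.UTF_8);
        System.out.println("contains U+FFFD: " + (wrong.indexOf('\uFFFD') >= 0));
        System.out.println("bytes survive UTF-8 round trip: "
                + Arrays.equals(latin1Bytes, roundTripAsUtf8(latin1Bytes)));

        // Right charset: nothing is lost.
        System.out.println("bytes survive ISO-8859-1 round trip: "
                + Arrays.equals(latin1Bytes, roundTripAsLatin1(latin1Bytes)));
    }
}
```

Once the String holds U+FFFD there is no way to tell which byte it used to be, so no later conversion can repair it.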

The correct place to handle character encoding is the moment you get your Strings: when reading input from a file (set the correct encoding on your InputStreamReader), when converting the byte[] you got from decryption, when reading from the database (this should be handled by your JDBC driver), etc.
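A rough sketch of what "at the moment you get your Strings" could look like, assuming the decrypted payload is ISO-8859-1 (as the question states) and, purely for illustration, that the file content is UTF-8. The class and method names are hypothetical, and an in-memory stream stands in for a real file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of the asker's code.
final class DecodeAtBoundary {

    /** Turns raw bytes (e.g. the result of decryption) into a String with an explicit charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    /** Reads the first line of a stream, telling the reader which charset the bytes use. */
    static String readFirstLine(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String expected = "Test \u00e6\u00f8\u00e5";

        // Stand-in for the decrypted payload, assumed to be ISO-8859-1.
        byte[] payload = expected.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(expected.equals(decode(payload, StandardCharsets.ISO_8859_1))); // true

        // Stand-in for a file, assumed here to be UTF-8.
        byte[] fileContent = (expected + "\n").getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(fileContent);
        System.out.println(expected.equals(readFirstLine(in, StandardCharsets.UTF_8))); // true
    }
}
```

After this point the String is just characters and can be passed around without any further "conversion".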

Also take care to handle the encoding correctly when doing the reverse. While using the default encoding might seem to work OK most of the time, sooner or later you might run into issues that are difficult or impossible to resolve (as you do now).
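For the reverse direction, a small sketch under the same caveats (hypothetical class and method names, an in-memory stream instead of a real file or socket): the charset is passed explicitly to the OutputStreamWriter, so the bytes produced do not depend on the platform default. It also shows where question marks like the ones in the question can come from: a charset that cannot represent a character writes a replacement instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of the asker's code.
final class EncodeOnOutput {

    /** Writes the text through a Writer with an explicit charset and returns the bytes produced. */
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) before the bytes are read.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The three special characters from the question, as Unicode escapes.
        String text = "\u00e6\u00f8\u00e5";

        System.out.println(write(text, StandardCharsets.UTF_8).length);      // 6: two bytes per character
        System.out.println(write(text, StandardCharsets.ISO_8859_1).length); // 3: one byte per character

        // US-ASCII cannot represent these characters, so each becomes '?'.
        byte[] ascii = write(text, StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));    // ???
    }
}
```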

P.S.: Also keep in mind what tool you are using to display your output. Some consoles won't display UTF-16 or UTF-8, so check the encoding settings of the console or editor you use to view your files. Sometimes your output is correct and just can't be displayed correctly.

edited Aug 18 '16 at 06:55

answered Aug 18 '16 at 05:57

piet.t

Always specify the encoding. **Never** trust the default platform encoding. – Kayaman Aug 18 '16 at 06:53
@Kayaman That's what I intended to say. I changed the wording to show that using the default encoding might give you the false impression that your application is correct while it isn't. – piet.t Aug 18 '16 at 06:57
@Kayaman well, there are cases when using the platform encoding is the correct thing to do. For example, reading or writing a text file on the target system. – Henry Aug 18 '16 at 07:30
@Henry That's not at all correct. Take any non-notepad text editor and you'll see they'll ask you for the encoding both when reading and writing a file. You can of course *use* the platform encoding, but then you really should specify it explicitly. You don't want a program that uses "it depends" encoding. – Kayaman Aug 18 '16 at 07:51
@Kayaman I disagree. If you write a program that is going to be used on different target systems you don't know in advance what the platform encoding is. Leaving it unspecified has a defined meaning namely to use the platform encoding. There is no "it depends" here. – Henry Aug 18 '16 at 07:54
@Henry The problem when writing using platform encoding is: either you don't know what the encoding is - then you can't know if it will even support the characters you are trying to write. Or you know pretty well what it should be - then why not specify it? – piet.t Aug 18 '16 at 07:55
@Henry "Does the program support Japanese characters?" "It depends on the platform.". That's not a conversation you'll have if you decide the program will always use `UTF-8` regardless of the platform encoding. It's essential *especially* when you're writing for different target systems. – Kayaman Aug 18 '16 at 08:09
@Kayaman In the other case the question is if other programs on the platform processing text files are able to deal with UTF-8. It is really a question of requirements: if you have to write in UTF-8 then by all means specify it. If you have to write in the native encoding of the platform don't do it. If some characters can't be written because they are not supported natively on the platform it is the expected result to see a replacement character. – Henry Aug 18 '16 at 08:15
@Henry Really, there's absolutely **no** excuse not to specify the encoding. Characters can be written because they're bytes. Java must support `UTF-8`, so there's no way you couldn't write UTF-8 data on any system. Using the default platform encoding is wrong, plain and simple. Like `piet.t` said, it can work perfectly and often it does, but when it doesn't work in the worst case you'll end up with corrupted data. There is **never** a valid requirement to use the platform encoding (and if there is, it should still be done by explicitly specifying it). – Kayaman Aug 18 '16 at 08:19
@Henry and of course if you're writing data that will be processed by other programs that use a different encoding, then you'll use that encoding. Come on, that's not the issue here. The issue is never use `"".getBytes();` always use `"".getBytes(wantedEncoding);`. – Kayaman Aug 18 '16 at 08:21
