Why ISO-8859-7 to UTF-8 encoding fails while the reverse is successful

Question

I was trying to read a text file encoded in ISO-8859-7 and save it as UTF-8, or vice versa, since the text file contains Greek and Latin text. I realised that it wasn't so easy (as stated in this question).

But I also noticed that when I read my text file encoded in UTF-8 and save it as ISO-8859-7, it works as expected (readable characters are written to the text file). In the opposite case, reading ISO-8859-7 and writing UTF-8, the outcome is not what I expect.

So my question is: why does this happen? I know I should have followed the approach in the linked question, so I don't need an answer about how to make the encoding work. Does it have to do with the fact that UTF-8 defines more characters than ISO-8859-7?

I am using the following code to accomplish this:

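// Decode the file's bytes into characters using the given encoding
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), encoding));
// Encode characters back into bytes using the given encoding
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), encoding));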

where encoding is just a String naming the encoding (for example "ISO-8859-7" or "UTF-8").

score 0 · Answer 1 · answered Jun 04 '15 at 20:17


How did you verify that it works or doesn't work? Did you check the actual bytes being written to ensure they encode the expected characters?

A common mistake is to eyeball the contents with a command-line tool or an editor. This assumes that the tool knows the actual encoding and does not just guess one. In your case, the viewer may well be defaulting to ISO-8859-7 (or ISO-8859-1, or possibly UTF-8), so it decodes the bytes into the wrong characters and gives the impression of failure.
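One way to check the bytes without trusting a viewer is to print them as hex. The following is only a minimal sketch: the class name `ByteCheck`, the helper `toHex`, and the sample text "αβγ" are made up for illustration and are not from the question. It shows that the same three Greek letters are three bytes in ISO-8859-7 and six bytes in UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteCheck {
    // Hypothetical helper: render bytes as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Greek small letters alpha, beta, gamma, written as escapes so the
        // result does not depend on the encoding of this source file.
        String sample = "\u03B1\u03B2\u03B3";

        // One byte per letter in ISO-8859-7.
        System.out.println(toHex(sample.getBytes(Charset.forName("ISO-8859-7"))));
        // prints: E1 E2 E3

        // Two bytes per letter in UTF-8.
        System.out.println(toHex(sample.getBytes(StandardCharsets.UTF_8)));
        // prints: CE B1 CE B2 CE B3
    }
}
```

To inspect a real file the same way, the bytes from `java.nio.file.Files.readAllBytes` can be passed to `toHex`. If the output file holds `CE B1 ...` style sequences for Greek text, it is valid UTF-8 regardless of how an editor displays it.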


StaxMan


I judged success or failure by inspecting the output in a text editor (Kate, to be specific). I did not examine the actual bytes, though. – Eypros Jun 04 '15 at 20:59
OK. I just wanted to mention this, as it has bitten me before. Otherwise, yes: a single-byte ISO-8859 encoding can represent at most 256 characters from the full Unicode set, whereas UTF-8 can express all of them. But I assume you don't have any characters outside ISO-8859-7 in your text, and if so, transcoding should just work. So the most likely explanation is still a mismatch between the encoding the file actually uses and the encoding the decoder (Reader) uses. – StaxMan Jun 04 '15 at 21:02
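To make the decoder/encoder distinction in the last comment concrete, here is a hedged sketch of a transcoding step in which the source and target charsets are passed separately. The class `Transcode`, the method `transcode`, and the temp-file names are assumptions made for illustration. It also assumes distinct input and output paths, unlike the snippet in the question, which uses the same `file` variable for both streams.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Transcode {
    // Decode `in` with charset `from`, then encode the characters to `out` with charset `to`.
    static void transcode(Path in, Charset from, Path out, Charset to) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, from);
             BufferedWriter writer = Files.newBufferedWriter(out, to)) {
            char[] buffer = new char[8192];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Illustrative temp files; a real run would use the actual paths.
        Path source = Files.createTempFile("greek-iso", ".txt");
        Path target = Files.createTempFile("greek-utf8", ".txt");

        // Bytes E1 E2 E3 are alpha, beta, gamma in ISO-8859-7.
        Files.write(source, new byte[] {(byte) 0xE1, (byte) 0xE2, (byte) 0xE3});

        transcode(source, Charset.forName("ISO-8859-7"), target, StandardCharsets.UTF_8);

        // Three one-byte letters become three two-byte UTF-8 sequences.
        System.out.println(Files.readAllBytes(target).length); // prints: 6
    }
}
```

The readers and writers returned by `Files.newBufferedReader` and `Files.newBufferedWriter` report malformed or unmappable input as an exception instead of silently substituting a replacement character, so a wrong `from` charset for UTF-8 input tends to surface as an error. The reverse mistake (decoding UTF-8 bytes as ISO-8859-7) may still succeed quietly and produce wrong characters, since most byte values map to some character in a single-byte charset.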
