
If I had a file encoded in ISO but wanted to read the file as UTF-8 in Java, would I still get the same text?

Would special characters such as µÃÿ display the same?

Paul

2 Answers


No, you would not. UTF-8 does not encode characters beyond U+007F in the same way as ISO-8859-1 (ISO-8859-1 encodes U+0080 through U+00FF as single bytes \x80 to \xff, while UTF-8 uses two bytes for each of those characters).

You have to use an explicit encoding specification when opening the file: `new InputStreamReader(new FileInputStream(...), <encoding>)`
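A minimal sketch of the point above (the file name is hypothetical): the single byte 0xB5 is 'µ' in ISO-8859-1, but on its own it is not a valid UTF-8 sequence, so decoding it as UTF-8 yields the replacement character U+FFFD instead of the original text.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetDemo {
    public static void main(String[] args) throws IOException {
        // 0xB5 is 'µ' in ISO-8859-1; alone, it is an invalid UTF-8 sequence
        byte[] isoBytes = {(byte) 0xB5};
        Path file = Files.createTempFile("demo", ".txt");
        Files.write(file, isoBytes);

        // Reading with the correct charset recovers the character
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new FileInputStream(file.toFile()), StandardCharsets.ISO_8859_1))) {
            System.out.println(r.readLine()); // prints µ
        }

        // Reading the same bytes as UTF-8 substitutes U+FFFD (�)
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new FileInputStream(file.toFile()), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine()); // prints �
        }

        Files.delete(file);
    }
}
```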

nneonneo
  • On the Internet, just saying "ISO encoding" suggests ISO Latin 1 encoding, since that's the older (more popular) encoding. OP should clarify, to be sure. – nneonneo Sep 19 '12 at 23:09
  • Can you back that up? Why and where is it the most popular? – nullpotent Sep 19 '12 at 23:11
  • Google search. Googling 'ISO encoding' results in no mention of 10646 on the first page at all. Same with "ISO encoding". – nneonneo Sep 19 '12 at 23:16

In short, no. Characters are not represented (bitwise) in ISO-8859-1 the same way they are represented in UTF-8.

However, you can losslessly convert a file from ISO-8859-1 to UTF-8, but not always the other way around, because UTF-8 can represent many characters that ISO-8859-1 cannot.
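As a sketch of such a conversion (assuming the file really is ISO-8859-1, with hypothetical file paths): decode the bytes with the source charset, then re-encode with the target charset.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class IsoToUtf8 {
    public static void main(String[] args) throws IOException {
        // Hypothetical input/output paths
        Path in = Paths.get("input-iso.txt");
        Path out = Paths.get("output-utf8.txt");

        // Decode the raw bytes as ISO-8859-1 into a Java String (UTF-16 internally)
        String text = new String(Files.readAllBytes(in), StandardCharsets.ISO_8859_1);

        // Re-encode the same characters as UTF-8 and write them out
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

Every ISO-8859-1 byte maps to a Unicode code point, so this direction never loses data; going from UTF-8 back to ISO-8859-1 fails for any character outside U+0000–U+00FF.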

My recommendation would be to detect the encoding (see: Java : How to determine the correct charset encoding of a stream) and then to handle each case accordingly.

alvonellos
  • If you can at all avoid using a character detection library, though, then you should. Character detection isn't 100%, and can lead to various weird issues when it gets the answer wrong. – nneonneo Sep 19 '12 at 23:18