Why can't Java read this Unicode char in a UTF-8 file?

Question

"unicode.txt" UTF-8 file

アफਸᙡşüabÇİÜ⏩ア

The first character is encoded in 4 bytes. When I run this code, I don't get the output that I expect:

InputStream in = new FileInputStream("unicode.txt");
InputStreamReader inReader = new InputStreamReader(in, "UTF-8");
char ch = (char)inReader.read();
System.out.println(ch); // Writes the '?' character to the console. Why?

Why doesn't this code write the '' character to the console? And how can I write it?

My default encoding:

System.out.println(System.getProperty("file.encoding")); // output: "UTF-8"
System.out.println(Charset.defaultCharset()); // output: "UTF-8"

I think the problem is the char data type.

Thanks.

Jared Stewart · Accepted Answer · 2018-07-01T00:26:56.880

The char data type is based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of Unicode code points is now U+0000 to U+10FFFF. The set of characters from U+0000 to U+FFFF is called the basic multilingual plane (BMP), and characters whose code points are greater than U+FFFF are called supplementary characters. A char value, therefore, represents BMP code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points.

In particular, do not write code that assumes that a value of the primitive type char (or a Character object) fully represents a Unicode code point.

(From https://wiki.sei.cmu.edu/confluence/plugins/servlet/mobile?contentId=88487813#content/view/88487813)
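To make the quoted distinction concrete, here is a minimal sketch comparing a BMP code point with a supplementary one. The class name SurrogateDemo, the helper charsNeeded, and the code point U+1F600 are chosen only for illustration; they are not taken from the question's file.

```java
public class SurrogateDemo {
    // Number of UTF-16 code units (Java chars) needed to hold one code point.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int bmp = 0x30A2;            // KATAKANA LETTER A, inside the BMP
        int supplementary = 0x1F600; // arbitrary supplementary code point (illustrative)

        System.out.println(charsNeeded(bmp));           // 1
        System.out.println(charsNeeded(supplementary)); // 2

        // A supplementary code point becomes a surrogate pair inside a String.
        String s = new String(Character.toChars(supplementary));
        System.out.println(s.length());                             // 2 chars
        System.out.println(s.codePointCount(0, s.length()));        // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    }
}
```

A single char can hold only one half of that pair, which is why casting the result of one read() call to char cannot yield the whole character.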

In other words, you have stumbled across a Unicode character that is represented by more than one UTF-16 code unit (i.e. more than one char) in the variable-length UTF-16 encoding used by Java.
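One way to read such a character is to combine the two code units into an int code point. The following is a sketch only: the class CodePointReader and the method readCodePoint are hypothetical names, and a StringReader stands in for the file so the example is self-contained. With the question's file, the reader would instead be an InputStreamReader over a FileInputStream opened with UTF-8.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class CodePointReader {
    // Reads the next full code point from the reader, or returns -1 at end of input.
    static int readCodePoint(Reader reader) throws IOException {
        int first = reader.read();
        if (first == -1) {
            return -1;
        }
        char high = (char) first;
        if (Character.isHighSurrogate(high)) {
            // A high surrogate must be followed by a low surrogate.
            int second = reader.read();
            if (second == -1 || !Character.isLowSurrogate((char) second)) {
                throw new IOException("Unpaired surrogate in input");
            }
            return Character.toCodePoint(high, (char) second);
        }
        return first;
    }

    public static void main(String[] args) throws IOException {
        // Illustrative input: U+1F600 (written as a surrogate pair) followed by "ab".
        try (Reader reader = new StringReader("\uD83D\uDE00ab")) {
            int cp;
            while ((cp = readCodePoint(reader)) != -1) {
                System.out.println(String.format("U+%04X", cp));
            }
        }
    }
}
```

To turn a code point back into printable text, new String(Character.toChars(cp)) gives a String holding both code units. Whether the console then shows the glyph still depends on the console's own encoding and font.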

Here is a common pattern for reading through unicode characters: https://stackoverflow.com/a/1527891/3988499 — Jared Stewart, Jun 30 '18 at 23:55
I believe in that last sentence you are incorrectly using the term [code point](https://en.m.wikipedia.org/wiki/Code_point). — Basil Bourque, Jul 01 '18 at 00:15
