
Contents of "unicode.txt" (a UTF-8 file):

アफਸᙡşüabÇİÜ⏩ア

The first character takes 4 bytes. When I run this code, I don't get the output I expect:

InputStream in = new FileInputStream("unicode.txt");
InputStreamReader inReader = new InputStreamReader(in, "UTF-8");
char ch = (char)inReader.read();
System.out.println(ch); // Writes '?' character to the console. Why ?

Why doesn't this code write the '' character to the console? And how can I write it?

My default encoding:

System.out.println(System.getProperty("file.encoding")); // output: "UTF-8"
System.out.println(Charset.defaultCharset()); // output: "UTF-8"

I think the problem is the char data type.

Thanks.

JustOnly

1 Answer


The char data type is based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of Unicode code points is now U+0000 to U+10FFFF. The set of characters from U+0000 to U+FFFF is called the basic multilingual plane (BMP), and characters whose code points are greater than U+FFFF are called supplementary characters. A char value, therefore, represents BMP code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points.

In particular, do not write code that assumes that a value of the primitive type char (or a Character object) fully represents a Unicode code point.

(From https://wiki.sei.cmu.edu/confluence/plugins/servlet/mobile?contentId=88487813#content/view/88487813)

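In other words, you have stumbled across a Unicode character outside the BMP, which takes two code units (i.e. two char values, a surrogate pair) in the variable-length UTF-16 encoding used by Java. A single read() returns only the first half of that pair, which is not a valid character on its own, hence the '?'.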
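One way to handle this is to read a second char whenever the first one is a high surrogate and combine the two into an int code point. The following is only a minimal sketch, not a definitive implementation: the class name CodePointReadDemo and the helper readCodePoint are made up for illustration, and U+1F600 in a StringReader stands in for the first character of your file. For the real file you would wrap your InputStreamReader in the PushbackReader instead.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

class CodePointReadDemo {

    // Hypothetical helper: reads one full Unicode code point,
    // or returns -1 at end of stream.
    public static int readCodePoint(PushbackReader reader) throws IOException {
        int first = reader.read();
        if (first == -1) {
            return -1;
        }
        if (Character.isHighSurrogate((char) first)) {
            int second = reader.read();
            if (second != -1 && Character.isLowSurrogate((char) second)) {
                // Combine the surrogate pair into one supplementary code point.
                return Character.toCodePoint((char) first, (char) second);
            }
            if (second != -1) {
                // Not a valid pair: put the second char back for the next call.
                reader.unread(second);
            }
        }
        // BMP character (or an unpaired surrogate, returned as-is).
        return first;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: U+1F600 stands in for the 4-byte character in the file.
        PushbackReader reader = new PushbackReader(new StringReader("\uD83D\uDE00ab"));
        int codePoint = readCodePoint(reader);
        System.out.println(Integer.toHexString(codePoint)); // prints "1f600"
        // Convert the int back to text; a single char cannot hold it.
        System.out.println(new String(Character.toChars(codePoint)));
    }
}
```

Whether the character itself shows up correctly still depends on the console's encoding and font, so the hex value is the more reliable check that the read worked.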

Jared Stewart