
I cannot understand how System.in.read() method works.

There is such a code:

    public static void main(String[] args) throws IOException {
        while (true) {
            Integer x = System.in.read();
            System.out.println(Integer.toString(x, 2));
        }
    }

I know that the System.in.read() method reads from the input stream ONE BYTE at a time.

So when I enter 'A' (U+0041, stored in one byte), the program output is:

 1000001 (U+0041)
 1010 (NL) - it works as expected.

But when I enter 'Я' (U+042F, stored in two bytes), the output is:

 11010000 (byte1)
 10101111 (byte2)
 1010 (byte3 - NL)

The actual binary code point of the letter 'Я' (U+042F) is 10000101111.

Why isn't 11010000 10101111 (byte1 + byte2) the binary code for the letter 'Я' (U+042F)?

dimmxx
    Because `11010000 10101111` is the Unicode character U+042F encoded in UTF-8. – MC Emperor Sep 24 '19 at 17:07
  • A text stream *encodes* the characters in binary. Your input stream is probably in UTF8, otherwise the ASCII range would not be single-byte but would have the `00` part as well. Read about UTF-8. – RealSkeptic Sep 24 '19 at 17:08
  • See https://stackoverflow.com/questions/40124088/if-%e2%84%a4-is-in-the-bmp-why-isnt-it-encoded-in-2-bytes/40124195#40124195 – MC Emperor Sep 24 '19 at 17:09
  • `'A'` is within single-byte territory: decimal value 65, well within the -128 to 127 numeric range of a single byte. `'Я'` is multi-byte because the decimal value 1,071 cannot be represented by a single byte. – Kaan Sep 24 '19 at 17:16
  • As a side note, you should use `int` here. There is no reason to box the value into an `Integer` object. – Holger Sep 25 '19 at 13:16

1 Answer


This will depend on the external process that is sending data to System.in. It could be a command shell, an IDE, or another process.

In the typical case of a command shell, the shell will have a character encoding configured (`chcp` on Windows, `locale charmap` on Linux).

The character encoding determines how a graphical character, or glyph, is represented as bytes. For example, a Windows machine might use the "Windows-1251" code page and encode "Я" as one byte (0xCF). Or, it could use UTF-8 and encode "Я" as two bytes (0xD0 0xAF), or UTF-16 and use two different bytes (0x04 0x2F).
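You can verify these byte sequences yourself with `String.getBytes(Charset)`; here is a minimal sketch (it assumes the windows-1251 charset is available on the running JVM, which it is on standard Oracle/OpenJDK builds):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    public static void main(String[] args) {
        String ya = "Я"; // U+042F

        printBytes(ya, StandardCharsets.UTF_8);           // d0 af
        printBytes(ya, Charset.forName("windows-1251"));  // cf
        printBytes(ya, StandardCharsets.UTF_16BE);        // 04 2f
    }

    // Print each byte of the encoded string in hex.
    static void printBytes(String s, Charset cs) {
        StringBuilder sb = new StringBuilder(cs.name() + ":");
        for (byte b : s.getBytes(cs)) {
            sb.append(String.format(" %02x", b & 0xFF));
        }
        System.out.println(sb);
    }
}
```

The same character maps to entirely different bytes depending on the charset, which is exactly why the raw bytes from `System.in.read()` only make sense once you know the encoding in use.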

Your results show that the process sending data to your Java program is using UTF-8 as an encoding.
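If you want whole characters rather than raw bytes, wrap the stream in an `InputStreamReader` with an explicit charset; its `read()` reassembles multi-byte sequences into `char` values. A minimal sketch, using a `ByteArrayInputStream` in place of `System.in` to feed in the two bytes the question observed:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    public static void main(String[] args) throws IOException {
        // The two bytes observed for 'Я' under UTF-8.
        byte[] input = { (byte) 0xD0, (byte) 0xAF };

        // InputStreamReader decodes the UTF-8 byte sequence into chars.
        InputStreamReader reader = new InputStreamReader(
                new ByteArrayInputStream(input), StandardCharsets.UTF_8);

        int c = reader.read();
        System.out.println(Integer.toString(c, 2)); // 10000101111 (U+042F)
    }
}
```

For real console input you would construct the reader over `System.in` with the charset your shell actually uses.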

erickson