I've noticed that when Node.js reads stdin from the Windows Console (conhost.exe) and you input a non-ASCII character, it arrives as the correct UTF-8 bytes, regardless of the active code page.

I've been testing with an emoji (😊), but you can try it with whatever you want.

Both programs below were run in cmd.exe. Example code in Node.js:

process.stdin.on("readable", () => {
    var input = process.stdin.read();
    if (input !== null) {
        console.log(input); // this will output the correct UTF-8 bytes, <Buffer f0 9f 98 8a 0d 0a>
        process.exit();
    }
});

Now the same test in Java:

import java.io.*;
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String s = r.readLine();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            System.out.println((int) c);
        }
    }
}

With the default code page (437), it outputs 63, 63 (two `?` characters), while with code page 65001 (UTF-8) it outputs 0, 0 (two NUL characters), which is even stranger.

I thought the Windows console (conhost) didn't support Unicode, but Node can at least read the bytes intact (although it can't display them as text). How does it do that, and is there a way to get this behaviour in Java?

f478ccf2
  • What happens when printing `s`? – dan1st Apr 16 '23 at 17:22
  • Oh boy, Unicode is even stranger than you think – Vasily Liaskovsky Apr 16 '23 at 17:27
  • Does this answer your question? [What is a "surrogate pair" in Java?](https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java) – Heiko Theißen Apr 16 '23 at 17:59
  • Kind of a TL;DR of Heiko's link: a `String` in Java represents text in UTF-16. UTF-16 uses two consecutive 16-bit code units when the Unicode code point is higher than `FFFF`, as is the case for your emoji (`1F60A`). This pair of code units is known as a surrogate pair (a high surrogate followed by a low surrogate). Finally, according to the docs of `String.charAt`, for a character encoded as a surrogate pair each call returns a single surrogate. A surrogate on its own doesn't represent any character; surrogate values are reserved and don't map to any symbol – Turkhan Badalov Apr 16 '23 at 18:07
  • @user16320675, for me it works normally with both characters you've mentioned. Perhaps it has something to do with your default system encoding (`env | grep LANG`)? If it is anything other than `UTF-8`, reading as `UTF-8` can lead to corrupted results – Turkhan Badalov Apr 16 '23 at 19:59
  • @TurkhanBadalov I'm on Windows – f478ccf2 Apr 17 '23 at 20:43
  • @f478ccf2, I am sorry – Turkhan Badalov Apr 18 '23 at 07:31

1 Answer

The Node.js code never decodes anything: with no encoding set, `process.stdin.read()` returns a `Buffer` of raw bytes, and you print those bytes right back out. (In the end, a process has 'standard in' and 'standard out', which don't have to be a keyboard and a screen; they can just as well be files. Both are defined as byte streams, not character streams.) That works regardless of the encoding setting.
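To compare like with like, here is a minimal sketch of a Java counterpart to the Node snippet: it reads stdin without any decoding and prints the bytes as hex. The class name `StdinHexDump` and the helper `toHex` are made up for this illustration. It only shows which bytes Java receives; it does not by itself fix what the console delivers.

```java
import java.io.IOException;

class StdinHexDump {
    // Format the first `length` bytes like Node prints a Buffer, e.g. "f0 9f 98 8a".
    static String toHex(byte[] bytes, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so negative byte values print as two hex digits.
            sb.append(String.format("%02x", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[256];
        // A single read() returns whatever input is currently available.
        int n = System.in.read(buffer);
        System.out.println(n < 0 ? "<end of input>" : toHex(buffer, n));
    }
}
```

If the bytes printed here are already wrong, the damage happened before any charset decoding in the Java program.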

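The Java code, on the other hand, decodes the incoming bytes into a string (via the `InputStreamReader`) and then prints the resulting chars as numbers. That decoding step fails unless the charset you pass matches the encoding of the bytes that actually arrive.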
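A small sketch of why the charset has to match, using the four bytes from the Node output (`f0 9f 98 8a`). The class name `DecodeDemo` is hypothetical, and `ISO_8859_1` is only a stand-in for "some non-UTF-8 charset"; it is not the console's code page 437.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // UTF-8 encoding of U+1F60A, as printed by Node (without the trailing CR LF).
    static final byte[] EMOJI_UTF8 = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x8A};

    // Decode the bytes with the given charset and return the resulting code points.
    static int[] decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset).codePoints().toArray();
    }

    public static void main(String[] args) {
        // Matching charset: a single code point, U+1F60A.
        for (int cp : decode(EMOJI_UTF8, StandardCharsets.UTF_8)) {
            System.out.printf("UTF-8      -> U+%04X%n", cp);
        }
        // Mismatched charset: four unrelated code points, one per byte.
        for (int cp : decode(EMOJI_UTF8, StandardCharsets.ISO_8859_1)) {
            System.out.printf("ISO-8859-1 -> U+%04X%n", cp);
        }
    }
}
```

The same bytes give one character or four, depending only on the charset handed to the decoder.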

In addition, you're running into a nitpicky detail about `char` in Java: it isn't actually a Unicode code point. You want `.codePointAt(0)`, not `.charAt(0)`. `charAt(0)` gives you only the first half of a surrogate pair: in Java, a `char` is a 16-bit UTF-16 code unit, which is enough for most characters but not for, e.g., emoji.
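To illustrate the difference, a short sketch that inspects a string holding U+1F60A (the emoji from the question) both by `char` and by code point. The class name `CodePointDemo` and the helper `codePointsOf` are hypothetical. The string is written with escapes so the source file encoding doesn't matter.

```java
class CodePointDemo {
    // Return the Unicode code points of a string, joining surrogate pairs.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // U+1F60A as a UTF-16 surrogate pair.
        String s = "\uD83D\uDE0A";

        System.out.println(s.length());          // 2 UTF-16 code units
        System.out.println((int) s.charAt(0));   // 55357 (0xD83D, high surrogate)
        System.out.println((int) s.charAt(1));   // 56842 (0xDE0A, low surrogate)
        System.out.println(s.codePointAt(0));    // 128522 (0x1F60A)

        // Iterate by code point rather than by char.
        for (int cp : codePointsOf(s)) {
            System.out.printf("U+%04X%n", cp);   // U+1F60A
        }
    }
}
```

A correctly decoded emoji therefore shows up as two `char` values in the question's loop, even though it is a single code point.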

rzwitserloot
  • Wonky system setting? That is not normal behaviour and not something I can reproduce on this machine. – rzwitserloot Apr 16 '23 at 21:55
  • As I said, cannot reproduce, so, it's not 'java cannot support that', as it clearly does, here. – rzwitserloot Apr 16 '23 at 22:17
  • I've tried reading the raw bytes via System.in.read, but it produces the same result. What's going on? System.in is an input stream after all – f478ccf2 Apr 20 '23 at 22:12