As you may know, InputStreamReader reads the provided InputStream and decodes its bytes into characters. If no charset is specified, it uses the default charset.

We can check this default charset with java.nio.charset.Charset.defaultCharset().displayName().
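A minimal sketch of that check, printing the default charset next to a few related system properties. The class name and the `prop` helper are made up for illustration. The property names `native.encoding` and `stdout.encoding` are an assumption about newer JDKs; on a JVM that does not define them, the sketch simply prints "(not set)".

```java
import java.nio.charset.Charset;

class ShowEncodings {
    // Returns the value of a system property, or "(not set)" when the JVM does not define it.
    static String prop(String name) {
        return System.getProperty(name, "(not set)");
    }

    public static void main(String[] args) {
        System.out.println("Charset.defaultCharset(): " + Charset.defaultCharset().name());
        System.out.println("file.encoding:            " + prop("file.encoding"));
        // Assumption: only newer JDKs define the next two properties.
        System.out.println("native.encoding:          " + prop("native.encoding"));
        System.out.println("stdout.encoding:          " + prop("stdout.encoding"));
    }
}
```

None of these values is guaranteed to match what the terminal actually sends, which is exactly the mismatch described in the cases that follow.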

Case 1. My Windows CMD uses cp850, but Java reports windows-1252. This can be verified by typing the character ó: System.in.read() reports 162, as expected for cp850. The InputStreamReader, though, fails to decode it, because it assumes windows-1252, and shows ¢ (byte 162 is ¢ in windows-1252).

Case 2. In Windows, my Netbeans integrated terminal uses windows-1252, but Java reports UTF-8. Again, it can be proven typing the character ó and System.in.read() will report 243, as expected. The InputStreamReader, though, will fail to decode it, as it expects to be running UTF-8, showing (code 65533).

Case 3. My Debian machine uses UTF-8 everywhere, in both GNOME and Netbeans terminals. When typing the character ó, System.in.read() reports two bytes, 195 and 179, which are the UTF-8 encoding of that character. The InputStreamReader shows ó as expected.
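The three cases can be reproduced without a terminal by decoding the same byte values with each charset. This is only a sketch: the class and method names are invented here, and it assumes the JDK ships the IBM850 (cp850) charset, which a stripped-down runtime might not. It prints code points rather than characters so the result does not depend on the output encoding. For reference, ó is U+00F3 (243), whose UTF-8 encoding is the byte pair 195, 179.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeCases {
    // Decodes raw byte values (0-255) with the given charset; malformed input becomes U+FFFD.
    static String decode(Charset charset, int... values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new String(bytes, charset);
    }

    // Prints the first code point of the decoded text.
    static void show(String label, String decoded) {
        System.out.println(label + " -> code point " + decoded.codePointAt(0));
    }

    public static void main(String[] args) {
        Charset cp850 = Charset.forName("IBM850"); // assumption: installed in this JDK
        Charset cp1252 = Charset.forName("windows-1252");

        // Case 1: the terminal sends cp850, Java decodes as windows-1252.
        show("162 as cp850       ", decode(cp850, 162));
        show("162 as windows-1252", decode(cp1252, 162));
        // Case 2: the terminal sends windows-1252, Java decodes as UTF-8.
        show("243 as windows-1252", decode(cp1252, 243));
        show("243 as UTF-8       ", decode(StandardCharsets.UTF_8, 243));
        // Case 3: the terminal sends UTF-8 and Java decodes as UTF-8.
        show("195 179 as UTF-8   ", decode(StandardCharsets.UTF_8, 195, 179));
    }
}
```

The bytes are read correctly in every case; only the charset chosen for decoding differs from the one the terminal used.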

What I want: is there a way to correctly detect the actual charset in use, so I can read characters from the command line (in Windows CMD and in Netbeans on Windows) without any special-casing?

Thank you very much.

Plan B: Case 2 can be solved by changing the Netbeans file encoding to UTF-8 (it will then handle UTF-8 files too, which is what an IDE should do in 2019). Case 1 could be solved by changing the code page to UTF-8, but I have not been able to make that work.

You may use the following program to test these cases. Enter the same characters twice and compare the output.

import java.io.*;
import java.nio.charset.Charset;

public class Prova2 {
    public static void main(String[] args) throws Exception {
        int b;

        System.out.println("Charset.defaultCharset: " + Charset.defaultCharset().displayName());
        System.out.println("I will read the next bytes: ");
        // Raw bytes, exactly as the terminal sent them. Stop at newline or end of stream (-1).
        while ((b = System.in.read()) != '\n' && b != -1) {
            System.out.println("I have read this byte: " + b + " (" + (char) b + ")");
        }
        System.out.println("I will read the next chars: ");
        // No charset argument: the bytes are decoded with the JVM default charset.
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        while ((b = br.read()) != '\n' && b != -1) {
            System.out.println("I have read this char: " + b + " (" + (char) b + ")");
        }
        System.out.println("Thank you.");
    }

}
nerestaren
  • Since `defaultCharset()` returns _"the default charset of this Java virtual machine"_, the way _"to correctly detect the actual charset used"_ would be to explicitly specify that when the JVM starts, but that doesn't seem to be straightforward. There are lots of SO questions on that (e.g. [Setting the default Java character encoding?](https://stackoverflow.com/q/361975/2985643)), but they're old, so it may be worth researching the current situation. Also see old JDK bug [JDK-5052844 : file.encoding parameter ignored on Intel Linux](https://bugs.java.com/bugdatabase/view_bug.do?bug_id=5052844). – skomisa Feb 20 '19 at 21:23
  • Well, yeah, ideally I would want that to be detected automatically. Thank you, though! – nerestaren Feb 20 '19 at 21:29

1 Answer

Is there a way to correctly detect the actual charset used so I can read characters from the command line without any special case?

On Windows you can detect (or even set) the code page used when reading characters from the command line using JNA. However, that is not necessary if using an alternative approach to obtain the console input:

  • Instead of reading from System.in, use System.console() to capture user input. This allows the submitted text to be processed as a String rather than as bytes or chars, which gives access to all the String methods for interpreting the console input as bytes, characters or UTF-8 data.
  • With this approach it is crucial to set a suitable code page before submitting input from the command line. For example, if submitting Russian characters then set the code page to 1251 using chcp 1251.

Getting user input can be achieved with just two lines of code with this approach:

Console console = System.console();
String userInput = console.readLine();
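For the detection route mentioned above, once a code page number is known (for example from the text that chcp prints, or from a native call), it still has to be turned into a Java charset. The sketch below shows one way to do that. The class and method names are invented for illustration, and it assumes the JDK accepts aliases of the form "cp" followed by the number (such as cp850 or cp1252), with 65001 treated as UTF-8. How the number is obtained is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodePages {
    // The last run of digits in the text, optionally followed by non-digits such as a period.
    private static final Pattern TRAILING_NUMBER = Pattern.compile("(\\d+)\\D*$");

    // Extracts the trailing number from chcp-style output, or -1 if there is none.
    static int parseCodePage(String chcpOutput) {
        Matcher m = TRAILING_NUMBER.matcher(chcpOutput.trim());
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    // Maps a Windows code page number to a Java charset, falling back to the JVM default.
    static Charset charsetForCodePage(int codePage) {
        if (codePage == 65001) {
            return StandardCharsets.UTF_8;
        }
        String alias = "cp" + codePage; // assumption: the JDK knows this alias
        if (codePage > 0 && Charset.isSupported(alias)) {
            return Charset.forName(alias);
        }
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        // Pass the chcp output as the first argument, or use the sample text.
        int codePage = parseCodePage(args.length > 0 ? args[0] : "Active code page: 850");
        System.out.println(codePage + " -> " + charsetForCodePage(codePage).name());
    }
}
```

The resulting Charset could then be passed to an InputStreamReader instead of relying on the default charset.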

Case 2. In Windows, my Netbeans integrated terminal uses windows-1252...

Don't waste time trying to get console input working in NetBeans. System.console() will return null, and its console can't be configured. I suspect that similar limitations exist in other IDEs. Testing within NetBeans provides no meaningful benefits anyway. Just focus on testing from the command line.
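If running inside an IDE cannot be avoided, one possible compromise is to use the console when it exists and otherwise fall back to System.in with an explicitly chosen charset. The following is a sketch under that assumption, not a tested recipe for NetBeans. The class name, the method name and the `input.charset` system property are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.Console;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class InputSource {
    // Reads one line, preferring the console's own reader; without a console,
    // the stream is decoded with the given fallback charset.
    static String readLineFrom(Console console, InputStream in, Charset fallback) {
        Reader source = (console != null) ? console.reader() : new InputStreamReader(in, fallback);
        try {
            return new BufferedReader(source).readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical property: e.g. -Dinput.charset=windows-1252 in the IDE run configuration.
        String name = System.getProperty("input.charset", Charset.defaultCharset().name());
        System.out.println("Enter some characters...");
        String line = readLineFrom(System.console(), System.in, Charset.forName(name));
        System.out.println("Read " + (line == null ? 0 : line.length()) + " chars");
    }
}
```

This still needs the fallback charset to be configured once per environment, so it reduces the special-casing rather than removing it.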

Case 2 can be solved by changing Netbeans file encoding to UTF-8...

Using the approach below, the project's Encoding setting doesn't matter. It will work whether the encoding is set to Windows-1252 or UTF-8.

Notes:

  • I only tested on Windows but the code should work on other platforms as long as the console environment is set up correctly. (Using chcp is specific to Windows as far as I know.)
  • Like you, I could not get chcp 65001 to work for Unicode input. Just focus on ensuring that the input can be read successfully using a suitable code page. For example, when testing with the characters mentioned in the OP (ó and ¢), any code page that supports those two characters will work: 437, 850, 1252, etc. If the application correctly displays the characters that were submitted, then everything will be fine (and vice versa).

Here's the code, which mostly consists of displaying the console input:

package prova3;

import java.io.Console;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class Prova3 {

    public static void main(String[] args) {

        // System.console() is null when no interactive console is attached (e.g. inside an IDE).
        Console console = System.console();
        if (console == null) {
            System.out.println("System.console() returned null.");
            System.out.println("If you are trying to run from within your IDE, use the command line instead.");
            return;
        }
        System.out.println("Enter some characters...");
        String userInput = console.readLine();
        if (userInput == null) { // end of input reached before a line was read
            return;
        }
        System.out.println("User input:  " + userInput + " [String length: " + userInput.length() + ", chars: " + userInput.toCharArray().length + ", bytes: " + userInput.getBytes(StandardCharsets.UTF_8).length + "]");
        System.out.println("codepoints:  " + userInput.codePoints().boxed().map(n -> "x" + Integer.toHexString(n) + " (" + n + ")").collect(Collectors.toList()).toString());
        System.out.println("UTF-8 bytes: " + getBytesList(userInput));
    }

    // Formats the UTF-8 bytes of each code point as hex, with code points separated by commas.
    static String getBytesList(String userInput) {
        StringBuilder byteList = new StringBuilder("[");
        // Step by code point rather than by char so that surrogate pairs are encoded whole.
        for (int i = 0; i < userInput.length(); i += Character.charCount(userInput.codePointAt(i))) {
            if (i > 0) {
                byteList.append(", ");
            }
            byte[] bytes = new String(Character.toChars(userInput.codePointAt(i))).getBytes(StandardCharsets.UTF_8);
            for (int j = 0; j < bytes.length; j++) {
                byteList.append(Character.forDigit((bytes[j] >> 4) & 0xF, 16))
                        .append(Character.forDigit((bytes[j] & 0xF), 16));
                if (j < bytes.length - 1) {
                    byteList.append(" ");
                }
            }
        }
        byteList.append("]");
        return byteList.toString();
    }
}

skomisa
  • Thank you for your answer. Unfortunately, not supporting Netbeans is *not* an option. This is going to be used in class, 1st year, and students struggle too much already by looking at a console inside their IDE. I would not dare to force them to use the CMD every time they need to run their programs. – nerestaren Feb 27 '19 at 09:22
  • @nerestaren OK, I'll take another look. I think it might be helpful to update your question though, to make it clear that the requirement to _"read characters from the command line without any special case"_ must be done from within NetBeans. (That was not clear to me.) – skomisa Feb 27 '17:31