Java incorrectly reading accented characters from System.in

Question

If you are facing the same problem and your character set is covered by the ANSI text encoding (code page 1252, which closely matches "ISO 8859-1"), you can use that encoding as a temporary workaround for the UTF-8 problem. UTF-8 remains the modern standard that covers every script, so it is still the better choice for full localisation.
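
As a rough sketch of that workaround, the program below assumes the console has been switched to code page 1252 (for example with `chcp 1252`) and uses Java's `windows-1252` charset for both streams. The class name `Cp1252Console` is made up for illustration, and whether that charset is available depends on the Java runtime in use.

```java
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.Scanner;

public class Cp1252Console {

    // Assumption: the console is on code page 1252, so both directions use it.
    static final Charset ANSI = Charset.forName("windows-1252");

    public static void main(String[] args) {
        // Encode output and decode input with the same single-byte charset.
        System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, ANSI));
        Scanner input = new Scanner(System.in, ANSI);

        if (input.hasNextLine()) {
            String line = input.nextLine();
            System.out.println(line);
            // Every character that windows-1252 covers is exactly one byte.
            System.out.println(Arrays.toString(line.getBytes(ANSI)));
        }
    }
}
```

Characters outside code page 1252 cannot be represented this way, which is why this only suits languages whose alphabet fits in that code page.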

I'm creating an application that has to read user input containing accented characters from the console. From what I've read online, modern consoles are capable of handling accented character outputs, and correctly encoding inputs, even though they show up as ? before sending the command.

PS C:\> echo ?
ü
PS C:\>

Note: this behaviour is not reproducible in Command Prompt. Command Prompt, when run in Windows Terminal, seems to display accented characters correctly before sending as well.

However, when running the following test code:

package com.test.outputtest;

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.nio.file.*;

public class OutputTest {

    public static void main(String[] args) {
        // Set I/O to use UTF-8
        System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, StandardCharsets.UTF_8));

        // Create the response listener
        Scanner input = new Scanner(System.in, StandardCharsets.UTF_8);

        System.out.println(Arrays.toString("èéëê".getBytes(StandardCharsets.UTF_8)));
        String temp = input.nextLine();
        System.out.println(Arrays.toString(temp.getBytes(StandardCharsets.UTF_8)));
    }

}

this is the output (after building the artifact "app.jar"):

PS C:\Users\[name]\Desktop\output_test> chcp 65001
Active code page: 65001
PS C:\Users\[name]\Desktop\output_test> java "-Dfile.encoding=UTF-8" -jar app.jar
[-61, -88, -61, -87, -61, -85, -61, -86]
èéëê
[0, 0, 0, 0]

The first array of bytes comes from the pre-written string; the second array holds the bytes of the string that was read from the input. The fact that echo outputs accents correctly leads me to believe that this is a compiler error, but I'm not sure how to fix it. I've also tried replacing the Scanner with Console, which gave me the same result.
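
One way to narrow this down is to look at the raw bytes that arrive on System.in before any Scanner or charset decoding touches them. The following is a minimal diagnostic sketch, not part of the original test program; the class name `StdinByteDump` and its helper methods are made up for illustration. If the dump already shows `00` bytes for the accented characters, the information was lost before Java's charset decoding ran.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StdinByteDump {

    // Formats bytes as unsigned two-digit hex values separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Reads raw bytes up to (not including) the first line feed or end of stream.
    static byte[] readLineBytes(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') { // skip carriage returns, e.g. from a Windows line ending
                line.write(b);
            }
        }
        return line.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Reference: the UTF-8 bytes of "èéëê", written with escapes so the
        // result does not depend on the source file encoding.
        byte[] expected = "\u00E8\u00E9\u00EB\u00EA".getBytes(StandardCharsets.UTF_8);
        System.out.println("expected: " + toHex(expected));

        // Type the same characters and press Enter.
        System.out.println("received: " + toHex(readLineBytes(System.in)));
    }
}
```

The "expected" line prints `C3 A8 C3 A9 C3 AB C3 AA`, the unsigned form of the signed values in the first array above.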

When running inside of IntelliJ, the ü is read completely normally when inputting it in the terminal. This is also a reason why I suspect a problem during compilation. When running with command prompt instead of PowerShell, the same error occurs.

Note: I'm using Windows Terminal running PowerShell and using IntelliJ Idea Community Edition 2021.3. I have not edited the .xml files besides the artifact building file path and some other project-specific file paths.

OS: Windows 10 build 19045.2728
Java version: 17.0.6 (Also in IntelliJ)
Default codepage: 850 (OEM)
Code page in which the error occurred: 65001 (UTF-8)

`-D"file.encoding=UTF=8"` is wrong, 1) it's UTF-8, not UTF=8, and 2) as far as I know the quotes should enclose the entire argument or be left out. That is `-Dfile.encoding=UTF-8` or `"-Dfile.encoding=UTF-8"` — Mark Rotteveel, Apr 04 '23 at 10:19
Sorry, the = was a typo. I've changed the quotes to enclose the entire argument, but the error persists. — ShadeOfLight, Apr 04 '23 at 10:53
Is the 65501 also a typo? Because on my system it results in _"Invalid code page"_, and the UTF-8 codepage is 65001, and 65501 is also not listed on https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers — Mark Rotteveel, Apr 04 '23 at 10:57
Yes, I'm terribly sorry. It should be 65001, which is UTF-8. — ShadeOfLight, Apr 04 '23 at 10:58
That said, I retract (and deleted) my previous comment, because I can reproduce the problem. — Mark Rotteveel, Apr 04 '23 at 11:02
Interestingly enough, when I don't use chcp 65001 (my system defaults to code page 437), it works fine and as expected. — Mark Rotteveel, Apr 04 '23 at 11:03
Very interesting indeed, but then the accented characters are not displayed correctly, even though they are not registered as 0. — ShadeOfLight, Apr 04 '23 at 11:30
`PS C:\> echo ? | Format-Hex` would be interesting. I'm going to take a guess that the hex would involve 0XFC, which is *not* Unicode (Don't change codepage when you do that) — g00se, Apr 04 '23 at 12:11
The output for ü as ? was 3F in position 0000 in codepage 65001, if you were curious. — ShadeOfLight, Apr 04 '23 at 13:18
It seems that by changing UTF-8 out in favour of ANSI, the characters are displayed correctly and the inputs are read correctly. However, this is far from a perfect solution for me, given the prototype I'm developing is supposed to be a language-learning tool. — ShadeOfLight, Apr 04 '23 at 14:46
List version number for Windows OS, PowerShell app, and Java. — Basil Bourque, Apr 04 '23 at 15:18
What happens when you run within IntelliJ, using its built-in console? Describe its character encoding setting, and its behavior. — Basil Bourque, Apr 04 '23 at 15:20
More information is needed to properly reproduce your issue. Update your question with: [1] Your version of Java when running in PowerShell (`java --version`). [2] Your version of Java when running within Intellij. [3] Per the previous comment from Basil, what happens when running within Intellij? [4] What happens when you run from Windows Terminal using _Command Prompt_ instead of _PowerShell_? [5] What is your _default_ code page. That is, what do you see if you open a new PowerShell window and just submit `chcp`? [6] The value returned by `chcp` when **echo ü** is rendered as **echo ?**. — skomisa, Apr 04 '23 at 16:11
I have added the information requested. Hopefully it is of some use to you. — ShadeOfLight, Apr 04 '23 at 16:59
*The output for ü as ? was 3F in position 0000 in codepage 65001* Ah no, that would be from entering a *literal question mark*. You started off with *modern consoles are capable of handling accented character outputs, and correctly encoding inputs, even though they show up as ?* So that was, I'm afraid, a wasted exercise. You were meant to be doing that with the 'real' character you wanted to examine. Of course, it *could* have been showing a real question mark as encoding had already failed before it even got to your app — g00se, Apr 04 '23 at 17:41
Interestingly enough, the ü is displayed correctly as ü prior to sending the command `echo ü` in Command Prompt. In PowerShell, codepage 65001 breaks the ü character prior to sending it for some reason. — ShadeOfLight, Apr 04 '23 at 18:36
@ShadeOfLight OK. I can reproduce your problem: works within Intellij, but fails from Windows Terminal when using the PS and cmd windows. A few minor differences in our environments: I am using Java 19 on Win 10 with IntelliJ IDEA 2023.1 (Ultimate Edition). Also, my default code page is 65001 so I don't need to use chcp. An interesting problem. — skomisa, Apr 04 '23 at 18:40
*in Command Prompt. In PowerShell, codepage 65001 breaks the ü...* What did I say? *Don't* change the codepage! — g00se, Apr 04 '23 at 21:13
@g00se I've added the results for codepage 1252 by editing my question. As for my system default, which is codepage 850, the output is again 3F in 0000, the same as codepage 65001. Apologies for my lazy skimming. — ShadeOfLight, Apr 04 '23 at 21:35

score 1 · Accepted Answer · answered Apr 04 '23 at 22:23

I can reproduce your problem, but I see nothing wrong with your code and I have no easy solution. Incredibly, it seems that even with the most recent versions of Java (18, 19, 20), reading UTF-8 characters from a Windows console remains problematic.

This is formally documented in JDK bug JDK-8295672, "Provide a better alternative to reading System.in", which is open and unresolved. It states (with my emphasis added):

Reading System.in is problematic as it is an input stream encoded in the host's encoding. With the JEP 400, there are cases where the default encoding (UTF-8) and host's native encoding differ. To read the bytes correctly, users would have to convert the bytes native-to-default, which seems to be an obstacle for basic usage. Providing a better means to access (w/o considering encoding stuff) would be appropriate.

So setting the default charset to UTF-8 does not resolve the issue because the "host's native encoding" is not UTF-8, and there is nothing you can do about that (at least with respect to cmd.exe and PowerShell on Windows).
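
In practice, converting "native-to-default" means decoding System.in with whatever charset the console reports rather than forcing UTF-8. The sketch below is one hedged way to do that on Java 17 or later, assuming the console is left on its default code page (a commenter on the question reports that input is read as expected without chcp 65001). The class name `NativeConsoleInput` and the method `consoleCharset` are illustrative names; `Console.charset()` and the `native.encoding` system property are both available from Java 17. It would not recover input in the code page 65001 session shown in the question, because zero bytes carry no information about the character that was typed.

```java
import java.io.Console;
import java.nio.charset.Charset;
import java.util.Scanner;

public class NativeConsoleInput {

    // Chooses a charset for System.in: the console's own charset if a console
    // is attached, else the native.encoding property, else the JVM default.
    static Charset consoleCharset() {
        Console console = System.console();
        if (console != null) {
            return console.charset(); // Java 17+
        }
        String nativeName = System.getProperty("native.encoding"); // Java 17+
        if (nativeName != null) {
            try {
                return Charset.forName(nativeName);
            } catch (IllegalArgumentException e) {
                // Unknown or malformed name: fall through to the default.
            }
        }
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        Charset charset = consoleCharset();
        System.out.println("Decoding System.in as " + charset);

        Scanner input = new Scanner(System.in, charset);
        if (input.hasNextLine()) {
            String line = input.nextLine();
            // Print code points so the result does not depend on how the
            // console renders the characters.
            line.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
            System.out.println();
        }
    }
}
```

Once the line is decoded into a Java String, the program can re-encode it as UTF-8 wherever it needs to, independently of the console's code page.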

Notes:

My understanding is that this is only an issue on Windows. Linux and Mac handle UTF-8 input without problems.
A possible workaround is to use JNA (Java Native Access) methods to read the console input instead of using a Scanner. See "How do I read the contents from an open Windows Console (Command Prompt) using Java Native Access" to help get you started. Also see the Javadoc for JNA's WinCon interface, especially ReadConsoleInput().
Although it won't resolve your problem, you might consider upgrading to a more recent version of Java (18, 19 or 20) because of the implementation of JEP 400: UTF-8 by Default in Java 18. This was one of the goals of JEP 400 (with my emphasis added):

Standardize on UTF-8 throughout the standard Java APIs, except for console I/O.

Presumably console I/O was excluded in JEP 400 because of the "host's encoding" issue mentioned above.
An obvious question is why your code works when run within IntelliJ. I suspect that is because JetBrains uses JNA to read the input from their console, but that's just a guess.

Thank you for your elaborate research! Since I'm not experienced at all with Java or programming for that matter, I've got one last question. If I use JDK 18, will a user running the jarfile have to as well? I assume the user could just use JRE 8 or 9, right? — ShadeOfLight, Apr 05 '23 at 10:42
@ShadeOfLight No, the user couldn't use JRE 8/9. That wouldn't work unless you also developed your application using Java 8/9, which would not be a good idea. Deploying Java applications to other users is not straightforward, and there are several issues to consider: Whether there is any Java environment on the target machine, and if so, what, the O/S of the target machine, etc. I can only suggest that you research further, and create a new question here if you have a _specific_ concern. As an introduction, [this might help](https://dzone.com/articles/the-skinny-on-fat-thin-hollow-and-uber). — skomisa, Apr 06 '23 at 03:46
