8

I have the following code

import java.util.Arrays;

public class MainDefault {
        public static void main (String[] args) {
                System.out.println("²³");
                System.out.println(Arrays.toString("²³".getBytes()));
        }
}

But I can't seem to print the special characters to the console.

When I do the following, I get the following result

$ javac MainDefault.java
$ java MainDefault

MainDefaultPrinting

On the other hand, when I compile it and run it like this

$ javac -encoding UTF8 MainDefault.java
$ java MainDefault

MainDefaultUTF8CompilationOnly

And when I run it using the file encoding UTF8 flag, I get the following

$ java -Dfile.encoding=UTF8 MainDefault

MainDefaultUTF8CompilationAndRun

It doesn't seem to be a problem with the console (Git Bash on Windows 10), as it prints the characters normally

Echo

Thanks for your help

Yassin Hajaj
  • Maybe [this](https://stackoverflow.com/questions/48402025/unicode-output-java-windows-cmd) or [this one](https://stackoverflow.com/questions/2168350/java-charset-problem-on-linux) (I tried this from IntelliJ and saw the correct output) – Gryphon Sep 02 '20 at 19:16
  • 1
    The sequence of numbers that comprise the string -- -62,-78,-62,-77 -- are (as unsigned bytes) 0xC2,0xB2,0xC2,0xB3. These are the CP437 values for the ASCII box characters that appear in the screenshots. These values probably appear in other character sets as well, but not in UTF-8 or even ISO-8859-1. It looks as if either the file being compiled is not UTF-8, or the terminal displaying the output is not set up to display UTF-8. If the problem is in the encoding of the file, then System.out.println("\u00B2\u00B3") should produce the correct output, as these are the Unicode escapes for ²³ – Kevin Boone Sep 02 '20 at 21:21
  • I get the expected output on Mac and also on [Git Bash for Mac](https://github.com/fabriziocucci/git-bash-for-mac). Probably, it's a problem with Windows. – Arvind Kumar Avinash Sep 05 '20 at 11:22

8 Answers

12

Your code is not printing the right characters in the console because your Java program and the console are using different character sets, that is, different encodings.

If you want to obtain the same characters, you first need to determine which character sets are in place.

This process will depend on the "console" in which you are outputting your results.

If you are working with Windows and cmd, as @RickJames suggested, you can use the chcp command to determine the active code page.

Oracle documents the full list of encodings supported by Java, along with their aliases (code pages, in this case), on this page.

This stackoverflow answer also provides some guidance about the mapping between Windows Code Pages and Java charsets.

As you can see in the provided links, the code page for UTF-8 is 65001.

If you are using Git Bash (MinTTY), you can follow @kriegaex's instructions to verify or configure UTF-8 as the terminal emulator encoding.

Linux and UNIX, or UNIX derived systems like Mac OS, do not use code page identifiers, but locales. The locale information can vary between systems, but you can either use the locale command or try to inspect the LC_* system variables to find the required information.

This is the output of the locale command in my system:

LANG="es_ES.UTF-8"
LC_COLLATE="es_ES.UTF-8"
LC_CTYPE="es_ES.UTF-8"
LC_MESSAGES="es_ES.UTF-8"
LC_MONETARY="es_ES.UTF-8"
LC_NUMERIC="es_ES.UTF-8"
LC_TIME="es_ES.UTF-8"
LC_ALL=

Once you know this information, you need to run your Java program with the file.encoding VM option corresponding to the right charset:

java -Dfile.encoding=UTF8 MainDefault

Some classes, like PrintStream or PrintWriter, allow you to specify the Charset in which the output will be written.
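To illustrate the PrintStream option, here is a minimal sketch. The class name Utf8Out and the helper utf8Stream are made up for this example and are not part of the question's code. It wraps a stream in a PrintStream that always encodes as UTF-8, so the bytes written no longer depend on file.encoding. Whether the characters are then displayed correctly still depends on the console being set to UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper class, introduced only for illustration.
class Utf8Out {
    // Wraps any stream in an auto-flushing PrintStream that encodes text as UTF-8.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        out.println("\u00B2\u00B3"); // Unicode escapes for the two superscript characters
    }
}
```

Using Unicode escapes in the literal keeps the example independent of the source file encoding, so only the output side is being tested.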

The -encoding javac option only allows you to specify the character encoding used by source files.

If you are using Windows with Git Bash, consider also reading @rmunge's answer: it describes a possible bug in the tool that may be the cause of the problem and that prevents the terminal from working correctly out of the box, without manual encoding adjustments.

jccampanero
  • Hi JCCampanero, this helped me find the correct answer so I'll consider it as the valid one. What I did to print the characters to the console was to use `chcp.com 65001` and then ran my scripts again, and it worked :D – Yassin Hajaj Sep 12 '20 at 07:19
  • 1
    Thank you very much @YassinHajaj! I am very happy to know that the answer was helpful. – jccampanero Sep 12 '20 at 08:46
  • @YassinHajaj please note that chcp only changes the OEM codepage of the running console. After closing and opening the console again the codepage will be set back to default. – rmunge Sep 13 '20 at 16:57
  • This answer does not explain the reason for the described behavior. It only gives some general hints for workarounds that shouldn't be required. – rmunge Sep 13 '20 at 17:10
  • I am sorry you think the answer is not correct @rmunge. In it, I just tried to explain that. regardless of the output system (terminal, console, etc.), the problem arises from using a different character set in Java and on that output system, and tried to help the user how he can solve the problem or avoid it. – jccampanero Sep 13 '20 at 17:33
  • @jccampanero I didn't write that your answer is wrong. But changing the codepage and overriding file.encoding to UTF-8 shouldn't be necessary if the used Git Bash version would work correctly. The root cause of the problem is a bug in Git for Windows, resp. MSYS2 (see my latest answer to the question for details). You are describing a valid temporal workaround but since changing the codepage through chcp does only impacts current console it is just a workaround and not a real solution. – rmunge Sep 13 '20 at 18:29
  • @jccampanero don't get me wrong. your answer is not wrong. i just want to avoid that others who are facing the same issue start playing around with codepages and charsets after reading the accepted answer although an update of git for windows could solve the problem easily. please check my second answer to this question, maybe you could also add a small hint to your accepted answer. then i would of course upvote your answer again – rmunge Sep 13 '20 at 18:41
  • 1
    @rmunge Please, do not worry. In fact, your answer is very well documented and provide a great background information. I upvoted it and updated my answer with a reference to it. At the end, the only important thing is that the answer is as useful as possible to the users. – jccampanero Sep 14 '20 at 12:40
  • Thanks @jccampanero. Collaborative answering at its best :-) – rmunge Sep 14 '20 at 16:16
5

I am also using Git Bash on Windows 10 and it works totally fine for me.

Here's how it prints,

Trying to reproduce it in Git Bash on Windows 10

The terminal version is mintty 3.0.2 (x86_64-pc-msys) and my text properties were:

enter image description here

So, I tried to reproduce your outputs by changing Character Sets;

enter image description here

By setting Character Set to CP437 (OEM codepage) (note that this automatically changed Locale to C too), I was able to get the same output as you.

enter image description here

Then, after I changed it back to UTF-8 (Unicode), I got the expected output!

enter image description here

Therefore, it is clear that the problem is with your console's Character Set.

Tharindu Sathischandra
  • This is basically a duplicate of my own answer from a few days ago, even the screenshots are the same. This one is just more verbose. – kriegaex Sep 11 '20 at 04:00
  • @kriegaex I wanted reproduce and resolve the problem, so then we could clearly understand where the fault is. So, I posted what I did. – Tharindu Sathischandra Sep 11 '20 at 13:53
5

The short version:

The unexpected behavior is reproducible with the following setup:

  • Windows 10 with English, German or French language, or any other language that leads to ANSI and OEM codepages that encode ² and ³ differently

  • Git for Windows 2.27.0 (installed with default setting i.e. configured to use MinTTY and experimental support for pseudo consoles disabled)

  • Source code is stored in UTF-8 encoding

To get correct behavior:

  • Either re-install Git for Windows 2.27.0 and enable experimental support for pseudo consoles on the last page of the installer, or upgrade to the latest 2.28 version

  • Compile your code with javac -encoding UTF8

  • Call java without overriding file.encoding

The medium version:

Git for Windows 2.27.0 uses a version of MSYS2 that does not set the code page for MinTTY by calling SetConsoleCP when support for pseudo consoles is disabled. The Java runtime determines the codepage for System.out by calling GetConsoleCP. Since no codepage is set when Java is executed within the MinTTY terminal, the call fails and Java uses the charset returned by Charset.defaultCharset() as a fallback. But in a Windows installation as described above, Charset.defaultCharset() returns cp-1252 while the default charset for consoles is cp-850. The two codepages are not fully compatible. This leads to the strange output.

The long version:

Windows has two types of codepages: ANSI and OEM codepages. The first type is intended for UI applications that do not support Unicode and the latter is used for console applications. Both types encode a single character in 1 byte but they are not fully compatible.

Therefore on Windows Java has to deal with two charsets instead of one:

  • Charset.defaultCharset() returns the ANSI codepage (usually cp-1252). This charset is specified by the file.encoding system property. If not specified as VM argument, the java executable determines the ANSI codepage and adds the system property during initialization. String.getBytes() uses the charset returned by Charset.defaultCharset().
  • System.out uses the OEM codepage for consoles (usually cp-850). The java executable gets this codepage by calling the GetConsoleCP function and sets it as the value of the internal system properties sun.stdout.encoding and sun.stderr.encoding. When the call to GetConsoleCP fails, the charset returned by Charset.defaultCharset() is used. This only happens when the console in which java.exe is executed hasn't set the OEM codepage beforehand by calling SetConsoleCP.
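To see which charsets are actually in effect on a given machine, a small diagnostic along the following lines may help. This is only a sketch: the class name CharsetReport and the helper prop are invented for the example, and sun.stdout.encoding and sun.stderr.encoding are internal properties that may be unset, for example when output is redirected or on other JDK versions.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class, introduced only for illustration.
class CharsetReport {
    // Returns the value of a system property, or "(not set)" if it is absent.
    static String prop(String name) {
        String value = System.getProperty(name);
        return value == null ? "(not set)" : value;
    }

    public static void main(String[] args) {
        // Charset used by String.getBytes() when no charset is passed.
        System.out.println("Charset.defaultCharset() = " + Charset.defaultCharset().name());
        System.out.println("file.encoding            = " + prop("file.encoding"));
        // Internal properties; they may legitimately be missing.
        System.out.println("sun.stdout.encoding      = " + prop("sun.stdout.encoding"));
        System.out.println("sun.stderr.encoding      = " + prop("sun.stderr.encoding"));
    }
}
```

Running it once in Git Bash and once in cmd.exe, without any extra VM arguments, shows whether the two environments lead the JVM to different charsets.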

So what happens now in the setup mentioned above?

$ javac MainDefault.java
$ java MainDefault

enter image description here

The native call of GetConsoleCP fails due to the bug in MSYS2. Therefore System.out falls back to the charset returned by Charset.defaultCharset() which is cp-1252. But the OEM codepage of the console is cp-850. Therefore System.out.println("²³") produces unexpected output.

The source code is stored in UTF-8. Encoding "²³" in UTF-8 requires 4 bytes. But because the -encoding parameter is missing, javac assumes the default encoding, which uses one byte per character. Therefore it interprets the 4 bytes as 4 characters. String.getBytes uses the 1-byte-based ANSI code page cp-1252 and therefore returns 4 bytes.
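The misreading can be reproduced on any platform by doing the wrong decoding explicitly. In this sketch (the class name MisreadSource and the helper readAsCp1252 are hypothetical), the UTF-8 bytes of the literal are decoded as windows-1252, which is assumed here to be the default encoding javac falls back to on the setup described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, introduced only for illustration.
class MisreadSource {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decodes the UTF-8 bytes of a literal as if the file were cp-1252.
    static String readAsCp1252(String literal) {
        byte[] utf8 = literal.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, CP1252);
    }

    public static void main(String[] args) {
        String misread = readAsCp1252("\u00B2\u00B3");
        System.out.println(misread.length());                          // 4
        System.out.println(Arrays.toString(misread.getBytes(CP1252))); // [-62, -78, -62, -77]
    }
}
```

The four bytes are the same values quoted in the comments on the question, which is consistent with the source file having been read with the wrong encoding.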

$ javac -encoding UTF8 MainDefault.java
$ java MainDefault

enter image description here

With the -encoding UTF8 parameter javac interprets the UTF-8 encoded source as UTF-8. So the 4 bytes of "²³" are correctly recognized as two characters. System.out encodes the two characters in cp-1252, which leads to 2 bytes. But since the console still uses cp-850, the output is still corrupted. String.getBytes also encodes the two characters in cp-1252, which leads to 2 bytes.
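The mismatch between the two code pages can be simulated without a Windows console. The sketch below (ConsoleMismatch and asShownByConsole are hypothetical names) encodes the text as cp-1252 and decodes the resulting bytes as cp-850, assuming both charsets are available in the running JDK under the names windows-1252 and IBM850.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, introduced only for illustration.
class ConsoleMismatch {
    static final Charset CP1252 = Charset.forName("windows-1252");
    static final Charset CP850 = Charset.forName("IBM850");

    // Encodes text as cp-1252 and decodes the bytes the way a cp-850 console would.
    static String asShownByConsole(String text) {
        return new String(text.getBytes(CP1252), CP850);
    }

    public static void main(String[] args) {
        String shown = asShownByConsole("\u00B2\u00B3");
        // Print code points instead of glyphs so the result does not depend on the console.
        for (char c : shown.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        // Prints U+2593 (dark shade) and U+2502 (box-drawing vertical line).
    }
}
```

Both characters exist in cp-850, but at byte values 0xFD and 0xFC rather than the 0xB2 and 0xB3 that cp-1252 uses, which is why block and line-drawing characters appear instead.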

$ java -Dfile.encoding=UTF8 MainDefault

enter image description here

The system property file.encoding overrides the charset returned by Charset.defaultCharset(), which is also used by String.getBytes(). The two characters, which were first wrongly interpreted by javac as 4 characters in 8-bit encoding, are now correctly encoded in UTF-8 as two characters with two bytes per character. This leads to 4 bytes. Since file.encoding does not have any effect on the charset that is used by System.out, the 4 (and not 2, due to the wrong interpretation of javac) characters are still encoded in cp-1252, the console still uses cp-850, and you still get corrupted output.

enter image description here

Your console can print ²³ since the console's 8-bit OEM code page (cp-850) supports both characters. But it encodes them slightly differently than the ANSI code page cp-1252 that is used by System.out ;-)

rmunge
  • Hello @rmunge, thank you for the extended answer, but I swear that using chcp.com 65001 fixed it for me, it might be interesting to add it there – Yassin Hajaj Sep 14 '20 at 10:37
  • 1
    @YassinHajaj using chcp.com 65001 changes the OEM codepage to UTF-8, -Dfile.encoding changes the charset returned by Charset.defaultCharset() also to UTF-8. Due to the bug in Git Bash the charset of System.out also falls back to UTF-8. Since the console now also uses UTF-8, everything works and it is a valid workaround. The workaround will also work with newer Git Bash versions that do not contain the bug, but be aware that you have to execute chcp.com on every new console before you execute java ;-) – rmunge Sep 14 '20 at 16:12
  • @rmunge, for me the default code page after upgrading to 2.28 is US-ASCII (ID 20127, also confirmed by chcp.com). Experimental console support is enabled. So for me your workaround does not work as described, I need a combination of UTF-8 mintty in settings, `chcp.com 850` (how strange!) in Git Bash and optionally `-Dfile.encoding=UTF8` (actually has no effect, also super strange). My Windows 10 is German, BTW. Now I regret having upgraded my older Git version after getting curious about your solution. Before, my own solution worked just fine, now it no longer does. Git/MSYS2 is buggy now! – kriegaex Sep 15 '20 at 04:03
  • `chcp.com 65001` also works, but without chcp the Java output is always wrong. Just in case you are involved in MSYS2 or Git for Windows, do you know if this is being actively worked on and maybe fixed anytime soon? Otherwise I might downgrade my Git again. – kriegaex Sep 15 '20 at 04:09
  • Okay, I just downgraded to Git 2.15.1.windows.2 (I still had the old installer from 2017 in my downloads folder) and everything is fine again, my own solution works and I did not notice any immediate problems after the forced downgrade. – kriegaex Sep 15 '20 at 04:13
  • @kriegaex I also have a Windows 10 with German language and my default code page is 850. cp-20127 is actually a 7-Bit US-ASCII, that's really strange. What does chcp.com return when you execute it within cmd.exe? Maybe it's a general configuration issue with your Windows 10. Windows 10 German should have code page 850 as default. When your console codepage only supports 7-Bit ASCII then it is of course impossible to write any Unicode character that are not covered by US-ASCII. – rmunge Sep 15 '20 at 14:35
  • @kriegaex As already explained in my answer: -Dfile.encoding=UTF8 actually does not have any impact on the charset that is used by System.out on a Windows OS. And no, I'm not involved in the development of Git for Windows. But I also don't think that your issue directly relates to Git Bash. Sounds more like a general issue with the default OEM codepage. So the output of chcp within the Windows console (cmd.exe) would be interesting. – rmunge Sep 15 '20 at 14:38
  • `-Dfile.encoding` **does** have an impact on the output seen on Git Bash in my case (now reverted to Git 2.15 where it works normally, like I said). Like I also already said, I know 20127 is US-ASCII. In my normal CMD I call `chcp 65001`, I put that into the registry for auto-start. I had this setting for years. So it is also UTF-8. In Git Bash 2.15 `chcp.com` shows 850, in 2.28 it shows 20127. In current Git versions console handling is simply broken, your workaround (which doesn't work as described for me) is not my way to go but a downgrade, which works beautifully. – kriegaex Sep 16 '20 at 00:22
4

The hex codes look okay for UTF-8. Maybe your character set for Git Bash is not UTF-8. For me it looks like this:

Text and font settings for mintty (Git Bash)

The console output then also looks fine:

Console output UTF-8


Update 2020-09-13: Here is proof that chcp.com <codepage> does not work in Git Bash (mintty). It has no effect whatsoever. You really do have to select the correct codepage in the mintty settings dialogue.

screen recording of Git Bash mintty


Update 2020-09-15: Okay, after I read @rmunge's answer I upgraded to Git 2.28 and could reproduce the OP's problem and also use the chcp workaround (it did not work as described by @rmunge in my case). Because Git (or MSYS2, respectively) are so buggy in the latest versions and I don't wish to use chcp.com from inside Git Bash every time I open a new console, I just downgraded to version 2.15.1 which I had used for 3 years without any problems before. Maybe there are later versions without the console bug, I did not try but just use my old installer from the downloads folder on my computer. I recommend everyone to do the same and now work around this ugly bug. With a non-buggy console version, it just works like I described.

kriegaex
  • Thank you so much for your time but this does not solve it unfortunately – Yassin Hajaj Sep 12 '20 at 07:20
  • You are being unfair! The solution you accepted is for CMD, but the screenshots you posted are from Git Bash (which you are also mentioning in the text) with mintty terminal, I can see it because the command prompt is from a Bash-like shell. The chcp command is completely useless for mintty and Git Bash. For your question my answer is correct, the accepted answer is for CMD only. How can you accept a question and award 500 points for something answering a question you didn't ask? – kriegaex Sep 12 '20 at 09:39
  • Hi @kriegarx... chcp.com worked on git bash, i dont really understand your point.. it works and I award the bounty to the answer thats the closest to reality, his answer is that one.. no need to get fully emotional, those are virtual points.. – Yassin Hajaj Sep 12 '20 at 10:44
  • No, it doesn't. I just re-tested it. `chcp.com 65001` has no effect on mintty Git Bash. Even though a subsequent `chcp.com` displays "65001", the output is wrong. The result only looks correct when in the mintty console settings you select UTF-8. If you like, I can post a screen video proving it. If you claim otherwise, please post a screen video proving what you say. It is just not true! – kriegaex Sep 13 '20 at 04:38
  • **Update:** I just added a screen recording (animated GIF) to my answer. It proves that my answer is correct. Like I said, `chcp.com` only works in CMD console. Even the guy who answered the question confirms that and points to my answer for Git Bash. You thanked him for `chcp 65001` even though you never asked about CMD but explicitly about Git Bash. – kriegaex Sep 13 '20 at 05:02
  • I promise Ill verify what youre saying once I open my second PC. – Yassin Hajaj Sep 13 '20 at 09:26
  • I verified, and it turns out chcp.com fixed it.. Nothing else – Yassin Hajaj Sep 14 '20 at 10:35
  • I posted a video proving what I said. You write one sentence without any proof, so I don't believe you. Where is your screen video comparable to mine? – kriegaex Sep 14 '20 at 16:12
  • Okay, because of [this new answer](https://stackoverflow.com/a/63872923/1082681) I became curious and upgraded to the latest Git version. I was using an older one behaving as described here. I never had problems with it. The new one 2.28.0 is broken, for me even more broken than described in the other answer. So there are now bugs which did not exist in my older Git version. I assume that you also use 2.27 or 2.28, hence you have the problems described here and need to abuse chcp.com which actually ought not have any effect on mintty. Probably it does due to the experimental console support. – kriegaex Sep 15 '20 at 03:58
  • Please also note my latest update in the answer above. My recommendation is a Git downgrade to a version without console bugs in MSYS2 instead of ugly workarounds, if you do not depend on the very latest bleeding-edge Git features (I surely don't, 2.15 is fine for me). – kriegaex Sep 15 '20 at 04:20
1

On Windows, it has to do with your code page. You can use the chcp command to set the code page you want (e.g. if you want to set it up for a specific program being launched), or you can specify the charset corresponding to the codepage on the java command line.

If the current codepage does not support the characters you are printing, you will see garbage in the console.

The reason why different shells may behave differently is due to the codepage/charsets that are loaded by default.

Please check out this SO post for how it is done: System.out character encoding

vvg
1

I encountered the same problem in Git Bash for Windows. java and javac cannot print Chinese characters properly. Setting Git Bash's character set to UTF-8 does not help. chcp does not work either. From Git Bash's installation wizard, I knew that programs like python do not work properly without winpty. I had added alias python='winpty python' to ~/.bashrc. So I tried winpty java Foo.java and winpty javac Foo.java, and luckily the problem was gone. I added the aliases to ~/.bashrc to fix the problem:

alias java='winpty java'
alias javac='winpty javac'

Recent versions (v2.2x) of Git Bash for Windows include an experimental feature related to winpty, but it still seems to have some problems, so I've kept these aliases so far.

ElpieKay
0

Hex C2B2 C2B3, when interpreted as UTF-8 is ²³.
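This can be checked with a few lines of Java. The sketch below (HexToText and toHex are hypothetical names) takes the signed byte values quoted in the comments on the question, prints them as unsigned hex, and decodes them as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, introduced only for illustration.
class HexToText {
    // Converts Java's signed bytes to their unsigned two-digit hex form.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] printed = {-62, -78, -62, -77}; // values quoted in the comments on the question
        System.out.println(toHex(printed)); // C2B2C2B3
        String decoded = new String(printed, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("\u00B2\u00B3")); // true
    }
}
```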

I assume you are using a Windows "cmd terminal"?

The command "chcp" controls the "code page". chcp 65001 provides utf8, but it needs a special charset installed, too. To set the font in the console window: Right-click on the title of the window → Properties → Font → pick Lucida Console

Rick James
  • Both the screenshots and the OP's own words tell you that he is using Git Bash, not cmd.exe. ;-) – kriegaex Sep 05 '20 at 11:05
  • Bash is a scripting language; cmd is the rendering app. They are separate animals. (You need both.) – Rick James Sep 05 '20 at 21:10
  • 1
    You want to nitpick here? Then I will, too: Both Cmd and Bash are shells (command processors). You can start Bash from Cmd and vice versa. If you start Git Bash on Windows by just clicking the icon, a mintty terminal emulator window in a separate process is automatically started, just like when you click on the Windows terminal icon it also starts a conhost.exe for the terminal. If you open subshells, no additional terminal processes will be started. See also my answer here with mintty settings screenshots. Of course starting bash.exe from Windows terminal is possible, but a rare use case. – kriegaex Sep 06 '20 at 04:30
0

Please verify that your Windows 10 installation does not have Unicode UTF-8 support enabled. You can see this option by going to Settings and then: All Settings -> Time & Language -> Language -> "Administrative Language Settings"

This is what it looks like - the feature should be unchecked.

enter image description here

Rationale:

"²³".getBytes() returns the encoding of the string, based on the detected default charset. On a Windows 10 system the default charset should usually be a 1-Byte based encoding, independent from whether you launch java.exe from a Windows console or from Git Bash. But your first screenshot shows a 4-Byte encoding that is actually UTF-8. So your JVM seems to detect UTF-8 as the wrong default charset that is incompatible with the codepage of your console.

Your console can print ²³ because both characters are supported by the used code page, but the encoding is based on one byte per character while UTF-8 encoding requires 2 Bytes for each of these two characters.

I have no simple explanation for your second screenshot, but be aware that Git Bash is based on MSYS2, which in turn uses the mintty terminal emulator. While MSYS2 uses UTF-8, and mintty also seems to support UTF-8, the whole thing is wrapped within a Windows console that is based on an OEM codepage that is incompatible with UTF-8. The whole thing then runs on an operating system that internally uses UTF-16. Combined with a beta setting that overrules the whole OEM codepage concept at the OS level, this setup provides enough complexity for some incomprehensible behavior.

rmunge
  • Thank you so much for your time but this does not solve it unfortunately – Yassin Hajaj Sep 12 '20 at 07:20
  • 1
    Too bad. :-( Please extend your code with *System.out.println(Charset.defaultCharset().name());* and share the output when you execute it in Git Bash and cmd.exe (without specifying any additional VM args). Would be interesting to know if my first assumption about the wrong default charset is right and whether there's a difference between Git Bash and cmd.exe. – rmunge Sep 12 '20 at 11:23
  • @YassinHajaj how do you exactly start your Git Bash? Do you click on the "Git Bash" shortcut or do you execute git-bash.exe or even bash.exe directly? What java version do you use? – rmunge Sep 12 '20 at 11:33
  • Hey @rmunge, to give you a bit more context, the popup contained French & the box was unticked. I open git bash by right clicking in Explorer and cliking on "Open Git Bash" or sometimes I click on the windows start (bottom left) and type "Git.." and then enter because Git bash comes first – Yassin Hajaj Sep 12 '20 at 13:09
  • About the Sysout, I'll definitely provide you with this info soon, because it's on another PC that's not on ATM, once I have it on, I'll let you know – Yassin Hajaj Sep 12 '20 at 13:10
  • @YassinHajaj after digging a bit in JVM native code I think I have found the root cause. Please see the new answer. Hope that finally nails it down. Was a nice weekend puzzle ;-) – rmunge Sep 13 '20 at 16:20
  • Hi @rmunge, thank you very much for the effort, BTW I have printed the required info to the console if you want to take a look at. Familiar on how to create a chat here? – Yassin Hajaj Sep 14 '20 at 10:42