Multiple Unicode blocks erroring

Question

I've been working on a small tool to help me with some maths and display the working. It is going well, but for the console output I need to display Unicode superscript and subscript characters. I initially set it up with the following function for superscript:

public static String getBase( int num ){
    String uniStr = "\\u207";
    String numStr = Integer.toString(num);
    String res = "";

    for( int i = 0; i < numStr.length(); i++ ){ 
        String s = uniStr + numStr.charAt(i);
        char c = (char) Integer.parseInt( s.substring(2), 16 );
        res += c;
    }
    return res;
}

That worked fine to an extent, but when using the following call to debug:

System.out.println(Unicode.getBase(1234567890));

I got the output:

ⁱ⁲⁳⁴⁵⁶⁷⁸⁹⁰

due to the superscripts for 1, 2 and 3 having very different code points (U+00B9, U+00B2 and U+00B3) in the Latin-1 Supplement block, rather than sitting in the Superscripts and Subscripts block with the other digits. So I added a switch statement to handle these three specifically, resulting in:

public static String getBase( int num ){
    String uniStr = "\\u207";
    String numStr = Integer.toString(num);
    String res = "";
    for( int i = 0; i < numStr.length(); i++ ){ 

        String s = "";
        switch(numStr.charAt(i))
        {
        case '1':
            s = "\\u00B9";
            break;
        case '2':
            s = "\\u00B2";
            break;
        case '3':
            s = "\\u00B3";
            break;
        default:    
            s = uniStr + numStr.charAt(i);
        }

        char c = (char) Integer.parseInt( s.substring(2), 16 );
        res += c;
    }
    return res;
}

Now the output should be:

¹²³⁴⁵⁶⁷⁸⁹⁰

but only 1, 2 and 3 display in the console; 4 through 0 all show the invalid-character box, like:

¹²³ࢆࢆࢆࢆࢆࢆࢆ

I know the switch works, as 1, 2 and 3 all display correctly, and parsing the string for the other characters works too, yet this still happens. For the life of me I cannot find a solution, or even a reason for this. If I use characters from one Unicode block, is it trying to grab all further characters from that block as well, and if so, is there anything I can do about it? That is the only likely cause I can think of; otherwise I'm well and truly stumped. Any and all help would be hugely appreciated.

P.S. I have my run configs in Eclipse set to UTF-8 and all these characters are supported

Possible duplicate of [Printing out unicode from Java code issue in windows console](http://stackoverflow.com/questions/20386335/printing-out-unicode-from-java-code-issue-in-windows-console) — phuclv, Mar 17 '16 at 08:14
the problem is Windows console Unicode support, not Java. [Unicode input in a console application in Java](http://stackoverflow.com/q/8669056/995714) — phuclv, Mar 17 '16 at 08:14

score 0 · Answer 1 · edited May 23 '17 at 11:45

To anyone curious, "Lưu Vĩnh Phúc" was correct in that it's an issue with Windows console being rather... unkind in terms of unicode. There's certainly temp-fixes available but nothing particularly pleasant.
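A quick way to confirm that the string itself is correct and only the console rendering is off is to print the code points instead of the glyphs. The following is a minimal sketch, not the code from the question: the class name SuperscriptDemo, the lookup-table approach and the handling of a minus sign are my own assumptions. It swaps the hex-string parsing for a table indexed by digit.

```java
final class SuperscriptDemo {
    // Superscript digits 0-9, indexed by digit value. 1, 2 and 3 live in the
    // Latin-1 Supplement block; the rest are in Superscripts and Subscripts.
    private static final char[] SUPERSCRIPTS = {
        '\u2070', '\u00B9', '\u00B2', '\u00B3', '\u2074',
        '\u2075', '\u2076', '\u2077', '\u2078', '\u2079'
    };

    static String toSuperscript(int num) {
        String digits = Integer.toString(num);
        StringBuilder res = new StringBuilder(digits.length());
        for (int i = 0; i < digits.length(); i++) {
            char ch = digits.charAt(i);
            // Assumption: a leading '-' maps to SUPERSCRIPT MINUS (U+207B).
            res.append(ch == '-' ? '\u207B' : SUPERSCRIPTS[ch - '0']);
        }
        return res.toString();
    }

    // Renders each char as U+XXXX so the result can be checked on any console.
    static String codePoints(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) out.append(' ');
            out.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // ASCII-only output, so it survives any console code page.
        System.out.println(codePoints(toSuperscript(1234567890)));
    }
}
```

If this prints U+00B9 U+00B2 U+00B3 U+2074 U+2075 U+2076 U+2077 U+2078 U+2079 U+2070, the Java side is fine. That would also fit the symptom: ¹, ² and ³ exist in the Windows-1252 code page, while U+2070 and U+2074 to U+2079 do not, so a console that is still encoding as Windows-1252 can only show the first three.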

I found the first fix by digging through the link in his second comment and following links from there. An answer by erickson (based on one by "Edward Grech") explains that you can set an environment variable named JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF-8, which has the same effect as launching with java -Dfile.encoding=UTF-8 … com.x.Main, in order to get the Eclipse console working properly if you only plan to run the program locally and have no need to build the project to share. Not ideal, but it works. It is not a supported setting, however, so you run some risks.

The second is slightly more user-friendly and comes from an answer by "spider". It uses -Dfile.encoding too, but on the command line rather than as an environment variable, so you could create a tidy little batch file that first runs chcp 65001 to set the Windows console's code page to 65001 (UTF-8) and then launches the jar.

C:\>chcp 65001
C:\>java -jar -Dfile.encoding=UTF-8 path/to/your/runnable/jar

^Console input, quoted from "spider". This is essentially what you'd modify and add to the run.bat used to run your jar in cmd.

The third is on this page, posted by "McDowell", at the very bottom under

Printing characters as UTF-8

This method involves manipulating the console to work as a file handle and has some convenient code snippets.
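McDowell's snippets are not reproduced here, so the following is not his code. It is only a minimal sketch of the general idea of forcing UTF-8 on the output stream yourself instead of relying on file.encoding. The class name Utf8Console and the helper utf8 are hypothetical names of mine.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class Utf8Console {
    // Wraps a stream so text is encoded as UTF-8 regardless of file.encoding.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        out.println("\u00B9\u00B2\u00B3\u2074\u2075");
    }
}
```

This only controls the bytes Java writes. Whether they show up as superscripts still depends on the console decoding them as UTF-8 (for example after chcp 65001) and on the console font having those glyphs.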

Eclipse doesn't need any special settings to get UTF-8 output working. The "Encoding" option in Run Configurations dictates the encoding the forked process will pick up. — Alastair McCormack, Mar 17 '16 at 10:53
`chcp 65001` is a bastardised UTF-8 charmap. It doesn't support the full Unicode map and input is broken — Alastair McCormack, Mar 17 '16 at 10:54
