0

I'm fairly new to Java, so please be gentle.

This appears to be a common question, but I still seem to be unable to find the answer I'm looking for.

I'm writing a console app that will take a string of characters and print them out on the screen but bigger. For example: "JAVA" would print as:

 JJJJJ   A   V   V   A
   J    A A  V   V  A A
   J   A   A V   V A   A
   J   AAAAA V   V AAAAA
   J   A   A V   V A   A
 J J   A   A  V V  A   A
 JJJ   A   A   V   A   A

Nothing special there. The string gets broken down into characters, each character is then looked up in a large switch case, which then returns the bigger letter. After some wrapping is done where necessary, the big letters are glued together and printed.
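
Roughly, my approach looks like this (a trimmed-down sketch with placeholder 5-row glyph patterns, not my real ones):

public class BigText {

    private static final int HEIGHT = 5; // the real glyphs are 7 rows tall; 5 keeps the sketch short

    // Placeholder glyphs: one String per row, all the same width for a given letter.
    private static String[] bigLetter(char c) {
        switch (Character.toUpperCase(c)) {
            case 'J': return new String[] {"JJJJJ", "   J ", "   J ", " J J ", "  J  "};
            case 'A': return new String[] {"  A  ", " A A ", "AAAAA", "A   A", "A   A"};
            case 'V': return new String[] {"V   V", "V   V", "V   V", " V V ", "  V  "};
            default:  return new String[] {"     ", "     ", "     ", "     ", "     "};
        }
    }

    public static void print(String text) {
        for (int row = 0; row < HEIGHT; row++) {
            StringBuilder line = new StringBuilder();
            for (char c : text.toCharArray()) {
                line.append(bigLetter(c)[row]).append(' '); // glue the letters together row by row
            }
            System.out.println(line);
        }
    }

    public static void main(String[] args) {
        print("JAVA");
    }
}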

That was too easy and since I like to make my life more challenging, I want to allow certain Unicode characters, such as a black heart (❤) \u2674 (which is what the Windows character map claims it is, anyway). Basically, some kind of code passed in the parameter will be replaced internally within the string and interpreted as a Unicode character. For example, JAVA {HEART} might output (I know the heart is messed up here, but it displays fine with a monospaced font):

 JJJJJ   A   V   V   A     ❤❤  ❤❤
   J    A A  V   V  A A   ❤❤❤❤❤❤
   J   A   A V   V A   A   ❤❤❤❤❤
   J   AAAAA V   V AAAAA    ❤❤❤❤
   J   A   A V   V A   A     ❤❤❤
 J J   A   A  V V  A   A      ❤❤
 JJJ   A   A   V   A   A       ❤

As far as I'm aware, the Unicode character should fit into a char (2 bytes) and should definitely fit into an int (4 bytes), so I did an experiment. Word on the street is that casting to an int will give you the character code.

String unicodeStr = "\u2674"; // Unicode for black heart.
System.out.println(unicodeStr.getBytes().length); // Only one byte, so should fit into a char, right?

char unicode = '\u2674'; // All good so far.
System.out.println((int)unicode); // Returns 9844. WTAF??

System.exit(-1); // Argh! Oh noez... Panic!

Obviously I'm misunderstanding something here, but I don't know what. Please could someone explain why I'm getting the wrong char code? I've tried using codePoints but obviously I don't know what I'm doing with that either. If anyone could please point me in the right direction, I'd be eternally grateful. The objective is to split the string into characters and translate each character into a big letter via a switch case.

thefuzzy0ne
  • *As far as I'm aware, the unicode should fit into a char (2 bytes)* Not really; it depends on the encoding. What if you are using `UTF-32`, for example? Even in `UTF-8` it could be more than 2 bytes. – Eugene May 24 '19 at 09:24
  • I also don't understand what your ultimate goal is... what are you trying to achieve with that "black heart"? A _code point_ is a number that Unicode knows how to map - it's like a _huge_ switch statement in a way. A _code point_ is made of one or more _code units_ that have a certain size depending on the encoding (8 bits for UTF-8, 16 for UTF-16...). – Eugene May 24 '19 at 09:43
  • Thus, this will give you different results... `String unicodeStr = "\u2674"; // Unicode for black heart. System.out.println(unicodeStr.getBytes(StandardCharsets.UTF_16).length); System.out.println(unicodeStr.getBytes(StandardCharsets.UTF_8).length);` – Eugene May 24 '19 at 09:44
  • *"it displays fine with a monospaced font"* Code blocks **use** a mono spaced font! – Andrew Thompson May 24 '19 at 18:40
  • @Andrew Thompson - Sorry, but the hearts are not monospaced. Look at the two hearts on the top right, then the heart immediately below it. They are not monospaced. – thefuzzy0ne May 27 '19 at 12:42

3 Answers

2

According to the specification, getBytes() encodes the string using the platform's default charset, which is not the same as Java's internal encoding, UTF-16. That is why your getBytes() call returns a byte array of length 1: your platform's default charset is evidently a single-byte one, so the unmappable heart character is simply replaced, leaving a single byte.

But in fact, the UTF-16 representation of the character '\u2674' does fit into a single char, and 9844 is just the decimal representation of the hex value 0x2674, so the cast actually gave you the value you asked for.
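
You can check both points quickly; the default charset (and therefore the first length printed) depends on your platform:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HeartCheck {
    public static void main(String[] args) {
        String s = "\u2674";
        System.out.println(Charset.defaultCharset());                      // platform-dependent
        System.out.println(s.getBytes().length);                           // 1 on a single-byte default charset (the heart is replaced)
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);  // 2: one 16-bit code unit
        System.out.println((int) '\u2674');                                // 9844 == 0x2674
    }
}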

But I still recommend using code points, because some characters can't be stored in a single char, for example U+1D161 (a musical symbol outside the Basic Multilingual Plane).

To iterate over a String by code points, you can use the following code:

public class Main {

    public static void main(String[] args) {
        // U+1D161 (a musical symbol) needs a surrogate pair; U+2665 is BLACK HEART SUIT.
        String str = "JAVA\uD834\uDD61\u2665";
        int len = str.length();                                     // 7 chars: the musical symbol takes two
        System.out.println(str);                                    // the musical symbol may not render everywhere
        System.out.println("size: " + str.codePointCount(0, len));  // 6 code points

        for (int i = 0; i < len; ) {
            int cp = str.codePointAt(i);
            i += cp > 0xFFFF ? 2 : 1;   // supplementary code points occupy two chars

            if (cp == "\u2665".codePointAt(0)) {
                System.out.println("Heart!");
            } else if (cp == "\uD834\uDD61".codePointAt(0)) {
                System.out.println("Music!");
            } else {
                System.out.println((char) cp);
            }
        }
    }

}

The output:

JAVA♥
size: 6
J
A
V
A
Music!
Heart!

Why should we use \uD834\uDD61 to represent U+1D161?

According to Wikipedia, in order to represent the characters U+10000 through U+10FFFF in UTF-16, we first subtract 0x10000 from 0x1D161, which gives 0x0D161, or (0000 1101 0001 0110 0001) in binary (20 bits).

Then we take the high ten bits, (0000 1101 00), which is 0x034; adding 0x034 to 0xD800 gives 0xD834. This is the high surrogate, the first UTF-16 code unit of U+1D161.

For the low ten bits, (01 0110 0001), or 0x161, we add 0xDC00 and get 0xDD61, the low surrogate.
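
You can verify that arithmetic with a couple of bit operations and compare the result with what the JDK computes (Character.highSurrogate and Character.lowSurrogate):

public class SurrogateCheck {
    public static void main(String[] args) {
        int cp = 0x1D161;
        int offset = cp - 0x10000;                         // 0x0D161
        char high = (char) (0xD800 + (offset >> 10));      // high ten bits -> 0xD834
        char low  = (char) (0xDC00 + (offset & 0x3FF));    // low ten bits  -> 0xDD61
        System.out.printf("%04X %04X%n", (int) high, (int) low);  // D834 DD61
        System.out.println(high == Character.highSurrogate(cp));  // true
        System.out.println(low == Character.lowSurrogate(cp));    // true
    }
}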

There is one more detail: String.codePointAt takes a char index as its parameter, and a single code point may occupy two chars (a surrogate pair). So before incrementing i, we check whether the current code point is larger than 0xFFFF and, if it is, advance by 2 instead of 1 (Character.charCount(cp) performs the same check).

BTW, if you are using Java 1.8, you can use the new String.codePoints API, which returns an IntStream.
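
For example, here is a sketch of the same loop rewritten with codePoints(), comparing against the numeric code points directly:

public class CodePointsDemo {
    public static void main(String[] args) {
        String str = "JAVA\uD834\uDD61\u2665";
        str.codePoints().forEach(cp -> {
            if (cp == 0x2665) {
                System.out.println("Heart!");
            } else if (cp == 0x1D161) {
                System.out.println("Music!");
            } else {
                System.out.println((char) cp);  // the remaining characters are plain BMP chars
            }
        });
    }
}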

ZumiKua
  • Thanks a lot for this answer. It's going to take me a while to study it and fully comprehend it. The code for a HEAVY BLACK HEART is 0x2674, but I can't remember what font that was for. 0x2665 is actually BLACK HEART SUIT, which I don't believe is the same thing. I totally forgot that different fonts can represent characters in completely different ways and forgot to mention the font I used. – thefuzzy0ne May 27 '19 at 12:53
  • It was MS Gothic! – thefuzzy0ne May 27 '19 at 13:01
  • @thefuzzy0ne After reviewing my answer, I found that my previous code was completely wrong. The code is now fixed, and I added a little explanation; I hope it helps. Sorry for the confusion. – ZumiKua May 28 '19 at 10:30
1

First, the character that you showed in your question is the Unicode character HEAVY BLACK HEART, or U+2764, so its code is 0x2764.

Then, when you convert a char to an int, you get its UTF-16 code unit value, which for a BMP character like this one is its code point. So yes, (int) '\u2674' is 0x2674, or 9844 in decimal, and it is no surprise that you got that.

If you want to print a character, just print it without conversion:

System.out.print(unicode);          // no end of line after the character
System.out.println(unicode);        // character followed with an end of line
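
For instance (10084 is just the decimal form of 0x2764):

public class HeartDemo {
    public static void main(String[] args) {
        char heart = '\u2764';              // HEAVY BLACK HEART
        System.out.println((int) heart);    // 10084, i.e. 0x2764
        System.out.println(heart);          // prints the heart itself (console and font permitting)
    }
}
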
Serge Ballesta
1

unicodeStr.getBytes().length is Charset-dependent

Check this one out: Bytes of a string in Java
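
A short illustration (standard library only; the first length depends on your platform's default charset):

import java.nio.charset.StandardCharsets;

public class ByteLengths {
    public static void main(String[] args) {
        String s = "\u2674";
        System.out.println(s.getBytes().length);                            // platform default charset
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 3
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);   // 2
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);     // 4 (includes a byte-order mark)
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 1 (replaced with '?')
    }
}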