
I need a programmatic way to get the decimal value of each character in a String, so that I can encode them as HTML entities, for example:

UTF-8:

著者名

Decimal:

&#33879;&#32773;&#21517;
Mike Sickler
  • There's no such thing as a "UTF-8 character" or "decimal encoding". "UTF-8" is an encoding, and "decimal" is a number base. – Kerrek SB Jul 20 '11 at 18:10
  • You're right. I've revised the question – Mike Sickler Jul 20 '11 at 18:13
  • Cheers. I don't know Java, but since your characters are in the BMP, they're just the literal values of the string elements (Java has 16-bit strings) -- can't you just say `str[0]`, etc.? – Kerrek SB Jul 20 '11 at 18:16

2 Answers


I suspect you're just interested in a conversion from char to int, which is implicit:

for (int i = 0; i < text.length(); i++)
{
    char c = text.charAt(i);
    int value = c; // implicit widening conversion from char to int
    System.out.println(value);
}
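
Since the question is ultimately about HTML entities, those int values can be dropped straight into &#...; escapes. A minimal sketch (the toDecimalEntities helper name is mine, not from the answer):

static String toDecimalEntities(String text) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
        // Each char's numeric value becomes a decimal character reference
        sb.append("&#").append((int) text.charAt(i)).append(';');
    }
    return sb.toString();
}

For the question's example, toDecimalEntities("著者名") yields "&#33879;&#32773;&#21517;". This per-char version is fine for BMP characters like these, but would emit two surrogate entities for supplementary characters.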

EDIT: If you want to handle surrogate pairs, you can use something like:

for (int i = 0; i < text.length(); i++)
{
    int codePoint = text.codePointAt(i);
    // Skip over the second char in a surrogate pair
    if (codePoint > 0xffff)
    {
        i++;
    }
    System.out.println(codePoint);
}
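
As an aside (this API is newer than the answer, Java 8+), String.codePoints() does the surrogate decoding for you and yields one int per code point:

text.codePoints().forEach(System.out::println);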
Jon Skeet

Ok, after reading Jon's post and still musing about surrogates in Java, I decided to be a bit less lazy and google it. There's actually support for surrogates in the Character class; it's just a bit... roundabout.

So here's the code that'll work correctly, assuming valid input:

    for (int i = 0; i < str.length(); i++) {
        char ch = str.charAt(i);
        if (Character.isHighSurrogate(ch)) {
            // Combine the high/low surrogate pair into a single code point
            System.out.println("Codepoint: " +
                   Character.toCodePoint(ch, str.charAt(i + 1)));
            i++; // skip the low surrogate we just consumed
        } else {
            System.out.println("Codepoint: " + (int) ch);
        }
    }
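
For example, with an input containing a supplementary character (the sample input is mine, not from the answer):

    String str = "a\uD835\uDD0A"; // "a" followed by U+1D50A, a surrogate pair in UTF-16
    // prints:
    // Codepoint: 97
    // Codepoint: 120074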
Voo
  • There's actually some support in String as well, which makes this a bit simpler - see my revised answer. – Jon Skeet Jul 20 '11 at 18:40
  • @Jon Skeet: Yeah, the string method is decidedly nicer. And I agree that it's not really a problem most of the time - apart from ancient scripts (hardly of interest for the general public), the only things of interest there seem to be some mathematical and musical symbols. But it's such a rare chance to be able to nitpick a bit ;) – Voo Jul 20 '11 at 19:11
  • Well, the fundamental problem remains that UTF-16 is a variable-width encoding and that there's no O(1) algorithm to determine the number of codepoints in a string, or even to check that the string is a valid sequence of Unicode codepoints. – Kerrek SB Jul 20 '11 at 22:32