How can I get a Unicode character's code?

Question

Let's say I have this:

char registered = '®';

or an umlaut, or whatever unicode character. How could I get its code?

score 118 · Accepted Answer · edited Mar 15 '21 at 19:53


Just convert it to int:

char registered = '®';
int code = (int) registered;

In fact there's an implicit conversion from char to int so you don't have to specify it explicitly as I've done above, but I would do so in this case to make it obvious what you're trying to do.

This will give the UTF-16 code unit - which is the same as the Unicode code point for any character defined in the Basic Multilingual Plane. (And only BMP characters can be represented as char values in Java.) As Andrzej Doyle's answer says, if you want the Unicode code point from an arbitrary string, use Character.codePointAt().
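To illustrate the difference between a UTF-16 code unit and a Unicode code point, here is a minimal sketch. The class name, the helper `codeOf`, and the sample character (U+1D11E, written with escapes) are chosen only for illustration.

```java
class CodeUnitVsCodePoint {
    // Code of a single char: its UTF-16 code unit, which equals the
    // code point for any character in the Basic Multilingual Plane.
    static int codeOf(char c) {
        return c; // implicit widening conversion from char to int
    }

    public static void main(String[] args) {
        char registered = '\u00AE'; // the registered sign, U+00AE
        System.out.println(codeOf(registered)); // 174

        // U+1D11E lies outside the BMP, so it occupies two chars (a surrogate pair).
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                  // 2
        System.out.println((int) clef.charAt(0));           // 55348, the high surrogate only
        System.out.println(Character.codePointAt(clef, 0)); // 119070, i.e. 0x1D11E
    }
}
```

Casting `charAt(0)` yields only half of the pair, whereas `Character.codePointAt` combines both surrogates into the real code point.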

Once you've got the UTF-16 code unit or the Unicode code point, both of which are integers, it's up to you what you do with them. If you want a string representation, you need to decide exactly what kind of representation you want. (For example, if you know the value will always be in the BMP, you might want a fixed 4-digit hex representation prefixed with U+, e.g. "U+0020" for space.) That's beyond the scope of this question though, as we don't know what the requirements are.
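One possible string representation is the "U+" form mentioned above. The sketch below assumes you want at least four uppercase hex digits and more only when the code point needs them; the class and method names are hypothetical.

```java
class CodePointFormat {
    // Formats a code point as "U+" followed by at least four uppercase
    // hex digits; larger code points simply use more digits.
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(toUPlus(' '));      // U+0020
        System.out.println(toUPlus('\u00AE')); // U+00AE
        System.out.println(toUPlus(0x1D11E));  // U+1D11E
    }
}
```

If you would rather always have six digits, or an even number of digits, only the format string needs to change.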

edited Mar 15 '21 at 19:53

bn.


answered Jan 05 '10 at 14:20

Jon Skeet


3

@Geo: Anything in the Basic Multilingual Plane, yes. You can't represent characters above U+FFFF in a single char in Java. But a char is effectively defined as a UTF-16 code unit. – Jon Skeet Jan 05 '10 at 14:26
10

It works for every `char` that represents a Unicode character below `U+FFFF` but not for every Unicode character, since `char` cannot represent all of Unicode. Depending on the source of your `char`, you may need to do something more complex (and really should prepare for it too). – JaakkoK Jan 05 '10 at 14:36
5

And to convert it to hex, use `Integer#toHexString()`. – BalusC Jan 06 '10 at 13:41
1

What if it's outside the Basic Multilingual Plane? – fzzfzzfzz Jul 07 '15 at 17:11
1

@fzzfzzfzz: Then you don't start with it as a single `char` at all, but you can use `Character.codePointAt` (or `String.codePointAt`) on the string or char array that contains it. – Jon Skeet Jul 07 '15 at 17:13
Balus is right. Integer.toHexString() needs to be added; otherwise, for "s" I got 115, which does not make sense to the reader!!! The answer should be edited. – Darius Miliauskas Jul 21 '15 at 11:55
@DariusMiliauskas: The OP wanted the code - which is an integer. This answer shows how to get the *integer*. If the OP wants a hex representation, it makes sense to use `Integer.toHexString`, but that's not what was asked for. What aspect of the question makes you think a string is required at all? – Jon Skeet Jul 21 '15 at 12:01
It is not written in the question that he needs the integer. Usually, codes are presented in tables as strings (e.g. http://www.utf8-chartable.de); that's the reason the term "code", as written in the question, is first understood as a string. – Darius Miliauskas Jul 21 '15 at 12:04
@DariusMiliauskas: That may be your interpretation, but it's certainly not mine. The code *is* an integer. Unicode code point 32 is the same as Unicode code point 0x20, because those both represent the same number. Many "code tables" present the same character with many string representations of the number - the point is that the real entity is the code, which is a number... that has to be converted into a string to *display* it, but that doesn't mean the value is logically a string. (Similarly, my date of birth is a date, not a string... but yes, I need to type it in as a string into forms..) – Jon Skeet Jul 21 '15 at 12:06

Andrzej Doyle · Answer 2 · 2010-01-05T14:50:22.330

A more complete, albeit more verbose, way of doing this would be to use the Character.codePointAt method. This will handle supplementary characters, which are encoded as a surrogate pair (a 'high surrogate' followed by a 'low surrogate') because they cannot be represented by a single value within the range of a char.

In the example you've given this is not strictly necessary - if the (Unicode) character can fit inside a single (Java) char (such as the registered local variable) then it must fall within the \u0000 to \uffff range, and you won't need to worry about surrogate pairs. But if you're looking at potentially higher code points, from within a String/char array, then calling this method is wise in order to cover the edge cases.

For example, instead of

String input = ...;
char fifthChar = input.charAt(4);
int codePoint = (int)fifthChar;

use

String input = ...;
int codePoint = Character.codePointAt(input, 4);

Not only is this slightly less code in this instance, but it will handle detection of surrogate pairs for you.
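To process every code point in a string rather than one position, something along these lines should work. It is a sketch: the class and method names are illustrative, and `String.codePoints()` requires Java 8 or later.

```java
class CodePointWalk {
    // Returns every code point in the string; a surrogate pair
    // contributes one entry instead of two.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    // Equivalent manual loop: step forward by one or two chars,
    // depending on how many chars the current code point occupies.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "a\uD834\uDD1Eb"; // 'a', U+1D11E, 'b'
        for (int cp : codePointsOf(input)) {
            System.out.println(cp); // 97, then 119070, then 98
        }
        System.out.println(countCodePoints(input)); // 3
    }
}
```

Note that the index passed to `codePointAt` is a char index, not a code point index, so index 4 is the fifth code point only if no surrogate pair precedes it.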

Also, there is the same method in String class, [String#codePointAt](https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#codePointAt-int-) — mosov.a, Apr 28 '18 at 08:27

Felype · Answer 3 · 2015-06-11T14:21:56.337

11

In Java, char is technically a "16-bit integer", so you can simply cast it to int and you'll get its code. From Oracle:

The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).

So you can simply cast it to int.

char registered = '®';
System.out.println(String.format("This is an int-code: %d", (int) registered));
System.out.println(String.format("And this is a hex code: %x", (int) registered));

edited Jun 11 '15 at 14:21

answered Apr 15 '13 at 19:16

Felype


1

It works even with euro character `String.format("%x", (int) '€') == 0x20ac == '\u20ac'` – ATorras Jun 11 '15 at 13:07

Darius Miliauskas · Answer 4 · 2015-07-21T12:48:27.423

1

For me, only "Integer.toHexString(registered)" worked the way I wanted:

char registered = '®';
System.out.println("Answer:"+Integer.toHexString(registered));

This answer will give you only the string representations that are usually presented in the tables. Jon Skeet's answer explains more.

edited Jul 21 '15 at 12:48

answered Jul 21 '15 at 12:00

Darius Miliauskas


2

As noted in the comments on my answer, that's because "the way you wanted" was to produce a hex representation of the code - which isn't what this question asked. The code itself is an integer; the matter of "How do I create a hex representation of an integer" is a different matter. (For Unicode code points, you should also consider how many hex digits you want - you might want to use 4 for a BMP character and 6 for others, or always 6, or always an even number, for example...) – Jon Skeet Jul 21 '15 at 12:24
That makes the point of what you wrote. What makes you think that the code is an integer by definition? For me, a code is a combination of symbols, not necessarily numbers or integers. Your answer was really very useful, but in the end I spent half an hour finding out how to get the code as I understand it; perhaps this will save other users a few minutes. – Darius Miliauskas Jul 21 '15 at 12:35
2

That's how Unicode defines it. From http://www.unicode.org/standard/principles.html: "A single number is assigned to each code element defined by the Unicode Standard. Each of these numbers is called a code point and, when referred to in text, is listed in hexadecimal form following the prefix "U+". For example, the code point U+0041 is the hexadecimal number 0041 (equal to the decimal number 65). It represents the character "A" in the Unicode Standard." I've edited my answer to make it clear why the answer to "what is the code for character 'X'" is a number, not a string. – Jon Skeet Jul 21 '15 at 12:38

Michael Gantman · Answer 5 · 2020-10-14T07:47:36.207

There is an open source library, MgntUtils, that has a utility class StringUnicodeEncoderDecoder. That class provides static methods that convert any String into a Unicode sequence and vice versa. Very simple and useful. To convert a String you just do:

String codes = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(myString);

For example a String "Hello World" will be converted into

"\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064"

It works with any language. Here is the link to the article that explains all the details about the library: MgntUtils. Look for the subtitle "String Unicode converter". The library can be obtained as a Maven artifact or taken from GitHub (including source code and Javadoc).
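If adding a dependency is not an option, similar output can be produced with the JDK alone. The sketch below is not the library's implementation; it simply escapes each UTF-16 code unit, so a supplementary character would come out as two escapes. The class and method names are hypothetical.

```java
class UnicodeEscapes {
    // Renders each UTF-16 code unit of the input as a backslash, the
    // letter u, and four lowercase hex digits.
    static String toEscapes(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 6);
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the same sequence as shown above for "Hello World".
        System.out.println(toEscapes("Hello World"));
    }
}
```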

score 0 · Answer 6 · edited Jan 07 '13 at 20:19


Dear friend, Jon Skeet said you can find the character's decimal code, but that is not the character's hex code as it should be written in Unicode, so you should represent character codes via HexCode not in Decimal.

There is an open source tool at http://unicode.codeplex.com that provides complete information about a character or a sentence.

So it is better to create a parser that takes a char as a parameter and returns a hex code as a string:

public static String GetHexCode(char character)
    {
        return String.format("{0:X4}", GetDecimal(character));
    }//end
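The snippet above uses .NET-style formatting (`{0:X4}`) and a `GetDecimal` helper that is not shown, so it will not produce a hex code in Java as written. A possible Java equivalent, as a sketch (the class name and the method name `getHexCode` are illustrative):

```java
class HexCode {
    // Returns the char's UTF-16 code unit as four uppercase hex digits,
    // zero-padded on the left.
    static String getHexCode(char character) {
        return String.format("%04X", (int) character);
    }

    public static void main(String[] args) {
        System.out.println(getHexCode('\u00AE')); // 00AE
        System.out.println(getHexCode('\u20AC')); // 20AC
    }
}
```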

Hope it helps.

edited Jan 07 '13 at 20:19

Imaky


answered Jan 06 '10 at 13:39

Nasser Hadjloo


1

"so you should represent character codes via HexCode not in Deciaml" - it's a number. Hex vs decimal only comes into play when converting this to a string, and there's no requirement for that within the question at all. – Jon Skeet Jul 21 '15 at 12:02
1

How do you think posting a link to a C# tool, plus some C# code, is helping the OP with his Java problem? – Ferrybig Apr 13 '18 at 21:22

Yokubboy Yokubov · Answer 7 · 2021-05-24T14:49:54.970

-1

// You can get a character's Unicode code like this:

int a = 'a'; // 'a' is the letter or symbol whose code you want (here 97)

// You can get the symbol or letter back from its code like this:

System.out.println((char) 123); // 123 is the decimal code to convert; "\123" would be an octal escape instead
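Going from a code back to text can also be sketched with `Character.toChars`, which works for code points outside the BMP as well. The class and method names here are hypothetical.

```java
class CodeToText {
    // Converts any valid Unicode code point to a String of one or two chars.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(fromCodePoint(97));               // a
        System.out.println(fromCodePoint(123));              // {
        System.out.println(fromCodePoint(0x1D11E).length()); // 2, a surrogate pair
    }
}
```

`Character.toChars` throws an `IllegalArgumentException` for values that are not valid code points, so untrusted input may need a `Character.isValidCodePoint` check first.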

edited May 24 '21 at 14:49

answered May 24 '21 at 14:44

Yokubboy Yokubov

