129

I want to display a Unicode character in Java. If I do this, it works just fine:

String symbol = "\u2202";

symbol is equal to "∂". That's what I want.

The problem is that I know the Unicode number and need to create the Unicode symbol from that. I tried (to me) the obvious thing:

int c = 2202;
String symbol =  "\\u" + c;

However, in this case, symbol is equal to "\u2202". That's not what I want.

How can I construct the symbol if I know its Unicode number (but only at run-time; I can't hard-code it in like the first example)?

Paul Reiners
  • 8,576
  • 33
  • 117
  • 202
  • 1
    Remove the first backslash, so that instead of escaping the backslash it escapes the Unicode sequence. Using "\\" tells Java that you want to print out "\", not use it as part of an escape sequence for Unicode characters. If you remove the first one then it will instead escape the Unicode sequence and not the second backslash. At least, it will to the best of my knowledge. – Nic Apr 25 '13 at 17:52
  • You can simply convert `int` to `char` the following way: `char ch = (char)c;`. You may create a string like this: `String symbol = "" + (char)c;`. When adding a character to an existing string, this type of conversion should be the easiest way. Example: `String text = "You typed the following character: " + (char)c;` – Martin Rosenau Sep 17 '21 at 13:50

13 Answers

135

If you want to get a UTF-16 encoded code unit as a char, you can parse the integer and cast to it as others have suggested.

If you want to support all code points, use Character.toChars(int). This will handle cases where code points cannot fit in a single char value.

Doc says:

Converts the specified character (Unicode code point) to its UTF-16 representation stored in a char array. If the specified code point is a BMP (Basic Multilingual Plane or Plane 0) value, the resulting char array has the same value as codePoint. If the specified code point is a supplementary code point, the resulting char array has the corresponding surrogate pair.
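
For example, a minimal sketch (the code points shown are just illustrations):

// BMP code point: one char is enough
String partial = new String(Character.toChars(0x2202));  // "∂"
// supplementary code point: toChars returns a surrogate pair of two chars
String clef = new String(Character.toChars(0x1D11E));    // "𝄞", MUSICAL SYMBOL G CLEF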

Basil Bourque
  • 303,325
  • 100
  • 852
  • 1,154
McDowell
  • 107,573
  • 31
  • 204
  • 267
  • While this is a more general solution and in many cases you should use this over the accepted answer, the accepted answer is a closer match to the specific problem Paul asked for. – Jochem Kuijpers Dec 03 '16 at 02:09
  • 2
    Firstly, thanks! In Scala, I am still unable to parse characters that are larger than a `char`. ```scala> "👨‍🎨".map(_.toInt).flatMap((i: Int) => Character.toChars(i)).map(_.toHexString)``` gives ```res11: scala.collection.immutable.IndexedSeq[String] = Vector(f468, 200d, f3a8)``` This emoji, "male singer", is addressed with the three code points `U+1f468`, `U+200d` and `U+1f3a8`. The most significant digit is missing. I can add it with a bitwise OR (https://stackoverflow.com/a/2220476/1007926), but don't know how to determine which parsed characters have been truncated. Thanks! – Peter Becich Jan 10 '18 at 03:10
  • 1
    @JochemKuijpers I don't agree that _"the accepted answer is a closer match to the specific problem"_. The OP explicitly asked _"How can I construct the symbol **if I know its Unicode number**...?"_, and the accepted answer cannot work if that _"Unicode number"_ is outside the BMP. For example, the accepted answer fails for the valid codepoint 0x1040C because it is in the SMP. It is a poor answer, and should be corrected or deleted. – skomisa Dec 21 '19 at 05:55
  • @skomisa OPs scenario is limited to the representation of hexadecimal Unicode escape sequence. If you have a character that should be encoded as a surrogate pair, then that is reflected in these escape sequences, so it still works out in the end. As I said, this is a more general solution and you should use this. – Jochem Kuijpers Dec 21 '19 at 17:26
82

Just cast your int to a char. You can convert that to a String using Character.toString():

String s = Character.toString((char)c);

EDIT:

Just remember that the escape sequences in Java source code (the \u bits) are in HEX, so if you're trying to reproduce an escape sequence, you'll need something like int c = 0x2202.
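
A minimal runtime sketch of that idea (assuming the value arrives as a hex string and fits in a single char):

String hex = "2202";                     // known only at run-time
int c = Integer.parseInt(hex, 16);       // parse as HEX: 0x2202
String s = Character.toString((char) c); // "∂"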

dty
  • 18,795
  • 6
  • 56
  • 82
  • 5
    That's just giving me a square box, ࢚. It's not giving me "∂". – Paul Reiners Apr 07 '11 at 18:54
  • That's probably because the console you're printing it to doesn't understand the character encoding you're writing out, or else the console font doesn't have a glyph for the character. – dty Apr 07 '11 at 19:00
  • But the same console is printing out "∂". – Paul Reiners Apr 07 '11 at 19:01
  • Like this: String symbol = "\u2202"; The problem is that I only know the value 2202 at run-time. – Paul Reiners Apr 07 '11 at 19:12
  • 29
    Danger, Will Robinson! Don't forget that Unicode code points *will not necessarily fit in a char*. So you need to be absolutely sure ahead of time that your value of `c` is smaller than 0x10000, or else this approach will break horribly. – David Given Mar 13 '12 at 22:29
  • @DavidGiven Actually, less than 0xFFFF because that is the max of the hexadecimal Unicode. 10000 is a decimal number. 0xFFFF is equivalent to 65535 in decimal. – Nic Apr 25 '13 at 17:54
  • 1
    @NickHartley Sorry, don't follow --- did you misread 0x10000 for 10000? – David Given Apr 25 '13 at 21:20
  • @DavidGiven Yes, I did. Sorry about that :P. But either way the max for a single Unicode character is 0xFFFF, not 0x10000 – Nic Apr 26 '13 at 22:20
  • 11
    That's why I said 'below'! And I need to emphasise that, despite the fact that Java chars only go up to 0xffff, Unicode code points go up to 0x10ffff. The Unicode standard got changed after Java was designed. These days Java chars technically hold UTF-16 words, not Unicode code points, and forgetting this will cause hideous breakage when your application encounters an exotic script. – David Given Apr 27 '13 at 15:18
  • 3
    @DavidGiven thanks for `Java chars go up to 0xFFFF`. I did not know that. – Tony Ennis Aug 29 '13 at 12:21
  • @PaulReiners I'm facing the same problem. How did you fix it? – parsecer Jul 24 '19 at 14:12
  • You can always do: `new String(Character.toChars(0x1F4F7))` if it is > FFFF - Converts the Codepoint to char[] and then to string. – Pierre Jul 26 '19 at 10:45
  • I don't remember. – Paul Reiners Jul 28 '19 at 22:56
  • note that Unicode hasn't "fit" in a `char` since Unicode 3, so this answer does not work for Unicode (not even back in 2011), it only works for the subset of Unicode whose codepoints fit in 16 bits. Which is a smaller subset than the subset of Unicode codepoints that require more than 16 bits. – Mike 'Pomax' Kamermans Nov 04 '20 at 16:26
25

The other answers here either support only Unicode up to U+FFFF (the answers dealing with just one instance of char) or don't explain how to get to the actual symbol (the answers stopping at Character.toChars() or using an incorrect method after that), so I'm adding my answer here, too.

To support supplementary code points also, this is what needs to be done:

// this character:
// http://www.isthisthingon.org/unicode/index.php?page=1F&subpage=4&glyph=1F495
// using a code point here, not the U+n notation;
// in the U+n notation this would be U+1F495, i.e. 0x1F495
int codePoint = 128149;
// converting to a char[] pair (the surrogate pair)
char[] charPair = Character.toChars(codePoint);
// and to a String containing the character we want
String symbol = new String(charPair);

// we now have symbol with the desired character as the first item
// confirm that we indeed have the character with code point 128149
System.out.println("First code point: " + symbol.codePointAt(0));

I also did a quick test as to which conversion methods work and which don't:

int codePoint = 128149;
char[] charPair = Character.toChars(codePoint);

System.out.println(new String(charPair, 0, 2).codePointAt(0)); // 128149, worked
System.out.println(charPair.toString().codePointAt(0));        // 91, didn't work
System.out.println(new String(charPair).codePointAt(0));       // 128149, worked
System.out.println(String.valueOf(codePoint).codePointAt(0));  // 49, didn't work
System.out.println(new String(new int[] {codePoint}, 0, 1).codePointAt(0));
                                                               // 128149, worked

--

Note: as @Axel mentioned in the comments, with Java 11 there is Character.toString(int codePoint), which would arguably be best suited for the job.
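
A one-line sketch, assuming Java 11+ (using the same example code point as above):

String symbol = Character.toString(128149); // "💕", i.e. U+1F495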

eis
  • 51,991
  • 13
  • 150
  • 199
7

This one worked fine for me.

  String cc2 = "2202";
  String text2 = String.valueOf(Character.toChars(Integer.parseInt(cc2, 16)));

Now text2 will have ∂.

MeraNaamJoker
  • 91
  • 1
  • 3
6

String st = "2202";
int cp = Integer.parseInt(st, 16); // parse st as a hexadecimal number
char[] c = Character.toChars(cp);
System.out.println(c); // displays the character corresponding to '\u2202'
  • 1
    While this post might answer the question, an explanation is required as to what you are doing; to improve the quality and readability of your answer – Ajil O. Jul 24 '17 at 06:14
  • 1
    Thanks, it really helped me! Works fine and is easier than other solutions here (really, Java people soo like to overcomplicate things). – parsecer Jul 24 '19 at 14:32
6

Remember that char is an integral type, and thus can be given an integer value, as well as a char constant.

char c = 0x2202; // aka 8706 in decimal; \u code points are in hex
String s = String.valueOf(c);
ILMTitan
  • 10,751
  • 3
  • 30
  • 46
  • That's just giving me a square box, ࢚. It's not giving me "∂". – Paul Reiners Apr 07 '11 at 18:52
  • 3
    That is because 2202 is not the `int` you were looking for. You were looking for 0x2202. My fault. In any case, if you have the `int` of the code point you are looking for, you can just cast it to a `char`, and use it (to construct a `String` if you wish). – ILMTitan Apr 07 '11 at 21:23
4

Although this is an old question, there is a very easy way to do this in Java 11, which was released today: you can use a new overload of Character.toString():

public static String toString​(int codePoint)

Returns a String object representing the specified character (Unicode code point). The result is a string of length 1 or 2, consisting solely of the specified codePoint.

Parameters:
codePoint - the codePoint to be converted

Returns:
the string representation of the specified codePoint

Throws:
IllegalArgumentException - if the specified codePoint is not a valid Unicode code point.

Since:
11

Since this method supports any Unicode code point, the length of the returned String is not necessarily 1.

The code needed for the example given in the question is simply:

    int codePoint = '\u2202';
    String s = Character.toString(codePoint); // <<< Requires JDK 11 !!!
    System.out.println(s); // Prints ∂
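
For a supplementary code point, the returned String has length 2 (a surrogate pair). A small sketch, assuming JDK 11+ (the code point is just an example):

    String two = Character.toString(0x1F495);                 // "💕", a supplementary code point
    System.out.println(two.length());                         // 2 (UTF-16 code units)
    System.out.println(two.codePointCount(0, two.length()));  // 1 code point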

This approach offers several advantages:

  • It works for any Unicode code point rather than just those that can be handled using a char.
  • It's concise, and it's easy to understand what the code is doing.
  • It returns the value as a string rather than a char[], which is often what you want. The answer posted by McDowell is appropriate if you want the code point returned as char[].
skomisa
  • 16,436
  • 7
  • 61
  • 102
  • Some additional clarification on this one as this answer did make it immediately obvious to me how to create the codePoint variable. The syntax here should be: `int codePoint = 0x2202;` Then: `String s = Character.toString(codePoint); // <<< Requires JDK 11 !!!` Or in a one-liner: `System.out.println(Character.toString(0x2202)); // Prints ∂` Hope this helps someone else using this feature of JDK 11. – Loathian Apr 22 '20 at 22:14
  • @Loathian It is false that _"The syntax here should be: `int codePoint = 0x2202;`"_ It shouldn't. I used a Unicode escape in this answer precisely because the OP was using it in the question, which was about creating a Unicode character! Unicode escapes are a basic feature of Java - see [3.3. Unicode Escapes](https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.3) in the Java Language Specification. But feel free to post your own answer if you think your approach is more appropriate. – skomisa Feb 07 '23 at 17:34
2

This is how you do it:

int cc = 2202; // note: the decimal digits are reinterpreted below as hex
char ccc = (char) Integer.parseInt(String.valueOf(cc), 16);
final String text = String.valueOf(ccc);

This solution is by Arne Vajhøj.

ATorras
  • 4,073
  • 2
  • 32
  • 39
Paul Reiners
  • 8,576
  • 33
  • 117
  • 202
  • Are you saying this works? If so, this works because you're reinterpreting two-thousand, two-hundred and two as 0x2202, which is, of course, not the same thing at all. – dty Apr 07 '11 at 20:08
  • 4
    Oh, no, hang on! The Unicode values (the \u escape sequences in Java source) ARE hex! So this is right. You just misled everyone by saying `int c = 2202`, which is wrong! A better solution than this is simple to say `int c = 0x2202` which will save you going via a String, etc. – dty Apr 07 '11 at 20:09
  • 3
    +1 @dty: There is absolutely no call for the middle `char ccc...` line. Just use `int cc = 0x2202;` and then `final String text=String.valueOf(cc);` – Andrew Coonce Jan 26 '15 at 19:51
1

The code below will write the 4 Unicode chars (represented in decimal) for the word "be" in Japanese. Yes, the verb "be" in Japanese has 4 chars! The character values are in decimal, and they have been read into a String[] array -- using split, for instance. If you have octal or hex values, parseInt takes a radix as well.

// pseudo code
// 1. init the String[] containing the 4 unicodes in decimal :: intsInStrs
// 2. allocate enough room for the character pairs :: c2s
// 3. Using Integer.parseInt (... with radix or not) get the right int value
// 4. place it at the next free location in the char array
// 5. convert c2s[] to String
// 6. print

String[] intsInStrs = {"12354", "12426", "12414", "12377"}; // 1.
char[] c2s = new char[intsInStrs.length * 2]; // 2. at most two chars per code point

int offset = 0;
for (String intString : intsInStrs) {
    // 3 + 4. toChars writes 1 or 2 chars into the array and returns
    // how many it wrote, so we advance the offset by its return value
    offset += Character.toChars(Integer.parseInt(intString), c2s, offset);
}

String symbols = new String(c2s, 0, offset); // 5.
System.out.println("\nLooooonger code point: " + symbols); // 6.
// I tested it in Eclipse and Java 7 and it works.  Enjoy
user96265
  • 510
  • 5
  • 6
1

Here is a block that prints out the Unicode chars from \u00c0 to \u00ff:

char[] ca = {'\u00c0'};
for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 16; j++) {
        String sc = new String(ca);
        System.out.print(sc + " ");
        ca[0]++;
    }
    System.out.println();
}
DimaSan
  • 12,264
  • 11
  • 65
  • 75
fjiang_ca
  • 11
  • 1
0

Unfortunately, removing one backslash, as suggested in the first comment (newbiedoodle), does not lead to a good result. Most (if not all) IDEs report a syntax error. The reason is that the Java escaped-Unicode format expects the syntax "\uXXXX", where XXXX are four mandatory hexadecimal digits. Attempts to assemble this string from pieces fail. Of course, "\u" is not the same as "\\u": the first means an escaped 'u', the second means an escaped backslash (which is a backslash) followed by 'u'. Strangely, the Apache pages present a utility which does exactly this; but in reality it is an escape-mimicking utility. Apache has some utilities of its own (I didn't test them) which do this work for you, though they may still not be what you want. Apache Escape Unicode utilities But this utility has a good approach to the solution, in combination with what is described above (MeraNaamJoker). My solution is to create this escaped-mimic string and then convert it back to Unicode (to avoid the real escaped-Unicode restriction). I used it for copying text, so it is possible that in the uencodeStr method it would be better to use '\\u' instead of '\\\\u'. Try it.

  /**
   * Converts character to the mimic unicode format i.e. '\\u0020'.
   * 
   * This format is the Java source code format.
   * 
   *   CharUtils.unicodeEscaped(' ') = "\\u0020"
   *   CharUtils.unicodeEscaped('A') = "\\u0041"
   * 
   * @param ch  the character to convert
   * @return the character in the mimic escaped-unicode format
   */
  public static String unicodeEscaped(char ch) {
    String returnStr;
    final String charEsc = "\\u"; // note: 'static' is not allowed on a local variable

    if (ch < 0x10) {
      returnStr = "000" + Integer.toHexString(ch);
    }
    else if (ch < 0x100) {
      returnStr = "00" + Integer.toHexString(ch);
    }
    else if (ch < 0x1000) {
      returnStr = "0" + Integer.toHexString(ch);
    }
    else
      returnStr = "" + Integer.toHexString(ch);

    return charEsc + returnStr;
  }

  /**
   * Converts the string from UTF8 to mimic unicode format i.e. '\\u0020'.
   * Note: I cannot use the real unicode format, because it is immediately translated
   * to the character at compile time and the editor (i.e. NetBeans) flags it;
   * instead of the real unicode format, i.e. '\u0020', I use the mimic unicode
   * format '\\u0020' as a string, though it doesn't give the same results, of course.
   * 
   * This format is the Java source code format.
   * 
   *   CharUtils.unicodeEscaped(' ') = "\\u0020"
   *   CharUtils.unicodeEscaped('A') = "\\u0041"
   * 
   * @param nationalString the UTF8 string to convert
   * @return the string in the Java mimic escaped-unicode format
   */
  public String encodeStr(String nationalString) throws UnsupportedEncodingException {
    String convertedString = "";

    for (int i = 0; i < nationalString.length(); i++) {
      Character chs = nationalString.charAt(i);
      convertedString += unicodeEscaped(chs);
    }
    return convertedString;
  }

  /**
   * Converts the string from mimic unicode format i.e. '\\u0020' back to UTF8.
   * 
   * This format is the Java source code format.
   * 
   *   CharUtils.unicodeEscaped(' ') = "\\u0020"
   *   CharUtils.unicodeEscaped('A') = "\\u0041"
   * 
   * @param escapedString the string in the Java mimic escaped-unicode format
   * @return the decoded UTF8 string
   */
  public String uencodeStr(String escapedString) throws UnsupportedEncodingException {
    String convertedString = "";

    String[] arrStr = escapedString.split("\\\\u");
    String str;
    for (int i = 1; i < arrStr.length; i++) {
      str = arrStr[i];
      if (!str.isEmpty()) {
        Integer iI = Integer.parseInt(str, 16);
        char[] chaCha = Character.toChars(iI);
        convertedString += String.valueOf(chaCha);
      }
    }
    return convertedString;
  }
hariprasad
  • 555
  • 11
  • 20
-1

char c = (char) 0x2202;
String s = "" + c;

dave110022
  • 64
  • 5
-7

(This answer is in .NET 4.5; a similar approach must exist in Java.)

I am from West Bengal in INDIA. As I understand it, your problem is ... you want to produce something similar to ' অ ' (a letter in the Bengali language), which has the Unicode hex value 0x0985.

Now, if you know this value for your language, how do you produce that language-specific Unicode symbol?

In Dot Net it is as simple as this :

int c = 0X0985;
string x = Char.ConvertFromUtf32(c);

Now x is your answer. But this converts hex by hex; sentence-to-sentence conversion is a job for researchers :P
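
In Java, the equivalent (as other answers on this page show) would be something like:

int c = 0x0985;
String x = new String(Character.toChars(c)); // "অ"
// or, on Java 11 and later:
String y = Character.toString(c);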

GuilhE
  • 11,591
  • 16
  • 75
  • 116
Suman Kr. Nath
  • 135
  • 1
  • 7