14

How can I get the Unicode value of a string in Java?

For example, if the string is "Hi", I need something like \uXXXX\uXXXX.

user489041
  • 3
    Why? What **exactly** are you trying to do? `charAt()` will help. If you want Unicode codepoints instead of UTF-16 code units, then `codePointAt()` is the more correct approach (but that won't help if you want to write `\u` escapes for Java source code or similar). – Joachim Sauer Apr 20 '11 at 17:01
  • To simplify everything: I have an English string from a Java source file. It gets translated to Japanese. I then need the \uXXXX Unicode escapes, because the English string will be replaced with the Japanese one in the source file. – user489041 Apr 20 '11 at 17:05
  • @user: in that case formatting the value returned by `charAt()` as a 4-digit hex number and prepending `\u` should work. – Joachim Sauer Apr 20 '11 at 17:07
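
The `charAt()` suggestion in the comment above can be sketched roughly as follows. This is a minimal illustration, not code from the thread: the class name `CharAtEscape` and the method name `toEscapes` are made up for the example, and it escapes every char, including plain ASCII, to match the `\uXXXX\uXXXX` shape asked for in the question.

```java
class CharAtEscape {
  // Hypothetical helper: turns each UTF-16 code unit of s into a
  // backslash-u escape with exactly four hex digits.
  static String toEscapes(String s) {
    StringBuilder sb = new StringBuilder(s.length() * 6);
    for (int i = 0; i < s.length(); i++) {
      // The cast to int is required: %x does not accept a char argument.
      sb.append(String.format("\\u%04x", (int) s.charAt(i)));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(toEscapes("Hi"));
  }
}
```

For "Hi" this prints `\u0048\u0069`, since 'H' is U+0048 and 'i' is U+0069.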

2 Answers

20

Some Unicode characters span two Java `char` values. To quote http://docs.oracle.com/javase/tutorial/i18n/text/unicode.html:

The characters with values that are outside of the 16-bit range, and within the range from 0x10000 to 0x10FFFF, are called supplementary characters and are defined as a pair of char values.
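
To make the quoted definition concrete, here is a small sketch that compares a BMP character with a supplementary one. The class `SupplementaryDemo` and its `describe` method are hypothetical names chosen for this illustration; the code points U+65E5 and U+1F600 are just sample inputs.

```java
class SupplementaryDemo {
  // Hypothetical helper: reports how many chars and code points a single
  // code point occupies in a Java String, plus its UTF-16 code units in hex.
  static String describe(int codePoint) {
    String s = new String(Character.toChars(codePoint));
    StringBuilder units = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      if (i > 0) {
        units.append(' ');
      }
      units.append(Integer.toHexString(s.charAt(i)));
    }
    return "chars=" + s.length()
        + " codePoints=" + s.codePointCount(0, s.length())
        + " units=" + units;
  }

  public static void main(String[] args) {
    System.out.println(describe(0x65E5));  // a character inside the BMP
    System.out.println(describe(0x1F600)); // a supplementary character
  }
}
```

The first line printed is `chars=1 codePoints=1 units=65e5`; the second is `chars=2 codePoints=1 units=d83d de00`, that is, one code point stored as a high and a low surrogate.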

A correct way to escape non-ASCII characters:

private static String escapeNonAscii(String str) {
  StringBuilder retStr = new StringBuilder();
  for (int i = 0; i < str.length(); ) {
    int cp = Character.codePointAt(str, i);
    i += Character.charCount(cp); // advance by 1 or 2 chars

    if (cp < 128) {
      retStr.appendCodePoint(cp);
    } else {
      // Java Unicode escapes take exactly four hex digits and denote UTF-16
      // code units, so a supplementary code point is written as two escapes
      // (high surrogate, then low surrogate).
      for (char c : Character.toChars(cp)) {
        retStr.append(String.format("\\u%04x", (int) c));
      }
    }
  }
  return retStr.toString();
}
Raghu A
12

This method converts an arbitrary String to an ASCII-safe representation to be used in Java source code (or properties files, for example):

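import java.util.Formatter;

public String escapeUnicode(String input) {
  StringBuilder b = new StringBuilder(input.length());
  Formatter f = new Formatter(b); // formats directly into b
  for (char c : input.toCharArray()) {
    if (c < 128) {
      b.append(c); // ASCII passes through unchanged
    } else {
      // Each UTF-16 code unit becomes one 4-digit escape, so a supplementary
      // character comes out as a surrogate pair of escapes.
      f.format("\\u%04x", (int) c);
    }
  }
  return b.toString();
}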
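
As a rough check of the properties-file use case mentioned above, the following sketch escapes a Japanese string and lets `java.util.Properties` decode it again. It is an illustration under assumptions, not part of the original answer: the class `PropertiesRoundTrip`, the key `greeting`, and the sample text are invented here, and the `escape` method is a simplified per-char variant of the same idea. It does not handle backslashes or other characters that are special in properties files.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesRoundTrip {
  // Simplified per-char escaping: ASCII is copied, everything else becomes
  // a four-digit hex escape of the UTF-16 code unit.
  static String escape(String input) {
    StringBuilder b = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      if (c < 128) {
        b.append(c);
      } else {
        b.append(String.format("\\u%04x", (int) c));
      }
    }
    return b.toString();
  }

  // Writes "greeting=<escaped value>" as properties text and reads it back.
  static String roundTrip(String value) {
    Properties p = new Properties();
    try {
      p.load(new StringReader("greeting=" + escape(value)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return p.getProperty("greeting");
  }

  public static void main(String[] args) {
    // The word for "Japanese (language)", spelled with escapes so that this
    // source file itself stays pure ASCII.
    String japanese = "\u65e5\u672c\u8a9e";
    System.out.println(escape(japanese));
    System.out.println(roundTrip(japanese).equals(japanese));
  }
}
```

This prints `\u65e5\u672c\u8a9e` followed by `true`: the escaped form is plain ASCII, and `Properties.load` turns it back into the original three characters.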
Joachim Sauer
  • 1
    @user489041: I disagree: The right way to do this is to compile with `javac -encoding UTF-8`. No mess, no fuss. This is especially because 20 years on, Java still has no standard way to talk about code points by their official names. That means you are trying to insert evil and mysterious magic numbers in your code. That is not a good thing! Sure, I might rather see "\N{GREEK SMALL LETTER ALPHA}" than "α", but I **SURELY** do not want to see "\u03B1"! That’s just wicked. How are you going to maintain that kind of crudola? – tchrist Apr 23 '11 at 22:40
  • 3
    @Martin: 1.) strictly speaking "Unicode" is not an n-bit character set for any value of n. 2.) most Japanese characters fall into the Basic Multilingual Plane (BMP, the first 64k Unicode code points) and can be represented with just 4 hexadecimal digits and 3.) the Unicode escapes in Java use UTF-16, so if you have to represent anything outside the BMP, you'll have to use two \u escapes (with the correct surrogate values), which is incidentally what my code does, because a `char` is really a UTF-16 code unit and not a Unicode code point (those two are the same thing *iff* the character is in the BMP). – Joachim Sauer Aug 28 '12 at 11:42