I need to generate the hexadecimal codes of Java characters as strings, and parse those strings again later. I found here that parsing can be performed as follows:

char c = "\u041f".toCharArray()[0];

I was hoping for something more elegant like Integer.valueOf() for parsing.

How about generating the hexadecimal unicode properly?

Jérôme Verstrynge

3 Answers

This will generate a hex string representation of the char:

char ch = 'ö';
String hex = String.format("%04x", (int) ch);

And this will convert the hex string back into a char:

int hexToInt = Integer.parseInt(hex, 16);
char intToChar = (char)hexToInt;
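
Putting the two snippets together, a minimal round-trip sketch (the class and variable names here are my own):

```java
public class CharHexRoundTrip {
    public static void main(String[] args) {
        char ch = 'ö'; // U+00F6

        // char -> 4-digit, zero-padded, lowercase hex string
        String hex = String.format("%04x", (int) ch);
        System.out.println(hex); // prints "00f6"

        // hex string -> char
        char back = (char) Integer.parseInt(hex, 16);
        System.out.println(back == ch); // prints "true"
    }
}
```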
noel
  • First one gives me > Cannot cast from char[] to int – Machado Feb 25 '15 at 14:02
  • @Holmes I had no problem using openjdk 1.8.0_65 and javac 1.8.0_60. Either using the above, or `char c = '\u041f';` (which is П), or `'\u4e2d'` (which is 中). I couldn't compile with a Mahjong tile '' (which is outside the basic multilingual plane and thus not representable by a single char, so that is not surprising). – Eponymous Dec 11 '15 at 16:26

After doing some deeper reading, the javadoc says that the Character methods taking char parameters do not support all Unicode values, but those taking code points (i.e., int) do.

Hence, I have been performing the following test:

    int codePointCopyright = Integer.parseInt("00A9", 16);

    System.out.println(Integer.toHexString(codePointCopyright));
    System.out.println(Character.isValidCodePoint(codePointCopyright));

    char[] toChars = Character.toChars(codePointCopyright);
    System.out.println(toChars);

    System.out.println();

    int codePointAsian = Integer.parseInt("20011", 16);

    System.out.println(Integer.toHexString(codePointAsian));
    System.out.println(Character.isValidCodePoint(codePointAsian));

    char[] toCharsAsian = Character.toChars(codePointAsian);
    System.out.println(toCharsAsian);

and I am getting:

[Screenshot of console output: for the first code point, a9, true, and ©; for the second, 20011, true, and the corresponding supplementary character.]

Therefore, I should not talk about char in my question, but rather about arrays of chars, since a Unicode character may be represented by more than one char. On the other hand, an int covers them all.
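
A code-point-based round trip that handles supplementary characters might look like this (a sketch using only java.lang methods; the class name is mine):

```java
public class CodePointRoundTrip {
    public static void main(String[] args) {
        // U+20011 lies outside the Basic Multilingual Plane,
        // so it needs two chars (a surrogate pair) in a String.
        int cp = Integer.parseInt("20011", 16);

        // int code point -> String (two chars here)
        String s = new String(Character.toChars(cp));
        System.out.println(s.length());                      // 2 chars...
        System.out.println(s.codePointCount(0, s.length())); // ...but 1 code point

        // String -> int code point (reads the full surrogate pair)
        int back = s.codePointAt(0);
        System.out.println(Integer.toHexString(back)); // prints "20011"
    }
}
```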

Jérôme Verstrynge
  • Well, you're right to talk about char in your question; it's Java that's broken and forces the coder to meddle with strings at the encoding-detail level with respect to Unicode supplementary characters. – Basel Shishani Oct 26 '14 at 00:26
  • @BaselShishani Java isn't "broken". Unicode had no supplementary planes when Java first came out, and a char could handle any Unicode code point. Conversions between the various encodings and Java primitives can certainly be confusing sometimes, but efficiently representing all the characters for all the world's languages (and more) is inherently complex, and Unicode is still constantly evolving. There is a point at which you can't and shouldn't mask the complexities of processing Unicode data from the user. Don't blame Java for that. – skomisa Dec 20 '19 at 21:56
On the String level: the following uses an int code point rather than a char, which is needed for, say, Chinese, but works just as well for BMP characters.

    int cp = "\u041f".codePointAt(0);
    String s = new String(Character.toChars(cp));

On the native2ascii level: if you want to convert back and forth between \uXXXX escapes and Unicode characters, use StringEscapeUtils from Apache Commons Lang:

    String t = StringEscapeUtils.escapeJava(s + "ö");
    System.out.println(t);
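
For the reverse direction, Commons Lang also provides unescapeJava. A round-trip sketch, assuming the commons-lang3 artifact is on the classpath (newer projects may prefer the equivalent class in commons-text):

```java
import org.apache.commons.lang3.StringEscapeUtils;

public class EscapeRoundTrip {
    public static void main(String[] args) {
        String original = "\u041f\u00f6"; // "Пö"

        // Unicode characters -> \uXXXX escapes
        String escaped = StringEscapeUtils.escapeJava(original);
        System.out.println(escaped);

        // \uXXXX escapes -> Unicode characters
        String restored = StringEscapeUtils.unescapeJava(escaped);
        System.out.println(restored.equals(original)); // prints "true"
    }
}
```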

On the command line, the native2ascii tool can convert files back and forth between \u-escaped ASCII and, say, UTF-8.

Joop Eggen