
I'm trying to convert a Java string containing Unicode characters in the CJK Unified Ideographs Extension B block to decimal NCRs (numeric character references).

For example (you can try it with http://people.w3.org/rishida/tools/conversion/ ):

  • "游鍚堃" should convert to three decimal NCRs, one per character
  • "𧦧懷" should convert to &#162215;&#25079;

Here is what I tried (in Scala):

def charToHex(char: Char) = "&#%d;".format(char.toInt) // despite the name, this emits a decimal NCR
def stringToHex(string: String) = string.flatMap(charToHex)

println(stringToHex("游鍚堃"))  // three NCRs, one per character (correct)
println(stringToHex("𧦧懷"))    // &#55390;&#56743;&#25079; (the first two are surrogate halves, not valid NCRs)
println("𧦧懷".toCharArray().length) // why is it 3?

As you can see, it converts the first case correctly: three Unicode characters become three NCRs.

But in the second case, "𧦧懷", there are only two Unicode characters, yet Java/Scala seems to treat it as a string containing three characters.

So, what is happening here, and how can I convert the second case correctly, just like the converter on the site I mentioned? Thanks a lot.

Update:

  • My source code file is using UTF-8.
  • Here is the result of "𧦧懷".toCharArray():
    • char = ? (0xD85E, a high surrogate), char.toInt = 55390
    • char = ? (0xDDA7, a low surrogate), char.toInt = 56743
    • char = 懷, char.toInt = 25079

Now I think I know what happened. The character "𧦧" (U+279A7) is encoded as 0xD85E 0xDDA7 in UTF-16, which is 4 bytes instead of 2. So it takes two elements when converted to an array of char, because the char data type can only hold 2 bytes.
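This can be checked directly in the REPL (a small verification sketch, not from the original post; the escapes spell out 𧦧 as its two UTF-16 code units):

```scala
// "𧦧" (U+279A7) written as its two UTF-16 code units, followed by 懷 (U+61F7)
val s = "\ud85e\udda7\u61f7"

println(s.length)                              // 3: counts 16-bit chars
println(s.codePointCount(0, s.length))         // 2: counts Unicode code points
println(Integer.toHexString(s.codePointAt(0))) // 279a7
```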

Brian Hsu

3 Answers


Java (and therefore Scala) uses UTF-16 encoding for its strings, which means that every Unicode code point above 2^16 - 1 must be represented with two chars, a so-called surrogate pair. (The actual encoding scheme is a bit more involved than that.) Anyway, length is a method that operates at the lower level of chars, so it returns the number of chars.

If you want the number of code points, which is probably what you mean intuitively by "two unicode characters" (i.e. two symbols that print out), you need to use s.codePointCount(0, s.length). And if you want to convert those to NCRs, you need to work with code points, not Chars, since not all code points fit in a Char. My answer to this question contains Scala code to convert a string to code points. (Not with maximal efficiency; you'd want to rewrite it to use arrays/ArrayBuffer if you're doing heavy-duty text processing on large strings.)
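A minimal sketch of that approach (the name stringToDecNcr is mine, not from the linked answer): iterate by code point so a surrogate pair produces one NCR instead of two invalid ones.

```scala
// Build decimal NCRs per code point rather than per char.
def stringToDecNcr(s: String): String = {
  val sb = new StringBuilder
  var i = 0
  while (i < s.length) {
    val cp = s.codePointAt(i)    // full code point, even when it spans two chars
    sb.append("&#").append(cp).append(';')
    i += Character.charCount(cp) // advance 1 or 2 chars as needed
  }
  sb.toString
}

println(stringToDecNcr("\ud85e\udda7\u61f7")) // &#162215;&#25079;
```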

Rex Kerr

This is what is called a "surrogate pair" in Unicode speak. For instance,

"𧦧懷" foreach { c =>
  println(java.lang.Character.UnicodeBlock.of(c))
}

prints

HIGH_SURROGATES
LOW_SURROGATES
CJK_UNIFIED_IDEOGRAPHS
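
The two surrogate halves can also be recombined into the original code point (a small addition for completeness, not part of the answer above):

```scala
val hi = '\ud85e' // high surrogate half of 𧦧
val lo = '\udda7' // low surrogate half of 𧦧

assert(Character.isHighSurrogate(hi) && Character.isLowSurrogate(lo))
println(Character.toCodePoint(hi, lo)) // 162215, i.e. U+279A7
```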

BTW, I am based in Taiwan as well. If you are interested in Scala, we should get together and talk shop. My email is in my profile if you are interested.

Walter Chang

Check the file encoding. Your IDE or your build script must know whether the file is UTF-8 or UTF-16 (which one do you use?). If you include a BOM, check that it matches that encoding.
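For example, the source encoding can be passed to the compiler explicitly (a minimal sketch; the file name is hypothetical):

```shell
# Tell the Scala compiler that the source file is UTF-8
scalac -encoding UTF-8 Convert.scala
```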

Andrey