I'm trying to convert a Java string containing Unicode characters in the CJK Extension B plane to decimal NCRs (numeric character references).
For example (you could try it with http://people.w3.org/rishida/tools/conversion/ ):
- "游鍚堃" should convert to
&#28216;&#37722;&#22531;
- "𧦧懷" should convert to
&#162215;&#25079;
Here is what I tried (in Scala):
def charToHex(char: Char) = "&#%d;" format(char.toInt)
def stringToHex (string: String) = string.flatMap(charToHex)
println (stringToHex("游鍚堃")) // &#28216;&#37722;&#22531;
println (stringToHex("𧦧懷")) // &#55390;&#56743;&#25079;
println ("𧦧懷".toCharArray().length) // Why is it 3?
As you can see, it converts correctly in the first case: three Unicode characters become three NCRs.
But in the second case, "𧦧懷", there are only two Unicode characters, yet Java/Scala seems to treat the string as containing three characters.
So, what is happening here, and how can I convert the second case correctly, just like the converter on the site I mentioned? Thanks a lot.
Update:
- My source code file is using UTF-8.
- Here is the result of "𧦧懷".toCharArray():
char[] = ?, char.toInt = 55390
char[] = ?, char.toInt = 56743
char[] = 懷, char.toInt = 25079
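The two surrogate char values in the dump above should recombine into a single code point. A quick sanity check in Java (just a sketch; the class name is my own):

```java
public class SurrogateCheck {
    public static void main(String[] args) {
        char high = 55390; // 0xD85E, high (lead) surrogate
        char low  = 56743; // 0xDDA7, low (trail) surrogate
        int cp = Character.toCodePoint(high, low);
        System.out.println(cp);                      // prints 162215, i.e. 0x279A7
        System.out.println(Character.charCount(cp)); // prints 2: needs a surrogate pair
    }
}
```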
Now I think I know what happened. The character "𧦧" is encoded as the surrogate pair 0xD85E 0xDDA7 in UTF-16, which takes 4 bytes instead of 2. So it occupies 2 elements when the string is converted to an array of char, since the char type can only hold a single 16-bit (2-byte) code unit.
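Given that, one way to get the same output as the converter is to walk the string by Unicode code point rather than by char. A minimal Java sketch (the method name toDecimalNcr is made up):

```java
public class NcrConverter {
    // Convert each Unicode code point (not each UTF-16 char) to a decimal NCR,
    // so surrogate pairs are emitted as one reference instead of two broken ones.
    static String toDecimalNcr(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);        // reads a full surrogate pair if present
            sb.append("&#").append(cp).append(';');
            i += Character.charCount(cp);     // advances 2 for supplementary characters
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\uD85E\uDDA7\u61F7" is the "𧦧懷" string from the question
        System.out.println(toDecimalNcr("\uD85E\uDDA7\u61F7")); // prints &#162215;&#25079;
    }
}
```

On Java 8+ the same traversal is also available as a stream via `s.codePoints()`, which works from Scala as well since these are plain `java.lang.String` methods.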