
Another UTF-8 related problem: Chinese characters in Java sometimes become 3 bytes long when encoded with UTF-8. I don't know why; I thought the code points of all Chinese characters are 2 bytes wide, but when I manually try to detect that, it doesn't turn out that way either. Is there a way to detect the byte width (non-zero bytes) of a UTF-8 character?

import java.io.UnsupportedEncodingException;

public class a {

    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "我是一1"; // expected 7, actually 6
        String s1 = "一1";
        String s2 = "1";

        //String r1 = "\\p{InCJK_Compatibility}";
        //String r1 = "\\p{InCJK_Compatibility_Ideographs}";
        //String r1 = "\\p{Han}"; // unfortunately not supported in Java 6

        int cnt = 0;
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            if ((codepoint & 0xFF) > 0) cnt++;
            if ((codepoint & 0xFF00) > 0) cnt++;
            if ((codepoint & 0xFF0000) > 0) cnt++;
            if ((codepoint & 0xFF000000) > 0) cnt++;
            offset += Character.charCount(codepoint);
        }

        System.out.println(cnt);
    }
}
zinking
  • No Chinese character takes 2 bytes in UTF-8; they take 3 or 4 bytes. You probably confused UTF-8 with the GB/GBK/Big5/ShiftJIS/EUC families of encodings, which do have that property and are commonly used in Asia. – Karol S Aug 22 '14 at 12:07
  • @KarolS `U+6A5F is 機`: I assume 6A5F is the code point, citing from http://stackoverflow.com/questions/4596576/simplified-chinese-unicode-table. Even in my test case you can see that the ‘一’ character actually counts as 1 byte in my check; I am really confused. – zinking Aug 22 '14 at 15:05
  • U+6A5F in UTF-8 is `E6 A9 9F`. Also, your algorithm for counting UTF-8 bytes is wrong. First, `c=0x10000` won't trigger `(c&0xFF00)>0)`, or even `(c&0xFF)>0`. Second, places where the number of bytes actually goes up are not 0x100, 0x10000 and 0x1000000, but 0x80, 0x800 and 0x10000. – Karol S Aug 22 '14 at 15:13
  • And if you want to count codepoints, not bytes... then you should only do one `cnt++` in the loop, without any `if`s. `s` has exactly 4 codepoints and requires 10 bytes in UTF-8. – Karol S Aug 22 '14 at 15:15
  • Besides, I also fail to understand what you want to achieve. Can you explain what do you need and why do you need that? – Karol S Aug 22 '14 at 15:25
  • @KarolS your comment clarified a few things. Originally I was just trying to validate Chinese characters, but then I found something that didn't match my understanding, like the code point. One last thing: for the code point width of ‘一’, my code still prints 1 instead of the expected 2. I guess there is some problem with the code. – zinking Aug 22 '14 at 15:32
  • 一 (U+4E00) is one codepoint, which requires one UTF-16 code unit (known in Java as `char`; some less commonly used codepoints need two `char`s) or 3 bytes in UTF-8 (`E4 B8 80`; a codepoint can use between 1 and 4 bytes in UTF-8). Create a new question that explains what kind of validation you want and I'll be happy to help. – Karol S Aug 22 '14 at 15:37
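The byte boundaries from the comments above (0x80, 0x800 and 0x10000) can be written down directly; this is a minimal sketch, and the class and method names here are my own:

```java
public class Utf8ByteWidth {

    // UTF-8 byte count for a single code point, using the
    // boundaries 0x80, 0x800 and 0x10000 from the comments above
    static int utf8Bytes(int cp) {
        if (cp < 0x80)    return 1; // plain ASCII
        if (cp < 0x800)   return 2; // e.g. Latin supplements, Cyrillic
        if (cp < 0x10000) return 3; // includes the common CJK ideographs
        return 4;                   // supplementary planes
    }

    public static void main(String[] args) {
        System.out.println(utf8Bytes(0x6A5F)); // 機 -> 3
        System.out.println(utf8Bytes(0x4E00)); // 一 -> 3
        System.out.println(utf8Bytes('1'));    // ASCII -> 1
    }
}
```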

3 Answers


A UTF-8 encoded character can be one to four bytes long. One way to find the size of a UTF-8 character is to convert the char (string) to a byte array and check the array's length, if that's what you're asking:

myString.getBytes(Charset.forName("UTF-8")).length;
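Applied to the strings from the question (using `StandardCharsets.UTF_8` here, which avoids the checked exception), this gives the 10 bytes mentioned in the comments:

```java
import java.nio.charset.StandardCharsets;

public class Utf8TotalLength {
    public static void main(String[] args) {
        // three CJK characters at 3 bytes each, plus one ASCII digit
        System.out.println("我是一1".getBytes(StandardCharsets.UTF_8).length); // 10
        System.out.println("一1".getBytes(StandardCharsets.UTF_8).length);    // 4
        System.out.println("1".getBytes(StandardCharsets.UTF_8).length);      // 1
    }
}
```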
Boris B.

This shows the length in UTF-8 of each character in a string:

    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        int l = new String(Character.toChars(cp)).getBytes("UTF-8").length;
        System.out.println(l);
        i += Character.charCount(cp);
    }

To count the number of non-zero bytes in a code point, we can use this formula:

int l = (31 - Integer.numberOfLeadingZeros(x)) / 8 + 1;
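Note that this formula counts the non-zero bytes of the code point value itself, not the UTF-8 length; for 一 (U+4E00) it yields 2, not 3. A quick check (the helper name is mine):

```java
public class NonZeroBytes {

    // bytes needed to hold the raw code point value (not its UTF-8 length)
    static int nonZeroByteCount(int x) {
        return (31 - Integer.numberOfLeadingZeros(x)) / 8 + 1;
    }

    public static void main(String[] args) {
        System.out.println(nonZeroByteCount(0x4E00));  // 一 -> 2
        System.out.println(nonZeroByteCount(0x31));    // '1' -> 1
        System.out.println(nonZeroByteCount(0x10400)); // supplementary plane -> 3
    }
}
```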
Evgeniy Dorofeev
  • this returns 3 3 3 1 for my case, which fits my expectation: it outputs the encoded byte length, not the code point length. – zinking Aug 22 '14 at 09:58

Unicode is a numbering of characters, called code points, with values that fit in up to three bytes (U+0000 through U+10FFFF).

UTF-16 (UTF-16LE and UTF-16BE) uses two bytes per code unit, but some code points need a surrogate pair (4 bytes). A Java `char` is a single UTF-16 code unit, so it cannot in every case represent an entire Unicode code point.

UTF-8 uses one byte for plain ASCII (0 .. 127, seven bits). For higher code points it splits the bits of the code point over several bytes, where the high bits of each byte are fixed markers. The highest bit of those bytes is always 1, so they can never be mistaken for an ASCII character.

int byteCount(int codePoint) {
    int[] codePoints = { codePoint };
    String s = new String(codePoints, 0, codePoints.length);
    return s.getBytes(StandardCharsets.UTF_8).length; // needs java.nio.charset.StandardCharsets
}

This Java code is self-explanatory. The class `StandardCharsets` contains `Charset` constants for all encodings that are guaranteed to be available in every Java distribution, so there is no need to handle an `UnsupportedEncodingException`.
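Run over the string from the question, this approach prints the per-character UTF-8 lengths (3 3 3 1) that the comments report; a self-contained sketch with an arbitrary class name:

```java
import java.nio.charset.StandardCharsets;

public class ByteCountDemo {

    static int byteCount(int codePoint) {
        int[] codePoints = { codePoint };
        String s = new String(codePoints, 0, codePoints.length);
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "我是一1";
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.println(byteCount(cp)); // 3, 3, 3, then 1
            i += Character.charCount(cp);
        }
    }
}
```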

Joop Eggen
  • I assume you return `byteCount`, and that `getBytes` returns an array – zinking Aug 22 '14 at 09:53
  • returned 3,3,3,1 for my test case, but I think those Chinese characters should be only 2 bytes wide in terms of code point byte width. – zinking Aug 22 '14 at 10:12
  • Some Chinese characters do need 3 bytes, and even a 16-bit code point may be represented as three bytes: 1110.... + 10...... + 10...... See http://en.wikipedia.org/wiki/UTF-8 – Joop Eggen Aug 22 '14 at 10:22