Reverse Parse Multi-Byte

Question

I want to determine whether the last character in a buffer, defined as the bytes between begin and end, is English or Japanese. I read about UTF-8, where Japanese characters are two bytes long and always have 1 in the high bit of the high byte, whereas the low byte can have either 1 or 0 in its high bit.

I am trying to return the integer 2 for Japanese (2 bytes), 1 for English, and 0 if the data in the buffer is malformed.

The signature is public static int NumChars(byte begin, byte end). Can you point me in the right direction? I am confused about how to approach this. I was thinking about using XOR to check whether the MSB of the high byte is 1 and then return 2, but I doubt whether I have understood this correctly.

score 0 · Answer 1 · answered Apr 01 '21 at 16:54


Jeevan, a UTF-8 character can be 1 to 4 bytes long.

So if you want to print 2 for Japanese characters, use this encoding instead:

SJIS

Example:

String j = "大";     
System.out.println(j.getBytes("SJIS").length);
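To make the difference between the two encodings concrete, here is a minimal sketch that compares encoded byte lengths. The class and method names (EncodedLength, byteLength) are made up for illustration, and it assumes the JDK in use ships the Shift_JIS charset, which a stripped-down runtime may not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class EncodedLength {
    // Number of bytes the string occupies in the given charset.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // Assumption: Shift_JIS is available in this JDK.
        Charset sjis = Charset.forName("Shift_JIS");
        String kanji = "\u5927"; // the character 大
        System.out.println("UTF-8 bytes:     " + byteLength(kanji, StandardCharsets.UTF_8));
        System.out.println("Shift_JIS bytes: " + byteLength(kanji, sjis));
        System.out.println("ASCII 'a' bytes: " + byteLength("a", sjis));
    }
}
```

The same kanji takes three bytes in UTF-8 and two in Shift_JIS, so the "two bytes per Japanese character" rule from the question only holds if the buffer is Shift_JIS encoded.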


Raushan Kumar


the Hutt · Answer 2 · 2021-04-01T17:43:32.713

There is a discussion about this in this thread: guessing-the-encoding-of-text-represented-as-byte-in-java

If you can get the buffer, or part of it, in string form, you can use regular expressions to match the character sets like this:

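   import java.nio.charset.StandardCharsets;
   import java.util.Arrays;

   // Each pattern tests only the last character of the string.
   String english = ".*[\\x{20}-\\x{7E}]$";       // printable ASCII
   String hiragana = ".*[\\x{3041}-\\x{3096}]$";  // Hiragana range

   byte[] buffer = {97, 98, 99, -29, -127, -126}; // "abcあ" in UTF-8
   System.out.println("buffer: " + Arrays.toString(buffer));
   // StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException
   String s = new String(buffer, StandardCharsets.UTF_8);

   System.out.println(s + " is hiragana=" + s.matches(hiragana));
   System.out.println(s + " is english=" + s.matches(english));

   s = "abcd";
   System.out.println(s + " is hiragana=" + s.matches(hiragana));
   System.out.println(s + " is english=" + s.matches(english));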

Output:

buffer: [97, 98, 99, -29, -127, -126]
abcあ is hiragana=true
abcあ is english=false
abcd is hiragana=false
abcd is english=true

You will have to find out which Japanese character sets your program uses, such as Kanji, Hiragana, Katakana, etc. For more information, read this article: regular-expressions-for-japanese-text
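If the buffer cannot be turned into a string first, the last character can also be found at the byte level. The following is only a sketch under the assumption that the buffer holds UTF-8: it steps backward from the end over continuation bytes (bit pattern 10xxxxxx) until it reaches a lead byte, then checks that the lead byte announces exactly that many bytes. The names Utf8Tail and lastCharLength are hypothetical, and the check is structural only (it does not reject overlong forms or surrogate code points).

```java
// Hypothetical helper, for illustration only; assumes UTF-8 input.
final class Utf8Tail {
    /**
     * Returns the byte length (1 to 4) of the last UTF-8 character in
     * buf[begin, end), or 0 if the range is empty or its tail is malformed.
     */
    static int lastCharLength(byte[] buf, int begin, int end) {
        if (buf == null || begin < 0 || end > buf.length || begin >= end) {
            return 0;
        }
        int i = end - 1;
        int continuation = 0;
        // Step back over continuation bytes (10xxxxxx); at most 3 are legal.
        while (i >= begin && (buf[i] & 0xC0) == 0x80) {
            continuation++;
            i--;
            if (continuation > 3) {
                return 0;
            }
        }
        if (i < begin) {
            return 0; // only continuation bytes, no lead byte found
        }
        int lead = buf[i] & 0xFF;
        int expected;
        if (lead < 0x80) {
            expected = 1;            // 0xxxxxxx: ASCII
        } else if ((lead & 0xE0) == 0xC0) {
            expected = 2;            // 110xxxxx
        } else if ((lead & 0xF0) == 0xE0) {
            expected = 3;            // 1110xxxx
        } else if ((lead & 0xF8) == 0xF0) {
            expected = 4;            // 11110xxx
        } else {
            return 0;                // not a valid lead byte
        }
        // The lead byte must announce exactly the bytes we stepped over.
        return expected == continuation + 1 ? expected : 0;
    }

    public static void main(String[] args) {
        byte[] buffer = {97, 98, 99, -29, -127, -126}; // "abcあ" in UTF-8
        System.out.println(lastCharLength(buffer, 0, buffer.length));
        System.out.println(lastCharLength(buffer, 0, 3)); // "abc"
        System.out.println(lastCharLength(buffer, 0, 5)); // truncated あ
    }
}
```

Note that hiragana and most kanji occupy three bytes in UTF-8, so this returns 3 for them rather than 2; a caller that wants the question's 2/1/0 scheme could map any result above 1 to 2, and would still need a range check on the decoded code point to tell Japanese from other non-ASCII text.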
