
I ran into trouble converting a byte array to Korean characters in Java. Wikipedia states that 3 bytes are used for each character, but that not all bits are taken into account.

Is there a simple way of converting this very special... format? I don't want to write loops and counters that keep track of bits and bytes; that would get messy, and I can't imagine there is no simple solution. A native Java library would be perfect, or maybe someone has figured out some smart bit-shift logic.

UPDATE 2: A working solution has been posted by @DavidConrad below. I was wrong to assume the data is UTF-8 encoded.

UPDATE:

These bytes

[91, -80, -8, -69, -25, 93, 32, -64, -78, -80, -18, -73, -50]

should output this:

[공사] 율곡로

But using

new String(shortStrBytes,"UTF8"); // or
new String(shortStrBytes,StandardCharsets.UTF_8);

turns them to this:

[����] �����
The returned string has 50% more chars
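
The replacement characters above are what Java's lenient decoding produces for malformed input. As a minimal sketch (the class name `Utf8Check` and the helper `isValidUtf8` are hypothetical names chosen for illustration), a strict `CharsetDecoder` can be used to check whether a byte array is well-formed UTF-8 at all, instead of silently getting U+FFFD back:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {

    // Returns true only if the bytes are well-formed UTF-8.
    // With REPORT, malformed input raises an exception instead of
    // being replaced by U+FFFD as new String(bytes, UTF_8) does.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The bytes from the question
        byte[] data = {91, -80, -8, -69, -25, 93, 32, -64, -78, -80, -18, -73, -50};

        // Lenient decoding never fails; it substitutes U+FFFD for bad sequences
        String lenient = new String(data, StandardCharsets.UTF_8);
        System.out.println("contains U+FFFD: " + lenient.contains("\uFFFD"));

        // Strict decoding reports the problem
        System.out.println("valid UTF-8: " + isValidUtf8(data));
    }
}
```

If the strict check fails, the data is most likely in a different encoding, and no amount of bit shifting on the UTF-8 side will recover it.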
Yesyoor
  • What's wrong with `new String(koreanBytes, StandardCharsets.UTF_8)`? – David Conrad Feb 28 '22 at 17:15
  • It does not work: the result has 50% more chars than it should, and all chars are replaced by rectangles. I think 3 bytes are used per char, and that is why the result would be 50% longer using new String(koreanBytes, StandardCharsets.UTF_8) – Yesyoor Feb 28 '22 at 17:18
  • It doesn't sound like your data is actually UTF-8. You say "converting a byte array to UTF_8 korean chars" but what are you converting FROM? Also, chars are not UTF-8, they are UTF-16. Java always uses UTF-16 to represent Unicode internally. – David Conrad Feb 28 '22 at 17:20
  • @DavidConrad It's no longer the case that _"Java always uses UTF-16 to represent Unicode internally"_. See [this SO answer](https://stackoverflow.com/a/9699138/2985643) to the question "[What is the Java's internal representation for String? Modified UTF-8? UTF-16?](https://stackoverflow.com/q/9699071/2985643)". – skomisa Feb 28 '22 at 18:56
  • @skomisa Yes, I know that it can store it as Latin1 internally but that's an implementation detail that isn't really visible. If you access the chars, you still get 16-bit values back. – David Conrad Feb 28 '22 at 20:06
  • @DavidConrad a colleague told me that using UTF-8 in C# would decode it correctly. I did not verify it and trying to do the same in Java fails. It is binary data taken from a TPEG binary stream. – Yesyoor Mar 01 '22 at 11:19
  • @DavidConrad Huh??? You wrote _"Java always uses UTF-16 to represent Unicode internally"_ and then wrote _"I know that it can store it as Latin1 internally"_. – skomisa Mar 03 '22 at 04:51
  • @skomisa Yes, I shouldn't have said that. You're right. But from the user's point of view, you can only get chars (UTF-16) or ints (Unicode code points). You can never see the Latin1 code points. It's purely an optimization to save space. – David Conrad Mar 03 '22 at 23:35

2 Answers


Since you added the bytes to the question, I have done a little research and some experimenting, and I believe that the text you have is encoded as EUC-KR. I got the expected Korean characters when interpreting them as that encoding.

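// convert bytes to a Java String
// (Charset.forName avoids the checked UnsupportedEncodingException
// thrown by the String-name constructor)
byte[] data = {91, -80, -8, -69, -25, 93, 32, -64, -78, -80, -18, -73, -50};
String str = new String(data, Charset.forName("EUC-KR"));

// now convert String to UTF-8 bytes (HexFormat requires Java 17 or later)
byte[] utf8 = str.getBytes(StandardCharsets.UTF_8);
System.out.println(HexFormat.ofDelimiter(" ").formatHex(utf8));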

This prints the following hexadecimal values:

5b ea b3 b5 ec 82 ac 5d 20 ec 9c a8 ea b3 a1 eb a1 9c

This is the proper UTF-8 encoding of those Korean characters, and on a terminal that supports them, printing the string should display them properly, too.
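
For completeness, here is the same conversion as a self-contained class. This is only a sketch: the class name `EucKrToUtf8` and the helpers `decodeEucKr` and `toHex` are hypothetical, and the hex formatting is done by hand so that it does not depend on `HexFormat`. It assumes the JDK in use ships the EUC-KR charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EucKrToUtf8 {

    // Decode EUC-KR bytes into a Java String
    public static String decodeEucKr(byte[] data) {
        return new String(data, Charset.forName("EUC-KR"));
    }

    // Format bytes as lowercase hex pairs separated by spaces
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The bytes from the question
        byte[] data = {91, -80, -8, -69, -25, 93, 32, -64, -78, -80, -18, -73, -50};

        String str = decodeEucKr(data);
        byte[] utf8 = str.getBytes(StandardCharsets.UTF_8);

        System.out.println(str);
        System.out.println(toHex(utf8));
    }
}
```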

David Conrad
  • wow thank you very much, it works perfectly and is a clean and native solution. I did not know there existed a "EUC-KR" format. Where did you find the information? I also did some research, but obviously I didn't search in the right places. – Yesyoor Mar 02 '22 at 08:47
  • @Yesyoor I searched for encodings for Korean, and then I tried a few of them in Java. "ISO-2022-KR" didn't give good results; "EUC-KR" did. Those were the first two (other than Unicode) that I found, and the only ones I tried. You can [accept my answer](https://stackoverflow.com/help/someone-answers) by clicking the checkmark next to it. – David Conrad Mar 02 '22 at 14:24
  • oh yes I meant to accept the answer but only upvoted it :D has been done now, thanks again for your effort! – Yesyoor Mar 08 '22 at 10:49
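
The trial-and-error approach described in the comments above can be sketched roughly as follows. The class `CharsetProbe` and its `probe` method are hypothetical names; the candidate list is just the encodings mentioned in this thread, and charsets that the running JDK does not support are skipped.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.LinkedHashMap;
import java.util.Map;

public class CharsetProbe {

    // Strictly decode the data with each candidate charset and keep
    // only the candidates that decode without any malformed input.
    public static Map<String, String> probe(byte[] data, String... names) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                continue; // not available in this JDK
            }
            try {
                String text = Charset.forName(name).newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data))
                        .toString();
                results.put(name, text);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; skip it
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // The bytes from the question
        byte[] data = {91, -80, -8, -69, -25, 93, 32, -64, -78, -80, -18, -73, -50};
        probe(data, "UTF-8", "ISO-2022-KR", "EUC-KR")
                .forEach((name, text) -> System.out.println(name + ": " + text));
    }
}
```

A clean decode only narrows the candidates down. Single-byte charsets such as ISO-8859-1 accept any byte sequence, so the decoded text still has to be checked by eye against what is expected.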

You should use StandardCharsets.UTF_8. Converting from String to byte[] and vice versa:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Translater {

    // Decode UTF-8 bytes into a String
    public static String translateBytesToString(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    // Encode a String as UTF-8 bytes
    public static byte[] translateStringToBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        final String STRING = "[공사] 율곡로";
        final byte[] BYTES = {91, -22, -77, -75, -20, -126, -84, 93, 32, -20, -100, -88, -22, -77, -95, -21, -95, -100};

        String s = translateBytesToString(BYTES);
        byte[] b = translateStringToBytes(STRING);

        System.out.println("String: " + s);
        System.out.println("Bytes: " + Arrays.toString(b));
    }
}
  • hey thanks, I updated my question and added the original bytes. They are different from yours. Maybe I was wrong that UTF-8 is used, but it seems to have worked in C# using UTF_8. It does not work in Java though. Maybe the information given about C# was incorrect, as it was a colleague who tried it out there. – Yesyoor Mar 01 '22 at 11:13
  • Please take a look at the bytes I added to the question, your array is 18 bytes long, my array is 13 bytes long. Maybe this is giving a hint? I think the bytes are double encoded as described in Wikipedia. I wonder if there is a lib or native lib that can handle it. – Yesyoor Mar 01 '22 at 11:24
  • The problem is, the text is not actually UTF-8. – David Conrad Mar 01 '22 at 19:10
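
The 13 versus 18 byte difference noted in the comments is consistent with the two encodings: EUC-KR uses 2 bytes per Hangul syllable while UTF-8 uses 3, and the three ASCII characters take 1 byte in both. A small sketch to check this (the class `ByteCounts` and the helper `encodedLength` are hypothetical names, and it assumes the JDK ships the EUC-KR charset):

```java
import java.nio.charset.Charset;

public class ByteCounts {

    // Number of bytes the text occupies in the given charset
    public static int encodedLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // "[공사] 율곡로" written with Unicode escapes so the source
        // file encoding does not matter
        String text = "[\uACF5\uC0AC] \uC728\uACE1\uB85C";

        // 3 ASCII characters + 5 Hangul syllables
        System.out.println("chars:  " + text.length());
        // 3 * 1 + 5 * 2 bytes
        System.out.println("EUC-KR: " + encodedLength(text, "EUC-KR"));
        // 3 * 1 + 5 * 3 bytes
        System.out.println("UTF-8:  " + encodedLength(text, "UTF-8"));
    }
}
```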