Encoding variable-length utf8 byte array in Java

Question

I need to read a string that is in UTF-8 format, but its characters use a variable-length encoding, so I have trouble decoding the bytes to a string and I get weird characters when printing it. The characters seem to be Korean. This is the code I used, without success:

import java.nio.charset.StandardCharsets;

public static String byteToUTF8(byte[] bytes) {
    // StandardCharsets.UTF_8 is always available, so no
    // UnsupportedEncodingException handling or fallback path is needed.
    return new String(bytes, StandardCharsets.UTF_8);
}

I also tried UTF-16 and got slightly better results, but it still gave me strange characters, and according to the doc provided above I should use UTF-8.

Thanks in advance for helping.

EDIT:

Base64 value: S0QtOTI2IEdHMDA2AAAAAA==\n

Just a thought, the text might also be encoded improperly on the other end. Just for reference, [here](https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html) are Java's supported encodings and their internal names. — Mena, Nov 03 '16 at 15:43
I don't understand the page you linked to. Is that XML document the content you're trying to decode? — Sotirios Delimanolis, Nov 03 '16 at 15:47
@SotiriosDelimanolis it is the Bluetooth documentation; I am trying to read the `model number string` from a BLE service and its encoding has a problem. — M. Erfan Mowlaei, Nov 03 '16 at 15:54
I'm extremely confused. Can you base64 encode your `byte[]` and post that here for us to attempt to reproduce your issue? — Sotirios Delimanolis, Nov 03 '16 at 16:05
I don't know if that `\n` is part of the value, but it shouldn't be there. If we get rid of it, [I can't reproduce your issue](http://ideone.com/56XMM7). — Sotirios Delimanolis, Nov 03 '16 at 16:17
@SotiriosDelimanolis http://ideone.com/nxSYmX check this please. — M. Erfan Mowlaei, Nov 03 '16 at 16:27
@SotiriosDelimanolis I also updated the link to doc of attribute I want to get, sorry it was a wrong one at first. — M. Erfan Mowlaei, Nov 03 '16 at 16:42
@SotiriosDelimanolis look at the picture I provided; it's just another value that comes from the Bluetooth device. — M. Erfan Mowlaei, Nov 03 '16 at 16:52
I'm still really confused. Why are you trying to decode that value? You're trying to get some kind of serial id. Can you provide a [mcve]? — Sotirios Delimanolis, Nov 03 '16 at 16:58
Here are the hex values (bytes) of the field: SerialNumber= (0x) 08-7c-be-00-2A-67. I want to convert it to a string, and according to the doc it uses a variable-length UTF-8 encoding, but I can't convert it. Now you can reproduce it, I guess. — M. Erfan Mowlaei, Nov 03 '16 at 17:08
The byte sequence (0x) 08-7c-be-00-2A-67 is *clearly* not valid UTF-8, if anything it looks like a MAC. I'm pretty sure this is raw binary. — Durandal, Nov 03 '16 at 20:04
@Durandal it is byte array that I am getting from BLE device, I need to encode it to a valid String. — M. Erfan Mowlaei, Nov 04 '16 at 12:46
If I decode `S0QtOTI2IEdHMDA2AAAAAA==` in base 64, I get `KD-926 GG006`, I don't see any Korean characters — Nicolas Filotto, Nov 07 '16 at 08:51
Assuming that you print to the console, have a look at http://stackoverflow.com/questions/29695918/intellij-idea-console-issue — Nicolas Filotto, Nov 07 '16 at 08:58
@NicolasFilotto I am using both print and the debug tool as shown in the picture above, and I am getting garbage. As for `KD-926...`, yes, that is the extracted value and it is right, but it doesn't work for other variables in the picture, like `udi` and `softwareVersion`. `udi` is exactly the `org.bluetooth.characteristic.serial_number_string` coming from a standard BLE device. — M. Erfan Mowlaei, Nov 07 '16 at 09:58
for `softwareVersion`, I get `102` but indeed for `udi`, I get weird characters — Nicolas Filotto, Nov 07 '16 at 10:09
Please note that I believe you misunderstand the doc: `UTF-8` **is** a *variable-length encoding*, because depending on the character it is encoded in `1` to `4` bytes — Nicolas Filotto, Nov 07 '16 at 10:28
@NicolasFilotto I knew it is variable-size itself, but I am getting weird chars only on fields that have variable-size encoding. I mean, other fields in the BLE contract are constant-sized and give me reasonable output, but the fields that are documented as having variable-size encoding give weird results. — M. Erfan Mowlaei, Nov 07 '16 at 11:34
Your uid Base64 value converted to bytes is (in hex) 08 7C BE 00 2A 67. If you try to interpret that as a UTF-8 string you get 08 (backspace), which is not a good start, followed by "|" (7C). Next comes BE, which in UTF-8 is a continuation byte and cannot start a character, and it is followed by 00 instead of anything that would make a valid sequence. The last two bytes, 2A 67, taken together as one code unit would be the "identical with dot" mathematical character. As said in previous comments, model and softwareVersion can be decoded and interpreted as valid UTF-8 strings, except for the trailing \n and the AAAAAA==, which yields four zero bytes. — Serg M Ten, Nov 10 '16 at 10:59
@SergioMontoro yeah I already knew what you said, all values except uid are valid(ignoring those trailing chars) but I need to somehow find the right encoding and get valid value for uid. — M. Erfan Mowlaei, Nov 10 '16 at 12:34
Even if you try new String(byte[], "charset_name") with every supported Java encoding (see https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html), that first 08 in uid will do you no good. I think there is no encoding with which you can get a meaningful string out of Base64 CHy+ACpn — Serg M Ten, Nov 10 '16 at 13:59
@Sergio Montoro what you said may be the case, I have sent an email to company to verify it — M. Erfan Mowlaei, Nov 10 '16 at 17:57
I have a feeling this problem would be easily solved with a complete sample. It should only try to get the serial number from a Bluetooth device. It is clear to me that the data you are retrieving is either not the entire serial number, or is not a serial number at all. I'm flagging to close until you produce a complete sample. — mttdbrd, Nov 13 '16 at 15:58
@mttdbrd the thing is, you need to have access to this particular device to reproduce it. I can give you the raw bytes and the Base64-encoded string if you want... — M. Erfan Mowlaei, Nov 13 '16 at 20:22
In that case, I would write the simplest sample that demonstrates the problem with any device. If it works for other devices as expected, then the device is faulty. — mttdbrd, Nov 14 '16 at 01:19
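
The checks described in the comments above can be reproduced with a short sketch. It is not from the original post: the class name `BleStringCheck` and its helper methods are hypothetical, `java.util.Base64` is assumed to be available (Java 8+, or Android API 26+), and the 0x00 bytes after the text are assumed to be padding. It decodes the model value, then decodes the `udi` bytes strictly, so malformed UTF-8 is reported instead of being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

final class BleStringCheck {

    /** Strictly decodes UTF-8; returns null if the bytes are not valid UTF-8. */
    static String decodeStrictUtf8(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /** Cuts the array at the first 0x00 byte (assumption: zeros are padding). */
    static byte[] stripNulPadding(byte[] bytes) {
        int end = 0;
        while (end < bytes.length && bytes[end] != 0) {
            end++;
        }
        return Arrays.copyOf(bytes, end);
    }

    /** Formats bytes as dash-separated upper-case hex, e.g. 08-7C. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append('-');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Model value from the question, without the trailing "\n".
        byte[] model = Base64.getDecoder().decode("S0QtOTI2IEdHMDA2AAAAAA==");
        System.out.println(decodeStrictUtf8(stripNulPadding(model))); // KD-926 GG006

        // udi value quoted in the comments.
        byte[] udi = Base64.getDecoder().decode("CHy+ACpn");
        System.out.println(decodeStrictUtf8(udi)); // null: 0xBE cannot start a UTF-8 character
        System.out.println(toHex(udi));            // 08-7C-BE-00-2A-67
    }
}
```

Note that `new String(bytes, StandardCharsets.UTF_8)` never fails; it substitutes U+FFFD for malformed input, which is one way garbage ends up on screen. When strict decoding fails, showing the value as hex is a reasonable fallback until the vendor confirms the field's format.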

score 5 · Accepted Answer · edited May 23 '17 at 12:08

Bluetooth name display issue:

If you check the documentation for BluetoothAdapter's setName(), you will find the following:

https://developer.android.com/reference/android/bluetooth/BluetoothAdapter.html#setName

Valid Bluetooth names are a maximum of 248 bytes using UTF-8 encoding, although many remote devices can only display the first 40 characters, and some may be limited to just 20.
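
The limit quoted above is in bytes, not characters, so it depends on how many bytes each character takes in UTF-8. A minimal sketch of that check follows; the class `BluetoothNameLength` and its methods are hypothetical helpers for illustration, not part of the Android API.

```java
import java.nio.charset.StandardCharsets;

final class BluetoothNameLength {

    /** Byte limit taken from the setName() documentation quoted above. */
    static final int MAX_NAME_BYTES = 248;

    /** Number of bytes the name occupies once encoded as UTF-8. */
    static int utf8Length(String name) {
        return name.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if the UTF-8 encoding of the name fits in the documented limit. */
    static boolean fits(String name) {
        return utf8Length(name) <= MAX_NAME_BYTES;
    }
}
```

For example, `utf8Length("KD-926")` is 6 because ASCII characters take one byte each, while `utf8Length("\uD55C\uAE00")` (two Hangul syllables) is also 6 because each of them takes three bytes.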

Android Supported Versions:

If you check the link https://stackoverflow.com/a/7989085/2293534, you will find the list of supported Android versions.

Supported and unsupported locales are given in the table:

-----------------------------------------------------------------------------------------------------
             | DEC Korean | Korean EUC | ISO-2022-KR | KSC5601/cp949 | UCS-2/UTF-16 | UCS-4 | UTF-8 |
-----------------------------------------------------------------------------------------------------
 DEC Korean  |      -     |      Y     |     N       |      Y        |        Y     |   Y   |   Y   |
-----------------------------------------------------------------------------------------------------
 Korean EUC  |      Y     |      -     |     Y       |      N        |        N     |   N   |   N   |
-----------------------------------------------------------------------------------------------------
 ISO-2022-KR |      N     |      Y     |     -       |      Y        |        N     |   N   |   N   |
-----------------------------------------------------------------------------------------------------
KSC5601/cp949|      Y     |      N     |     Y       |      -        |        Y     |   Y   |   Y   |
-----------------------------------------------------------------------------------------------------
 UCS-2/UTF-16|      Y     |      N     |     N       |      Y        |        -     |   Y   |   Y   |
-----------------------------------------------------------------------------------------------------
    UCS-4    |      Y     |      N     |     N       |      Y        |        Y     |   -   |   Y   |
-----------------------------------------------------------------------------------------------------
    UTF-8    |      Y     |      N     |     N       |      Y        |        Y     |   Y   |   -   |
-----------------------------------------------------------------------------------------------------

For solution,

Solution#1:

Michael has given a great example for conversion. For more you can check https://stackoverflow.com/a/40070761/2293534

When you call getBytes(), you are getting the raw bytes of the string encoded under your system's native character encoding (which may or may not be UTF-8). Then, you are treating those bytes as if they were encoded in UTF-8, which they might not be.

A more reliable approach would be to read the ko_KR-euc file into a Java String. Then, write out the Java String using UTF-8 encoding.
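InputStream in = ...
// Decode with the charset the data was actually written in. "EUC-KR" assumes
// Korean EUC input; substitute the real charset of your data.
Reader reader = new InputStreamReader(in, "EUC-KR");
StringBuilder sb = new StringBuilder();
int read;
// read() returns one char at a time, or -1 at end of stream
while ((read = reader.read()) != -1) {
  sb.append((char) read);
}
reader.close();

String string = sb.toString();

OutputStream out = ...
// Encode the decoded text as UTF-8 on the way out
Writer writer = new OutputStreamWriter(out, "UTF-8");
writer.write(string);
writer.close();
N.B.: You should, of course, use the correct encoding name for your input.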
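
The read-then-write approach above can be sketched in self-contained form using in-memory streams. The class `Transcoder` and its methods are hypothetical names for illustration, and the source charset is passed in by the caller because the right one depends on the data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

final class Transcoder {

    /** Decodes everything from 'in' using 'from' and re-encodes it to 'out' using 'to'. */
    static void transcode(InputStream in, Charset from, OutputStream out, Charset to)
            throws IOException {
        // try-with-resources closes (and flushes) both streams, even on failure.
        try (Reader reader = new InputStreamReader(in, from);
             Writer writer = new OutputStreamWriter(out, to)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    /** Convenience overload for byte arrays; wraps IOException, which is not expected in memory. */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            transcode(new ByteArrayInputStream(input), from, out, to);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

A call such as `Transcoder.transcode(bytes, Charset.forName("EUC-KR"), StandardCharsets.UTF_8)` would convert Korean EUC data, assuming the JVM in use ships that charset. This only helps when the bytes really are text in the source charset, which the comments on the question put in doubt for the `udi` field.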

Solution#2:

You can do it using StringUtils: https://stackoverflow.com/a/30170431/2293534

Solution#3:

You can use Apache Commons IO for conversion. A good example is given here: http://www.utdallas.edu/~lmorenoc/research/icse2015/commons-io-2.4/examples/toString_49.html

import org.apache.commons.io.IOUtils;

String resource; // classpath name of the resource; must be assigned before use
// getClass().getResourceAsStream(resource) -> the InputStream to read from
// "UTF-8" -> the encoding to use, null means platform default
String text = IOUtils.toString(getClass().getResourceAsStream(resource), "UTF-8");

Thanks I'll check and inform you, though the solution shouldn't be locale specific. — M. Erfan Mowlaei, Nov 07 '16 at 08:09

score 2 · Answer 2 · answered Nov 07 '16 at 01:31

I suggest you use StringUtils from the Apache Commons libraries. I believe the methods you need are documented here:

https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/StringUtils.html

Nikolaj Hansen


I've seen this utility before and neglected it because of the overhead of the library, but I'll give it a try and let you know the result. Note that the source should be byte[], and I guess converting it first to Base64 or something else before encoding to UTF-8 probably ruins everything. – M. Erfan Mowlaei Nov 07 '16 at 05:03
Then your string is not UTF-8 – Nikolaj Hansen Nov 11 '16 at 14:49
It should be, at least according to the docs, but there may be problems at the company that used it to feed the BLE device. – M. Erfan Mowlaei Nov 11 '16 at 15:11