
I'm trying to convert a byte array to a string in Java with the following code:

byte[] myArray = {25, -50, -86, 81, 47, 44, 97, -5, 69, -4, 87, -114, -47, 62, -113, -64, 58, -32, -121, -102, 53, -89, -122, 12, -2, -23, -127, 111, -100, 53, -87, -23, -44, -28, 4, -21, -42, 75, 87, -112, -38, 118, 54, 92, -116, 4, -118, 110, -87, 7, -13, 3, -72, -63, -69, 123, 92, 94, 56, 61, 120, -52, 98, -17, 5, 41, 101, -3, 121, 81, -90, 12, -35, -21, -24, 112, -94, 123, 62, 8, 27, 54, 107, -77, 64, 8, -102, -99, -1, 119, 127, 43, 12, -31, -1, 51, -15, 83, -4, -68, -30, 91, -104, 84, 18, -122, -120, 66, 116, -17, -101, -24, 105, -112, -116, -64, -108, 112, -35, 61, 66, 100, 5, -24, -26, -44, 81, -84}; // Bytes from Byte.MIN_VALUE to Byte.MAX_VALUE
String result = new String(myArray, StandardCharsets.UTF_8);

The problem is that I get a different result when I run the code on Windows (JVM 1.8.0_112) than when I run it on my Android device (tested on Android 5.1 and 6.0). I'm testing with a byte array of length 128; on Android I get a String of length 120, while on Windows I get a String of length 125. I'm guessing it has something to do with some of the bytes not being valid UTF-8 sequences, but it is still weird that I get different results depending on the platform.

If I change the encoding to US-ASCII, I get the same result on both platforms, as expected:

String result = new String(myArray, StandardCharsets.US_ASCII);

Edit: Sorry about the confusion. I'm not generating it at random every time. I just mean that the bytes don't have a meaningful UTF-8 value. This is the byte array that I'm using to test:

System.out.println(Arrays.toString(myArray)): [25, -50, -86, 81, 47, 44, 97, -5, 69, -4, 87, -114, -47, 62, -113, -64, 58, -32, -121, -102, 53, -89, -122, 12, -2, -23, -127, 111, -100, 53, -87, -23, -44, -28, 4, -21, -42, 75, 87, -112, -38, 118, 54, 92, -116, 4, -118, 110, -87, 7, -13, 3, -72, -63, -69, 123, 92, 94, 56, 61, 120, -52, 98, -17, 5, 41, 101, -3, 121, 81, -90, 12, -35, -21, -24, 112, -94, 123, 62, 8, 27, 54, 107, -77, 64, 8, -102, -99, -1, 119, 127, 43, 12, -31, -1, 51, -15, 83, -4, -68, -30, 91, -104, 84, 18, -122, -120, 66, 116, -17, -101, -24, 105, -112, -116, -64, -108, 112, -35, 61, 66, 100, 5, -24, -26, -44, 81, -84]

Edit 2: The Windows result:

System.out.println(String(myArray, StandardCharsets.UTF_8)).length: 125
System.out.println(String(myArray, StandardCharsets.UTF_8)): ΪQ/,a�E�W��>��:���5����o�5������KW��v6\��n�����{\^8=x�b�)e�yQ����p�{6k����w+��3�S���[�T��Bt��i����p�=Bd���Q�
System.out.println(toUnicode(String(myArray, StandardCharsets.UTF_8))): \u0019\u03aa\u0051\u002f\u002c\u0061\ufffd\u0045\ufffd\u0057\ufffd\ufffd\u003e\ufffd\ufffd\u003a\ufffd\ufffd\ufffd\u0035\ufffd\ufffd\u000c\ufffd\ufffd\u006f\ufffd\u0035\ufffd\ufffd\ufffd\ufffd\u0004\ufffd\ufffd\u004b\u0057\ufffd\ufffd\u0076\u0036\u005c\ufffd\u0004\ufffd\u006e\ufffd\u0007\ufffd\u0003\ufffd\ufffd\ufffd\u007b\u005c\u005e\u0038\u003d\u0078\ufffd\u0062\ufffd\u0005\u0029\u0065\ufffd\u0079\u0051\ufffd\u000c\ufffd\ufffd\ufffd\u0070\ufffd\u007b\u003e\u0008\u001b\u0036\u006b\ufffd\u0040\u0008\ufffd\ufffd\ufffd\u0077\u007f\u002b\u000c\ufffd\ufffd\u0033\ufffd\u0053\ufffd\ufffd\ufffd\u005b\ufffd\u0054\u0012\ufffd\ufffd\u0042\u0074\ufffd\ufffd\u0069\ufffd\ufffd\ufffd\ufffd\u0070\ufffd\u003d\u0042\u0064\u0005\ufffd\ufffd\ufffd\u0051\ufffd

The Android result:

System.out.println(String(myArray, StandardCharsets.UTF_8)).length: 120
System.out.println(String(myArray, StandardCharsets.UTF_8)): ΪQ/,a�E�W��>��:ǚ5����o�5������KW��v6\��n���{{\^8=x�b�)e�yQ����p�{>6k�@���w+�
System.out.println(toUnicode(String(myArray, StandardCharsets.UTF_8))): \u0019\u03aa\u0051\u002f\u002c\u0061\ufffd\u0045\ufffd\u0057\ufffd\ufffd\u003e\ufffd\ufffd\u003a\u01da\u0035\ufffd\ufffd\u000c\ufffd\ufffd\u006f\ufffd\u0035\ufffd\ufffd\ufffd\ufffd\u0004\ufffd\ufffd\u004b\u0057\ufffd\ufffd\u0076\u0036\u005c\ufffd\u0004\ufffd\u006e\ufffd\u0007\ufffd\u0003\ufffd\u007b\u007b\u005c\u005e\u0038\u003d\u0078\ufffd\u0062\ufffd\u0005\u0029\u0065\ufffd\u0079\u0051\ufffd\u000c\ufffd\ufffd\ufffd\u0070\ufffd\u007b\u003e\u0008\u001b\u0036\u006b\ufffd\u0040\u0008\ufffd\ufffd\ufffd\u0077\u007f\u002b\u000c\ufffd\ufffd\u0033\ufffd\u0053\ufffd\ufffd\u005b\ufffd\u0054\u0012\ufffd\ufffd\u0042\u0074\ufffd\ufffd\u0069\ufffd\ufffd\u0014\u0070\ufffd\u003d\u0042\u0064\u0005\ufffd\ufffd\ufffd\u0051\ufffd

Edit 3: Added the correct UTF-16 strings

Edit 4: Changed code to working example

jesm00
  • Are you using the same `myArray` on both platforms? – Steve Smith Mar 24 '17 at 16:26
  • Yes, I have printed it in both platforms and it is exactly the same – jesm00 Mar 24 '17 at 16:29
  • Let me see if I follow: You generate a ***random*** byte array, and wonder why the results differ every time you run the code? Hmmmm..... Now, if you meant an *arbitrary*, but well-defined/fixed, byte array, used repeatedly, that would be different. If that's the case, then please show us the byte array, or better yet, give us a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) – Andreas Mar 24 '17 at 16:31
  • Maybe one of the platforms can't handle converting some values of byte to a character? The String `result` must look quite strange. – Steve Smith Mar 24 '17 at 16:34
  • @Andreas The OP is also wondering why US-ASCII doesn't produce different results instead, so if that's true maybe he/she is not generating random values at every run. – SantiBailors Mar 24 '17 at 16:36
  • Yeah, I worded the question the wrong way. I think it is clearer now what I'm doing – jesm00 Mar 24 '17 at 16:39
  • `Charset.forName("UTF-8")` has somehow become `StandardCharsets.UTF_16`. If your tests are that consistent, you should not be surprised about different results… – Holger Mar 24 '17 at 16:40
  • @Holger Someone asked me to print it in UTF-16 (along with the byte array I'm using) – jesm00 Mar 24 '17 at 16:41
  • A Java string is a sequence of UTF-16 code units. To see what they are unambiguously they can be written in literal form like this: "\u0019\u03AA\u0051\u002F\u002C\u0061\uFFFD\u0045\uFFFD…" – Tom Blodget Mar 24 '17 at 16:54

2 Answers


It seems Android is a bit sloppy when interpreting UTF-8 sequences. The relevant part of the standard is D92 in chapter 3, “Conformance”:

Before the Unicode Standard, Version 3.1, the problematic “non-shortest form” byte sequences in UTF-8 were those where BMP characters could be represented in more than one way. These sequences are ill-formed, because they are not allowed by Table 3-7.

Your input contains such “non-shortest form” sequences, e.g. -32, -121, -102 (0xE0 0x87 0x9A) and -63, -69 (0xC1 0xBB). While Android decodes each of these sequences into a single character, Java correctly rejects them and converts each byte of the malformed input into one replacement character, hence the longer string.

You can demonstrate it in Java using a parser that interprets “Modified UTF-8”:

byte[][] samples = {
    { -32, -121, -102 },
    { -63, -69 }
};
for(byte[] array: samples) {
    System.out.println("source: "+Arrays.toString(array));
    String string = new String(array, StandardCharsets.UTF_8);
    System.out.println("strictly interpreted: "+string);
    System.out.println("length: "+string.length());
    ByteBuffer bb = ByteBuffer.allocate(array.length+2);
    bb.putShort((short)array.length).put(array);
    ByteArrayInputStream bis = new ByteArrayInputStream(bb.array());
    DataInputStream dis = new DataInputStream(bis);
    string = dis.readUTF();
    System.out.println("sloppily interpreted: "+string);
    System.out.println("length: "+string.length());
    byte[] actual = string.getBytes(StandardCharsets.UTF_8);
    System.out.println("correct sequence: "+Arrays.toString(actual));
    System.out.println();
}

which will print

source: [-32, -121, -102]
strictly interpreted: ���
length: 3
sloppily interpreted: ǚ
length: 1
correct sequence: [-57, -102]

source: [-63, -69]
strictly interpreted: ��
length: 2
sloppily interpreted: {
length: 1
correct sequence: [123]

It also shows the correct “shortest form” sequences of the characters.
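If silent replacement characters are not wanted at all, one option on a standard JDK is a `CharsetDecoder` configured to report malformed input, so that ill-formed sequences raise an exception instead of being substituted. The following is only a minimal sketch: the class name `StrictUtf8Demo` and the helper `decodeStrict` are made up for illustration, and how Android's runtime behaves with this decoder has not been verified here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StrictUtf8Demo {
    // Hypothetical helper: decodes UTF-8 and throws on ill-formed input
    // instead of substituting U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        byte[][] samples = {
            { -50, -86 },          // 0xCE 0xAA, a well-formed two-byte sequence
            { -32, -121, -102 },   // 0xE0 0x87 0x9A, non-shortest form
            { -63, -69 }           // 0xC1 0xBB, non-shortest form
        };
        for (byte[] sample : samples) {
            try {
                String s = decodeStrict(sample);
                System.out.println(Arrays.toString(sample) + " accepted, length " + s.length());
            } catch (CharacterCodingException e) {
                System.out.println(Arrays.toString(sample) + " rejected: " + e);
            }
        }
    }
}
```

With this approach the caller decides what to do with bad input, rather than comparing string lengths after the fact.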

Holger
  • It's a bit ironic that **Standard**Charsets.UTF_8 uses _modified_ UTF-8. – Tom Blodget Mar 24 '17 at 19:45
  • 1
    @Tom Blodget: in Java, it doesn’t, i.e `new String(array, StandardCharsets.UTF_8)` interprets strictly. It’s `DataInputStream` which uses modified UTF-8 (and documents that). In earlier Java versions, `StandardCharsets.UTF_8`, resp. `"UTF-8"`, did indeed use modified UTF-8, which has been fixed (see also [here](http://stackoverflow.com/a/25404994/2711488)). – Holger Mar 24 '17 at 19:52

There are a few differences in the output strings. The first corresponds to the input byte sequence 0xE0 0x87 0x9A. The correct decoding is an exception or replacement character(s). (Should it be one, two, or three replacement characters? I'd argue two, which is what the .NET decoder on my machine gives. But, I prefer exceptions in most cases, anyway.)

Your Android JVMs are interpreting that as U+01DA. That is probably "correct" mathematically for an algorithm that performs insufficient checks for invalid sequences.
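To make the "mathematically correct" part concrete, here is a rough sketch of the bit arithmetic a decoder would perform if it simply combined the payload bits and skipped the minimum-value check. The class and method names are invented for illustration; this is not taken from Android's actual source.

```java
class NaiveUtf8Arithmetic {
    // Combine a two-byte sequence 110xxxxx 10xxxxxx without any validity checks.
    static int decode2(int b0, int b1) {
        return ((b0 & 0x1F) << 6) | (b1 & 0x3F);
    }

    // Combine a three-byte sequence 1110xxxx 10xxxxxx 10xxxxxx without any validity checks.
    static int decode3(int b0, int b1, int b2) {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    public static void main(String[] args) {
        int three = decode3(0xE0, 0x87, 0x9A);
        int two = decode2(0xC1, 0xBB);
        System.out.printf("E0 87 9A -> U+%04X%n", three); // U+01DA
        System.out.printf("C1 BB    -> U+%04X%n", two);   // U+007B, i.e. '{'

        // The check a conforming decoder adds: a three-byte form must encode at
        // least U+0800 and a two-byte form at least U+0080. Both samples fail it,
        // which is what makes them "non-shortest form".
        System.out.println("three-byte form is shortest: " + (three >= 0x800)); // false
        System.out.println("two-byte form is shortest:   " + (two >= 0x80));    // false
    }
}
```

The results match the `\u01da` and `\u007b` that appear in the Android output where the Windows output has replacement characters.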

Tom Blodget