Converting ByteArray to string and back produces different string

Question

I have to store a huge list of booleans, and I chose to store them as a byte array converted to a string. But I can't understand why converting to a string and back produces different values:

Support methods:

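  // Renders the bytes as space-separated bit groups, most significant byte first.
  // Joining with " " keeps a separator between consecutive bytes.
  fun ByteArray.string(): String =
    reversed().joinToString(" ") { intToString(it, 4) }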

  fun intToString(number: Byte, groupSize: Int): String {
    val result = StringBuilder()

    for (i in 7 downTo 0) {
      val mask = 1 shl i
      result.append(if (number.toInt() and mask != 0) "1" else "0")

      if (i % groupSize == 0)
        result.append(" ")
    }
    result.replace(result.length - 1, result.length, "")

    return result.toString()
  }

First example:

Given selected indices [0, 14], my code produces the bytes [1, 64], and .string() gives:

0100 0000 0000 0001

Convert it to string and back:

array.toString(Charsets.UTF_8).toByteArray(Charsets.UTF_8)

Result: [1, 64], .string() produces:

0100 0000 0000 0001

Second example:

Given selected indices [0, 15], my code produces the bytes [1, -128], and .string() gives:

1000 0000 0000 0001

That looks valid. Now convert it to a string and back.

It produces an array of 4 bytes: [1, -17, -65, -67], .string() produces:

1011 1101 1011 1111 1110 1111 0000 0001

That doesn't look like indices [0, 15] or bytes [1, -128] to me :)

How can this happen? I suspect the last "1" in "1000 0000 0000 0001" may be causing the issue, but I still don't know why.

Thanks.

P.S. Added java tag to the question, because I think the answer is the same for both kotlin and java.

*I chose to store them as byte array as string* - may I ask you **why**? — zlakad, Apr 02 '18 at 19:47
Need to store huge amount of booleans with unknown upper limit, so storing them as part of simple int is impossible, but string can be unlimited :) — Anton Shkurenko, Apr 02 '18 at 19:49
Your question is already answered: https://stackoverflow.com/questions/1536054/how-to-convert-byte-array-to-string-and-vice-versa — zlakad, Apr 02 '18 at 19:52
Even Base64 should be more efficient than using 1 character per bit — jrtapsell, Apr 02 '18 at 19:55
@jrtapsell it's not 1 character per bit. It's eight bits per byte. Converting to the readable string is just an example — Anton Shkurenko, Apr 02 '18 at 19:57
Consider using [BigInteger](https://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html) with BigInteger(String val, 2) and toString(2). You can manipulate it with the bit operators. Depending on what you do, this is more memory efficient. — leonardkraemer, Apr 02 '18 at 20:00

that other guy · Accepted Answer · 2018-04-02T20:01:11.003

Here's an MCVE for your problem (in Java):

import java.nio.charset.*;

class Test {
  public static void main(String[] args) {
    byte[] array = { -128 };
    byte[] convertedArray = new String(array, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    for(int i=0; i<convertedArray.length; i++) {
      System.out.println(convertedArray[i]);
    }
  }
}

Expected output:

-128

Actual output:

-17
-65
-67

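This happens because the byte -128 (0x80) on its own is not a valid UTF-8 sequence, so the decoder replaces it with the Unicode replacement character U+FFFD "�". Encoding U+FFFD back to UTF-8 yields the three bytes 0xEF 0xBF 0xBD, which print as -17, -65, -67.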
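The substitution can be made visible by inspecting the decoded string before re-encoding it. The following is a minimal sketch; the class name `ReplacementCharDemo` and the helper `utf8RoundTrip` are hypothetical names chosen for illustration, not part of the question's code.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
  // Hypothetical helper: decode the bytes as UTF-8, then encode the result back to UTF-8.
  static byte[] utf8RoundTrip(byte[] bytes) {
    String decoded = new String(bytes, StandardCharsets.UTF_8);
    return decoded.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] array = { (byte) 0x80 }; // same value as -128

    // A lone 0x80 is malformed UTF-8, so the decoder substitutes U+FFFD.
    String decoded = new String(array, StandardCharsets.UTF_8);
    System.out.println(Integer.toHexString(decoded.charAt(0))); // fffd

    // U+FFFD is then encoded as three bytes, so the original byte is lost.
    for (byte b : utf8RoundTrip(array)) {
      System.out.println(b); // -17, -65, -67
    }
  }
}
```

Bytes in the range 0 to 127 are plain ASCII and survive the round trip, which is why [1, 64] in the first example came back unchanged.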

You can instead encode and decode the string as ISO-8859-1 (Latin-1), since every byte value is valid in that encoding. ISO-8859-1 has the convenient property that each byte value corresponds directly to the Unicode code point with the same number, so 0x80 is decoded to U+0080, 0xFF to U+00FF, and so on.
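As a rough sketch of that suggestion, the round trip below uses `StandardCharsets.ISO_8859_1` in both directions. The class name `Latin1RoundTrip` and the helpers `bytesToString` and `stringToBytes` are hypothetical names introduced here for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTrip {
  // Hypothetical helper: map each byte to the char with the same numeric value (0..255).
  static String bytesToString(byte[] bytes) {
    return new String(bytes, StandardCharsets.ISO_8859_1);
  }

  // Hypothetical helper: inverse of bytesToString for strings it produced.
  static byte[] stringToBytes(String s) {
    return s.getBytes(StandardCharsets.ISO_8859_1);
  }

  public static void main(String[] args) {
    byte[] array = { 1, -128 }; // the bytes from the second example

    String stored = bytesToString(array);
    byte[] restored = stringToBytes(stored);

    System.out.println(Arrays.toString(restored)); // [1, -128]
    System.out.println((int) stored.charAt(1));    // 128, i.e. U+0080
  }
}
```

The same charset has to be used for both directions. Decoding with ISO-8859-1 and encoding with UTF-8 (or the reverse) would change the bytes again.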

Yeah, this is what I see! But how to avoid that? — Anton Shkurenko, Apr 02 '18 at 19:56
