3
byte bytes[] = new byte[16];
random.nextBytes(bytes);
try {
   return new String(bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
   log.warn("Hash generation failed", e);
}

When I generate a String with the given method and then call string.getBytes().length, it returns some other value; the max I saw was 32. Why does a 16-byte array end up producing a string with a different byte length?

But if I do string.length() it returns 16.

dinesh707
  • Try several times. Or try "string.getBytes().length", not "string.length()" – dinesh707 Apr 20 '15 at 14:09
  • 2
    Wait, what are you trying to do? You are mixing up bytes and chars; there is _no_ 1-to-1 mapping between the two. This looks like an XY problem, so please explain what you want to do instead. – fge Apr 20 '15 at 14:21

6 Answers

6

This is because your bytes are first decoded into a String, which attempts to interpret them as a UTF-8 byte sequence. If a byte cannot be treated as an ASCII char, nor combined with the following byte(s) to form a legal UTF-8 sequence, it is replaced by "�" (U+FFFD). That char is encoded as 3 bytes when you call String#getBytes(), thus adding 2 extra bytes to the resulting output.

If you're lucky enough to generate ASCII chars only, String#getBytes() will return a 16-byte array; if not, the resulting array may be longer. For example, the following code snippet:

byte[] b = new byte[16];
Arrays.fill(b, (byte) 190); // 0xBE: a lone UTF-8 continuation byte, always malformed
b = new String(b, "UTF-8").getBytes("UTF-8");

returns an array that is 48(!) bytes long.
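A self-contained version of that snippet, pinning both the decoding and the re-encoding to UTF-8 so the result does not depend on the platform default charset:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementDemo {
    public static void main(String[] args) {
        byte[] b = new byte[16];
        // 0xBE is a UTF-8 continuation byte; on its own it is always malformed
        Arrays.fill(b, (byte) 190);
        String s = new String(b, StandardCharsets.UTF_8);
        // each malformed byte decodes to U+FFFD, which re-encodes as 3 bytes
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 48
    }
}
```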

Alex Salauyou
  • 1
    Your answer is confusing. "If you're lucky to generate ASCII chars only, where one byte is mapped to one char, result will be 32." This makes no sense. If you create a byte array with the following ASCII values `[108, 111, 118, 101, 108, 121, 32, 32, 67, 65, 70, 69, 66, 65, 66, 69]` and create a new string in Latin-1 (`new String(b, "ISO-8859-1").getBytes().length`) or UTF-8 (`new String(b, "UTF-8").getBytes().length`), both have a length of 16 bytes or 16 chars (`... .toCharArray().length`). – SubOptimal Apr 20 '15 at 22:09
3

The generated bytes might contain valid multibyte characters.

Take this as an example. The string contains only one character, but its byte representation takes three bytes.

String s = "Ω";
System.out.println("length = " + s.length());
System.out.println("bytes = " + Arrays.toString(s.getBytes("UTF-8")));

String.length() returns the length of the string in characters. The string holds a single character, yet that character is 3 bytes long in UTF-8.
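To pin the effect down with an explicit code point (I'm using the euro sign U+20AC here as a stand-in, since it is unambiguously a three-byte character in UTF-8):

```java
import java.nio.charset.StandardCharsets;

public class MultiByteDemo {
    public static void main(String[] args) {
        String s = "\u20AC"; // EURO SIGN: one char, three UTF-8 bytes
        System.out.println("length = " + s.length());
        System.out.println("bytes = " + s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```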

If you change your code like this:

Random random = new Random();
byte bytes[] = new byte[16];
random.nextBytes(bytes);
System.out.println("string = " + new String(bytes, "UTF-8").length());
System.out.println("string = " + new String(bytes, "ISO-8859-1").length());

the same bytes are interpreted with a different charset. And as the javadoc of String(byte[] bytes, String charsetName) states:

The length of the new String is a function of the charset, and hence may
not be equal to the length of the byte array.
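The charset-dependence is easiest to see with a fixed byte pair instead of random bytes: 0xC3 0xA9 is the UTF-8 encoding of 'é', while ISO-8859-1 maps every byte to exactly one char.

```java
import java.nio.charset.StandardCharsets;

public class CharsetLengthDemo {
    public static void main(String[] args) {
        // the two-byte UTF-8 sequence for 'é'
        byte[] bytes = { (byte) 0xC3, (byte) 0xA9 };
        // UTF-8 decodes the pair into a single char
        System.out.println(new String(bytes, StandardCharsets.UTF_8).length());
        // ISO-8859-1 decodes each byte to one char
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1).length());
    }
}
```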
SubOptimal
3

A classic mistake born of a misunderstanding of the relationship between bytes and chars, so here we go again.

There is no 1-to-1 mapping between byte and char; it all depends on the character encoding you use (in Java, that is a Charset).

Worse: given a byte sequence, it may or may not be decodable into a char sequence.

Try this for instance:

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Random;

final byte[] buf = new byte[16];
new Random().nextBytes(buf);

final Charset utf8 = StandardCharsets.UTF_8;
final CharsetDecoder decoder = utf8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT);

decoder.decode(ByteBuffer.wrap(buf)); // throws a checked CharacterCodingException

This is very likely to throw a MalformedInputException.

I know this is not exactly an answer, but then you didn't clearly explain your problem; and the example above already shows that you have a wrong understanding of the difference between a byte and a char.
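A deterministic variant of the sketch above: 0x80 is a continuation byte with no lead byte, so it is guaranteed malformed and the exception always fires.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    public static void main(String[] args) throws CharacterCodingException {
        // 0x80 is a continuation byte with no lead byte -> malformed UTF-8
        byte[] bad = { (byte) 0x80 };
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bad));
            System.out.println("decoded");
        } catch (MalformedInputException e) {
            System.out.println("malformed input at length " + e.getInputLength());
        }
    }
}
```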

fge
1

If you look at the string you're producing, most of the random bytes you're generating do not form valid UTF-8 sequences. The String constructor therefore replaces them with the Unicode REPLACEMENT CHARACTER � (U+FFFD), whose UTF-8 encoding takes up 3 bytes (EF BF BD).

As an example:

import java.io.UnsupportedEncodingException;
import java.util.Random;

public static void main(String[] args) throws UnsupportedEncodingException
{
    Random random = new Random();

    byte bytes[] = new byte[16];
    random.nextBytes(bytes);
    printBytes(bytes);

    final String s = new String(bytes, "UTF-8");
    System.out.println(s);
    printCharacters(s);
}

private static void printBytes(byte[] bytes)
{
    for (byte aByte : bytes)
    {
        System.out.print(
                Integer.toHexString(Byte.toUnsignedInt(aByte)) + " ");
    }
    System.out.println();
}

private static void printCharacters(String s)
{
    s.codePoints().forEach(i -> System.out.println(Character.getName(i)));
}

On a given run, I got this output:

30 41 9b ff 32 f5 38 ec ef 16 23 4a 54 26 cd 8c 
0A��2�8��#JT&͌
DIGIT ZERO
LATIN CAPITAL LETTER A
REPLACEMENT CHARACTER
REPLACEMENT CHARACTER
DIGIT TWO
REPLACEMENT CHARACTER
DIGIT EIGHT
REPLACEMENT CHARACTER
REPLACEMENT CHARACTER
SYNCHRONOUS IDLE
NUMBER SIGN
LATIN CAPITAL LETTER J
LATIN CAPITAL LETTER T
AMPERSAND
COMBINING ALMOST EQUAL TO ABOVE
kuporific
  • 1
    Maybe it's worth mentioning that `U+FFFD` is the code point of the replacement character `�` and its UTF-8 representation is `EF BF BD` (three bytes). More information can be found in the Wikipedia article [Specials (Unicode block)](https://en.wikipedia.org/wiki/Specials_%28Unicode_block%29). – SubOptimal Apr 21 '15 at 08:09
0

String.getBytes().length is likely to be larger, as it counts the bytes needed to encode the string, while length() counts UTF-16 code units (each 2 bytes, and not necessarily one per character).
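A quick illustration with a character outside the BMP, where length(), the code point count, and the byte count all disagree:

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitDemo {
    public static void main(String[] args) {
        // U+1F600 GRINNING FACE: one code point, two UTF-16 code units
        String s = "\uD83D\uDE00";
        System.out.println(s.length());                      // UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // code points
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // UTF-8 bytes
    }
}
```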


Subhan
0

This will try to create a String assuming the bytes are in UTF-8.

new String(bytes, "UTF-8");

This in general will go horribly wrong as UTF-8 multi-byte sequences can be invalid.

Like:

String s = new String(new byte[] { -128 }, StandardCharsets.UTF_8);

The second step:

byte[] bytes = s.getBytes();

will use the platform default encoding (System.getProperty("file.encoding")). Better to specify it explicitly:

byte[] bytes = s.getBytes(StandardCharsets.UTF_8);

One should realize that, internally, String maintains Unicode text as an array of 16-bit chars in UTF-16.

One should entirely abstain from using String to carry a byte[]. It will always involve a conversion, cost double the memory, and be error-prone.
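If the goal (as the question's "Hash generation" log message suggests) is a printable random token, one way to avoid the byte/char mismatch entirely is an explicit byte-to-text encoding such as Base64; hex would work just as well. A sketch that round-trips 16 bytes losslessly:

```java
import java.security.SecureRandom;
import java.util.Base64;

public class TokenDemo {
    public static void main(String[] args) {
        byte[] bytes = new byte[16];
        new SecureRandom().nextBytes(bytes);
        // Base64 maps arbitrary bytes to a safe ASCII string, reversibly
        String token = Base64.getEncoder().encodeToString(bytes);
        System.out.println(token.length()); // 24: 16 bytes pad out to 24 chars
        byte[] back = Base64.getDecoder().decode(token);
        System.out.println(back.length);    // 16: the original bytes survive intact
    }
}
```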

Joop Eggen