Why does Java not respect the given array length?

Question

I noticed a problem in this piece of code:

byte[] buf = new byte[6];
buf = "abcdef".getBytes();
System.out.println(buf.length);

The array was created for 6 bytes. If I get the bytes from a string of length 6, I expect to get many more bytes. So how do all these bytes fit into this array? Yet this works. Moreover, buf.length shows the length as if it were an array of chars, not of those bytes. Afterwards, I realized that in

byte[] buf = new byte[6];

the 6 does not mean much, i.e. I can put 0 or 1 or 2 and so on there and the code will still work (with buf.length showing the length of the given string, not of the array, which I see as a second problem or discrepancy).

This question is different from Why does Java's String.getBytes() uses “ISO-8859-1” because it has at least one more aspect: the variable assignment oversight (getBytes() returns a new array), i.e. that question doesn't fully address mine.

First, there's no need to pre-allocate the byte[], because getBytes allocates one for you. The garbage collector will notice that the first array you allocated is no longer needed and will get rid of it for you. Second, no one said it's one byte per character. It depends on your text encoding (and there are many different encodings; e.g., in UNICODE, your String will take 14 bytes). – Moshe Bixenshpaner, Aug 08 '15 at 14:07
I thought that getBytes() returns an array of 0s and 1s. Actually, I was not sure. Now I have tested it. – Luka, Aug 08 '15 at 14:59
@MosheBixenshpaner - your assertion is wrong; this string will NOT take 14 bytes in Unicode. Unicode does not determine how many bytes a string takes. The **encoding** determines that, depending on the **contents** of the String. UTF-8, UTF-16 and UTF-32 can all encode Unicode characters. Unicode is not tied to a byte representation; its specification says nothing about that. – , Aug 08 '15 at 15:56
@JarrodRoberson, agreed. Read my comment with UTF-16 instead of UNICODE. – Moshe Bixenshpaner, Aug 08 '15 at 22:46
@MosheBixenshpaner - `6` characters `*` 2 bytes ( or more ) is still not equal to `14` for `"abcdef"`. – , Aug 09 '15 at 00:24
When working with UTF-16, Java adds the BOM at the beginning of the text by default, so you should actually expect (1+6) * 2 = 14. That being said, UTF-16 doesn't necessarily take 2 bytes per character. – Moshe Bixenshpaner, Aug 09 '15 at 02:01

score 6 · Answer 1 · 2015-08-08T16:29:06.603

That is not how variable assignments work

Thinking that assigning a 6-byte array to a variable will limit the length of any other array assigned to the same variable shows a fundamental misunderstanding of what variables are and how they work.

Really think about it: why would assigning a fixed-length array to a variable limit the length of a different array assigned to that variable later?
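
To make the assignment point concrete, here is a minimal sketch. The class name ReassignDemo and the helper lengths() are made up for illustration, and the first array is sized 2 (an assumption) so the difference is visible. The variable simply ends up referring to a different array, while the first array keeps its own length.

```java
import java.nio.charset.StandardCharsets;

final class ReassignDemo {
    /** Returns {length of the first array, length of the array buf ends up referring to}. */
    static int[] lengths() {
        byte[] buf = new byte[2];   // buf refers to a 2-element array
        byte[] first = buf;         // keep a second reference to that same array

        // getBytes returns a brand-new array; the assignment only repoints buf
        buf = "abcdef".getBytes(StandardCharsets.US_ASCII);

        return new int[] { first.length, buf.length };
    }

    public static void main(String[] args) {
        int[] l = lengths();
        System.out.println("first array length = " + l[0]); // 2, it never changed size
        System.out.println("buf.length         = " + l[1]); // 6, buf now refers to the new array
    }
}
```

An array's length is fixed when the array is created; a variable of type byte[] only holds a reference and has no length of its own.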

Strings are Unicode in Java

Strings in Java are Unicode and are exposed as sequences of UTF-16 code units, which means each character takes 2 or 4 bytes in that representation. When a String is converted to a byte array, the number of bytes is determined by the encoding used for the conversion to byte[].

Always specify an appropriate character encoding when converting Strings to byte arrays to get what you expect.

But even then, UTF-8 does not guarantee a single byte per character, and ASCII cannot represent non-ASCII Unicode characters.
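
A small sketch of how the byte count depends on the encoding and on the contents. The class EncodedLengths, the helper byteCount and the sample string with non-ASCII characters are my own choices for illustration; the charsets come from java.nio.charset.StandardCharsets (JDK 7+).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedLengths {
    // Hypothetical helper: number of bytes s occupies once encoded with cs.
    static int byteCount(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String ascii = "abcdef";
        String mixed = "a\u00F1\u20AC"; // 'a', n-with-tilde (U+00F1), euro sign (U+20AC)

        System.out.println(byteCount(ascii, StandardCharsets.US_ASCII)); // 6
        System.out.println(byteCount(ascii, StandardCharsets.UTF_8));    // 6
        System.out.println(byteCount(ascii, StandardCharsets.UTF_16BE)); // 12, two bytes per char
        System.out.println(byteCount(ascii, StandardCharsets.UTF_16));   // 14, 2-byte BOM + 12

        System.out.println(byteCount(mixed, StandardCharsets.UTF_8));    // 6 = 1 + 2 + 3
        System.out.println(byteCount(mixed, StandardCharsets.UTF_16BE)); // 6 = 3 chars * 2
        System.out.println(byteCount(mixed, StandardCharsets.US_ASCII)); // 3, unmappable chars become '?'
    }
}
```

The same six-character string yields 6, 12 or 14 bytes depending only on the charset, so buf.length after getBytes() says nothing about the size of any array allocated beforehand.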

Character encoding is tricky

The ubiquitous internet encoding standard is UTF-8. It will be correct in 99.9999999% of all cases; in the cases where it isn't, converting from UTF-8 to the correct encoding is trivial because UTF-8 is so well supported in every toolchain.

Learn to make everything final and you will have a much easier time and less confusion.
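
As a rough illustration of how little code such a conversion takes, here is a sketch that decodes bytes from one charset and re-encodes them in another. The class Transcode and its transcode method are hypothetical names, not part of any library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Transcode {
    // Decode the input with the source charset, then encode with the target charset.
    // Characters the target cannot represent are replaced (typically with '?'), so this can be lossy.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        byte[] utf8 = "na\u00EFve".getBytes(StandardCharsets.UTF_8); // i-with-diaeresis (U+00EF) takes 2 bytes
        byte[] latin1 = transcode(utf8, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(utf8.length);   // 6
        System.out.println(latin1.length); // 5, one byte per character in ISO-8859-1
    }
}
```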

import com.google.common.base.Charsets;

import javax.annotation.Nonnull;
import java.util.Arrays;

public class Scratch
{
    public static void main(final String[] args)
    {
        printWithEncodings("Hello World!");
        printWithEncodings("こんにちは世界!");
    }

    private static void printWithEncodings(@Nonnull final String s)
    {
        System.out.println("s = " + s);
        final byte[] defaultEncoding = s.getBytes(); // never do this, you do not know what you will get!
        // for ASCII-only text, ISO-8859-1, US-ASCII and UTF-8 all produce the same single-byte representation
        final byte[] iso88591Encoding = s.getBytes(Charsets.ISO_8859_1);
        final byte[] asciiEncoding = s.getBytes(Charsets.US_ASCII);
        final byte[] utf8Encoding = s.getBytes(Charsets.UTF_8);
        final byte[] utf16Encoding = s.getBytes(Charsets.UTF_16);

        System.out.println("Arrays.toString(defaultEncoding) = " + Arrays.toString(defaultEncoding));
        System.out.println("Arrays.toString(iso88591) = " + Arrays.toString(iso88591Encoding));
        System.out.println("Arrays.toString(asciiEncoding) = " + Arrays.toString(asciiEncoding));
        System.out.println("Arrays.toString(utf8Encoding) = " + Arrays.toString(utf8Encoding));
        System.out.println("Arrays.toString(utf16Encoding) = " + Arrays.toString(utf16Encoding));
    }
}

results in

s = Hello World!
Arrays.toString(defaultEncoding) = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]
Arrays.toString(iso88591) = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]
Arrays.toString(asciiEncoding) = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]
Arrays.toString(utf8Encoding) = [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]
Arrays.toString(utf16Encoding) = [-2, -1, 0, 72, 0, 101, 0, 108, 0, 108, 0, 111, 0, 32, 0, 87, 0, 111, 0, 114, 0, 108, 0, 100, 0, 33]
s = こんにちは世界!
Arrays.toString(defaultEncoding) = [-29, -127, -109, -29, -126, -109, -29, -127, -85, -29, -127, -95, -29, -127, -81, -28, -72, -106, -25, -107, -116, 33]
Arrays.toString(iso88591) = [63, 63, 63, 63, 63, 63, 63, 33]
Arrays.toString(asciiEncoding) = [63, 63, 63, 63, 63, 63, 63, 33]
Arrays.toString(utf8Encoding) = [-29, -127, -109, -29, -126, -109, -29, -127, -85, -29, -127, -95, -29, -127, -81, -28, -72, -106, -25, -107, -116, 33]
Arrays.toString(utf16Encoding) = [-2, -1, 48, 83, 48, -109, 48, 107, 48, 97, 48, 111, 78, 22, 117, 76, 0, 33]

Always specify the Charset encoding!

.getBytes(Charset) is always the correct way to convert a String to bytes. Use whatever encoding you need.
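
The same rule applies in the other direction, when bytes are turned back into a String. A minimal sketch, with made-up names (CharsetRoundTrip, toUtf8, fromUtf8, misdecodeAsLatin1), showing a correct round trip next to a decode with the wrong charset:

```java
import java.nio.charset.StandardCharsets;

final class CharsetRoundTrip {
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Deliberately wrong: interprets UTF-8 bytes as ISO-8859-1.
    static String misdecodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = "caf\u00E9";            // 4 characters, the last one is e-acute (U+00E9)
        byte[] bytes = toUtf8(text);          // 5 bytes: e-acute takes 2 bytes in UTF-8

        System.out.println(bytes.length);                           // 5
        System.out.println(fromUtf8(bytes).equals(text));           // true
        System.out.println(misdecodeAsLatin1(bytes).equals(text));  // false
        System.out.println(misdecodeAsLatin1(bytes).length());      // 5, each byte became its own char
    }
}
```

Both sides have to agree on the charset; passing it explicitly at every encode and decode call keeps that agreement visible in the code.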

Internally supported encodings for JDK7

score 3 · Answer 2 · answered Aug 08 '15 at 13:47

new byte[6]; has no effect whatsoever, because the array reference buf is then updated with a reference to the array returned by "abcdef".getBytes();.

answered Aug 08 '15 at 13:47

Wand Maker

Thank you. That was the answer to one of my questions (the one in the title). – Luka Aug 08 '15 at 15:18

score 2 · Accepted Answer · answered Aug 08 '15 at 13:48

That's because String.getBytes() returns an entirely different array object which is then assigned to buf. You could have just as easily done this:

byte[] buf = "abcdef".getBytes();
System.out.println(buf.length);
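
If the goal really is to fill an array whose size was fixed in advance, the bytes have to be copied into it explicitly. A minimal sketch, assuming UTF-8 and silent truncation are acceptable; FixedBuffer and copyInto are hypothetical names:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class FixedBuffer {
    // Copies as many encoded bytes as fit into dest and returns how many were copied.
    // Note: truncation may cut a multi-byte character in half.
    static int copyInto(byte[] dest, String text) {
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(dest.length, encoded.length);
        System.arraycopy(encoded, 0, dest, 0, n);
        return n;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[4];
        int copied = copyInto(buf, "abcdef");

        System.out.println(copied);               // 4
        System.out.println(buf.length);           // 4, the array keeps its size
        System.out.println(Arrays.toString(buf)); // [97, 98, 99, 100]
    }
}
```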

answered Aug 08 '15 at 13:48

PakkuDon

This will still not get you a single byte per character, so this is not a complete or correct answer. – Aug 08 '15 at 14:01
Thank you. That was the answer to one of my questions (the one in the title). – Luka Aug 08 '15 at 15:18