3
byte bytes[] = new byte[16];
random.nextBytes(bytes);
try {
   return new String(bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
   log.warn("Hash generation failed", e);
}

When I generate a String with the given method and then call string.getBytes().length, it returns some other value; the max I saw was 32. Why does a 16-byte array end up producing a string with a different byte length?

But if I do string.length() it returns 16.

dinesh707
  • Try several times. Or try "string.getBytes().length", not "string.length()" – dinesh707 Apr 20 '15 at 14:09
  • 2
    Wait, what are you trying to do? You are mixing up bytes and chars; there is _no_ 1-to-1 mapping between the two. This looks like an XY problem, so please explain what you want to do instead. – fge Apr 20 '15 at 14:21

6 Answers

6

This is because your bytes are first decoded into a String, which attempts to interpret them as a UTF-8 byte sequence. If a byte cannot be treated as an ASCII char, nor combined with the following byte(s) to form a legal UTF-8 sequence, it is replaced by "�" (U+FFFD). That char is encoded as 3 bytes when you call String#getBytes(), thus adding 2 extra bytes to the resulting output.

If you're lucky enough to generate ASCII chars only, String#getBytes() will return a 16-byte array; if not, the resulting array may be longer. For example, the following code snippet:

byte[] b = new byte[16];
Arrays.fill(b, (byte) 190); // 0xBE: a lone UTF-8 continuation byte, always malformed
b = new String(b, "UTF-8").getBytes("UTF-8");

returns an array that is 48(!) bytes long.
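A self-contained version of that snippet, pinning both the decoding and the re-encoding to UTF-8 so the result does not depend on the platform default charset:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementDemo {
    public static void main(String[] args) {
        byte[] b = new byte[16];
        // 0xBE is a UTF-8 continuation byte; on its own it is always malformed
        Arrays.fill(b, (byte) 190);
        String s = new String(b, StandardCharsets.UTF_8);
        // each malformed byte decodes to U+FFFD, which re-encodes as 3 bytes
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 48
    }
}
```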

Alex Salauyou
  • 1
    Your answer is confusing. "If you're lucky to generate ASCII chars only, where one byte is mapped to one char, result will be 32." This makes no sense. If you create a byte array with the following ASCII values `[108, 111, 118, 101, 108, 121, 32, 32, 67, 65, 70, 69, 66, 65, 66, 69]` and create a new string in Latin-1 (`new String(b, "ISO-8859-1").getBytes().length`) or UTF-8 (`new String(b, "UTF-8").getBytes().length`), both have a length of 16 bytes or 16 chars (`... .toCharArray().length`). – SubOptimal Apr 20 '15 at 22:09
3

The generated bytes might contain valid multibyte characters.

Take this as an example. The string contains only one character, but its byte representation takes three bytes.

String s = "Ω";
System.out.println("length = " + s.length());
System.out.println("bytes = " + Arrays.toString(s.getBytes("UTF-8")));

String.length() returns the length of the string in characters. The string holds a single character, yet that character is 3 bytes long in UTF-8.
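To pin the effect down with an explicit code point (I'm using the euro sign U+20AC here as a stand-in, since it is unambiguously a three-byte character in UTF-8):

```java
import java.nio.charset.StandardCharsets;

public class MultiByteDemo {
    public static void main(String[] args) {
        String s = "\u20AC"; // EURO SIGN: one char, three UTF-8 bytes
        System.out.println("length = " + s.length());
        System.out.println("bytes = " + s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```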

If you change your code like this:

Random random = new Random();
byte bytes[] = new byte[16];
random.nextBytes(bytes);
System.out.println("string = " + new String(bytes, "UTF-8").length());
System.out.println("string = " + new String(bytes, "ISO-8859-1").length());

the same bytes are interpreted with a different charset. And as the javadoc of String(byte[] bytes, String charsetName) states:

The length of the new String is a function of the charset, and hence may
not be equal to the length of the byte array.
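The charset-dependence is easiest to see with a fixed byte pair instead of random bytes: 0xC3 0xA9 is the UTF-8 encoding of 'é', while ISO-8859-1 maps every byte to exactly one char.

```java
import java.nio.charset.StandardCharsets;

public class CharsetLengthDemo {
    public static void main(String[] args) {
        // the two-byte UTF-8 sequence for 'é'
        byte[] bytes = { (byte) 0xC3, (byte) 0xA9 };
        // UTF-8 decodes the pair into a single char
        System.out.println(new String(bytes, StandardCharsets.UTF_8).length());
        // ISO-8859-1 decodes each byte to one char
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1).length());
    }
}
```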
SubOptimal
3

A classic mistake born of a misunderstanding of the relationship between bytes and chars, so here we go again.

There is no 1-to-1 mapping between byte and char; it all depends on the character encoding you use (in Java, that is a Charset).

Worse: given a byte sequence, it may or may not be decodable into a char sequence.

Try this for instance:

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Random;

final byte[] buf = new byte[16];
new Random().nextBytes(buf);

final Charset utf8 = StandardCharsets.UTF_8;
final CharsetDecoder decoder = utf8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT);

decoder.decode(ByteBuffer.wrap(buf)); // throws a checked CharacterCodingException

This is very likely to throw a MalformedInputException.

I know this is not exactly an answer, but then you didn't clearly explain your problem; and the example above already shows that you have a wrong understanding of the difference between a byte and a char.
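A deterministic variant of the sketch above: 0x80 is a continuation byte with no lead byte, so it is guaranteed malformed and the exception always fires.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    public static void main(String[] args) throws CharacterCodingException {
        // 0x80 is a continuation byte with no lead byte -> malformed UTF-8
        byte[] bad = { (byte) 0x80 };
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bad));
            System.out.println("decoded");
        } catch (MalformedInputException e) {
            System.out.println("malformed input at length " + e.getInputLength());
        }
    }
}
```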

fge
1

If you look at the string you're producing, most of the random bytes you're generating do not form valid UTF-8 sequences. The String constructor therefore replaces them with the Unicode REPLACEMENT CHARACTER � (U+FFFD), whose UTF-8 encoding takes up 3 bytes (EF BF BD).

As an example:

import java.io.UnsupportedEncodingException;
import java.util.Random;

public static void main(String[] args) throws UnsupportedEncodingException
{
    Random random = new Random();

    byte bytes[] = new byte[16];
    random.nextBytes(bytes);
    printBytes(bytes);

    final String s = new String(bytes, "UTF-8");
    System.out.println(s);
    printCharacters(s);
}

private static void printBytes(byte[] bytes)
{
    for (byte aByte : bytes)
    {
        System.out.print(
                Integer.toHexString(Byte.toUnsignedInt(aByte)) + " ");
    }
    System.out.println();
}

private static void printCharacters(String s)
{
    s.codePoints().forEach(i -> System.out.println(Character.getName(i)));
}

On a given run, I got this output:

30 41 9b ff 32 f5 38 ec ef 16 23 4a 54 26 cd 8c 
0A��2�8��#JT&͌
DIGIT ZERO
LATIN CAPITAL LETTER A
REPLACEMENT CHARACTER
REPLACEMENT CHARACTER
DIGIT TWO
REPLACEMENT CHARACTER
DIGIT EIGHT
REPLACEMENT CHARACTER
REPLACEMENT CHARACTER
SYNCHRONOUS IDLE
NUMBER SIGN
LATIN CAPITAL LETTER J
LATIN CAPITAL LETTER T
AMPERSAND
COMBINING ALMOST EQUAL TO ABOVE
kuporific
  • 1
    Maybe it's worth mentioning that `U+FFFD` is the code point of the replacement character `�` and its UTF-8 representation is `EF BF BD` (three bytes). More information can be found in the Wikipedia article [Specials (Unicode block)](https://en.wikipedia.org/wiki/Specials_%28Unicode_block%29). – SubOptimal Apr 21 '15 at 08:09
0

String.getBytes().length is likely to be larger, as it counts the bytes needed to encode the string, while length() counts UTF-16 code units (each 2 bytes, and not necessarily one per character).
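A quick illustration with a character outside the BMP, where length(), the code point count, and the byte count all disagree:

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitDemo {
    public static void main(String[] args) {
        // U+1F600 GRINNING FACE: one code point, two UTF-16 code units
        String s = "\uD83D\uDE00";
        System.out.println(s.length());                      // UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // code points
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // UTF-8 bytes
    }
}
```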


Subhan
0

This will try to create a String assuming the bytes are in UTF-8.

new String(bytes, "UTF-8");

This in general will go horribly wrong as UTF-8 multi-byte sequences can be invalid.

Like:

String s = new String(new byte[] { -128 }, StandardCharsets.UTF_8);

The second step:

byte[] bytes = s.getBytes();

will use the platform default encoding (System.getProperty("file.encoding")). Better to specify it explicitly:

byte[] bytes = s.getBytes(StandardCharsets.UTF_8);

One should realize that, internally, String maintains Unicode text as an array of 16-bit chars in UTF-16.

One should entirely abstain from using String to carry a byte[]. It will always involve a conversion, cost double the memory, and be error-prone.
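If the goal (as the question's "Hash generation" log message suggests) is a printable random token, one way to avoid the byte/char mismatch entirely is an explicit byte-to-text encoding such as Base64; hex would work just as well. A sketch that round-trips 16 bytes losslessly:

```java
import java.security.SecureRandom;
import java.util.Base64;

public class TokenDemo {
    public static void main(String[] args) {
        byte[] bytes = new byte[16];
        new SecureRandom().nextBytes(bytes);
        // Base64 maps arbitrary bytes to a safe ASCII string, reversibly
        String token = Base64.getEncoder().encodeToString(bytes);
        System.out.println(token.length()); // 24: 16 bytes pad out to 24 chars
        byte[] back = Base64.getDecoder().decode(token);
        System.out.println(back.length);    // 16: the original bytes survive intact
    }
}
```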

Joop Eggen