-1

Given a String of length Integer.MAX_VALUE which contains characters that require more than one byte to represent, such as Chinese ideograms, what result would I get if I executed String.getBytes()? Is there any good way of testing for this type of error?

trBlueJ
    You'd probably get something like [this](https://stackoverflow.com/questions/3038392/do-java-arrays-have-a-maximum-size). – Sweeper Jan 02 '21 at 02:46

3 Answers

1

My question for you is how you would come up with such a String in the first place. I couldn't find a way to build a String that big. Everything I tried gave me an error like:

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit

The longest String of two-byte (UTF-16) chars that I could find a way to build has a size in bytes just shy of Integer.MAX_VALUE. I built it via:

String foo = "\uD83D".repeat((Integer.MAX_VALUE)/2-1);

which gives you a String of 1073741822 chars, or 2147483644 bytes of UTF-16 storage. I can't answer your question for a String longer than this, but even this String causes an error when you try to convert it to bytes via:

byte[] blah = foo.getBytes();

You get the error:

Exception in thread "main" java.lang.NegativeArraySizeException: -1073741830

I expect you'd fare no better if you could somehow come up with a String that was longer in bytes. I hope this answers both your "what would happen" and "how would you test" questions.

Here's my complete test and output:

public class Test {
    public static void main(String[] args) {

        // Display MAX_VALUE
        System.out.println(Integer.MAX_VALUE);

        // By a bit of trial and error, build the longest two-byte character string possible with String.repeat()
        String foo = "\uD83D".repeat((Integer.MAX_VALUE)/2-1);

        // Display the number of bytes this string takes to store, which is just short of Integer.MAX_VALUE
        System.out.println(foo.length());
        System.out.println(foo.length()*2);

        // This line craps out even though the String length in bytes is less than Integer.MAX_VALUE
        byte[] blah = foo.getBytes();
    }
}

Result:

2147483647
1073741822
2147483644
Exception in thread "main" java.lang.NegativeArraySizeException: -1073741830
    at java.base/java.lang.StringCoding.encodeUTF8_UTF16(StringCoding.java:910)
    at java.base/java.lang.StringCoding.encodeUTF8(StringCoding.java:885)
    at java.base/java.lang.StringCoding.encode(StringCoding.java:489)
    at java.base/java.lang.String.getBytes(String.java:981)
    at Test.main(Test.java:15)
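
The negative size in that trace is consistent with an int overflow: 1073741822 chars times a worst case of 3 UTF-8 bytes per char is 3221225466, which wraps to -1073741830 in 32-bit arithmetic. The sketch below only reproduces that arithmetic. The 3-bytes-per-char estimate is my assumption about what the encoder does, and the class and method names are hypothetical, not JDK code.

```java
class Utf8OverflowDemo {
    // Hypothetical helper: an int-sized worst-case estimate of 3 UTF-8 bytes per char.
    // This is an assumption about the encoder's sizing, not a copy of JDK code.
    static int worstCaseUtf8Bytes(int charCount) {
        return charCount * 3; // int multiplication wraps silently on overflow
    }

    // The same estimate computed in long arithmetic, which cannot overflow here.
    static long worstCaseUtf8BytesExact(int charCount) {
        return 3L * charCount;
    }

    public static void main(String[] args) {
        int chars = Integer.MAX_VALUE / 2 - 1;              // 1073741822, as in the test above
        System.out.println(worstCaseUtf8Bytes(chars));      // prints -1073741830
        System.out.println(worstCaseUtf8BytesExact(chars)); // prints 3221225466
    }
}
```

If that assumption holds, any String longer than Integer.MAX_VALUE / 3 chars would hit the same problem on this code path, regardless of how many bytes the encoded result actually needs.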

You should be able to catch whatever is thrown during your String processing, which will more likely happen while building up your String than when converting it to bytes. Just remember to catch Throwable rather than Exception: OutOfMemoryError is an Error, not an Exception, while NegativeArraySizeException is a RuntimeException. Throwable will catch either.
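
As a minimal sketch of that kind of guard, the helper below catches the two failures seen in this answer and reports them as an empty result. The names SafeEncode and tryGetBytes are hypothetical, and catching only these two types (instead of Throwable) is a choice made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

class SafeEncode {
    // Hypothetical helper: returns the UTF-8 bytes, or empty if the conversion
    // fails with one of the two failures observed above.
    static Optional<byte[]> tryGetBytes(String s) {
        try {
            return Optional.of(s.getBytes(StandardCharsets.UTF_8));
        } catch (OutOfMemoryError | NegativeArraySizeException e) {
            // OutOfMemoryError is an Error; NegativeArraySizeException is a RuntimeException.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Two CJK characters, 3 UTF-8 bytes each.
        Optional<byte[]> bytes = tryGetBytes("\u65E5\u672C");
        System.out.println(bytes.map(b -> b.length).orElse(-1)); // prints 6
    }
}
```

Recovering from an OutOfMemoryError is not always reliable, so treat this as a way to observe the failure in a test rather than as production error handling.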

CryptoFool
0

Based on what seems to be the source code for the JRE String class, it calls an 'encode' method in the StringCoding class, which calculates the maximum number of bytes needed for the given string, and returns the result in an int. See the 'encode' method which calls 'scale'.

So, depending on the exact result, you'll either get string truncation (if the result is positive) or total failure (if the result appears negative). Since I didn't chase the logic down into the ArrayEncoder class, it's possible there will also be an 'array index out of bounds' exception during the conversion.

(Link is to some random copy of source code on the internet, probably not the current code).

This is presumably of theoretical interest only -- a String with 2 billion characters is not likely to perform well.

a guest
0

String is a sophisticated immutable class. Historically it just held a char[] array of two-byte UTF-16 chars, and String.getBytes(StandardCharsets.UTF_8) could then indeed be expected to overflow the index range.

Nowadays, however, String holds a byte[] value, which allows strings to be stored compactly in another Charset. The problem still exists: for instance, a compacted ISO-8859-1 String of almost Integer.MAX_VALUE chars can explode in UTF-8 (even with String.toCharArray()), resulting in an OutOfMemoryError.
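
To illustrate the expansion on a small scale: a char in the range U+0080 to U+00FF takes one byte in ISO-8859-1 but two in UTF-8, so such a String doubles in size when encoded. This is a sketch with hypothetical names (Latin1ExpansionDemo, encodedLength) and a deliberately tiny String.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1ExpansionDemo {
    // Hypothetical helper: length in bytes of s when encoded with the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) fits in one ISO-8859-1 byte.
        String s = "\u00E9".repeat(1000);
        System.out.println(encodedLength(s, StandardCharsets.ISO_8859_1)); // prints 1000
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));      // prints 2000
    }
}
```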

Hence several different overflows are possible, but for UTF-16 chars to getBytes(UTF-8):

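private static final int MAX_INDEX = Integer.MAX_VALUE;

// Throws if the UTF-8 encoding of s would not fit in a byte[].
void checkUtf8Bytes(String s) {
    if (s.length() < MAX_INDEX / 6) {
        return; // Short enough: not hurt even by 6-byte UTF-8 sequences.
    }
    // Sum in long arithmetic so the total itself cannot overflow.
    if (s.codePoints().mapToLong(this::bytesNeeded).sum() > MAX_INDEX) {
        throw new IllegalArgumentException("String too long for UTF-8 encoding");
    }
}

// Number of UTF-8 bytes for one code point (standard UTF-8 ranges).
private int bytesNeeded(int codePoint) {
    if (codePoint < 0x80) {
        return 1;
    } else if (codePoint < 0x800) {
        return 2;
    } else if (codePoint < 0x10000) {
        return 3;
    }
    return 4;
}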

I think it is easier to catch an OutOfMemoryError.

Mind that a normal String with UTF-16 chars in its byte[] can hold no more than Integer.MAX_VALUE / 2 chars.

Joop Eggen