I'm maintaining a back-end service in Java, and I have the following Java 8 method that validates the input to my service API:
private static boolean containsDisallowedChars(String toValidate) {
    return !StandardCharsets.US_ASCII.newEncoder().canEncode(toValidate);
}
I'm expanding it to support Hindi and other non-English characters, so I've changed it from ASCII to UTF-8, as follows:
private static boolean containsDisallowedChars(String toValidate) {
    return !StandardCharsets.UTF_8.newEncoder().canEncode(toValidate);
}
Now I'm trying to update the corresponding unit test to pass in a String toValidate that will cause this method to return true, that is, a String for which canEncode returns false.
How can I make a Java String that contains contents that can't be encoded to UTF-8?
I tried this test setup:
// ref https://stackoverflow.com/questions/1301402/example-invalid-utf8-string
// test data byte values https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
// 3.5 Impossible bytes
// The following two bytes cannot appear in a correct UTF-8 string
// 3.5.1 fe = "�"
// 3.5.2 ff = "�"
// 3.5.3 fe fe ff ff = "����"
final byte[] bytes = {(byte)0xfe, (byte)0xfe, (byte)0xff, (byte)0xff};
log.info("bytes={}", bytes);
final String s = new String(bytes);
log.info("s={}", s);
log.info("s.length={}", s.length());
log.info("s.bytes={}", s.getBytes());
StandardCharsets.UTF_8.newEncoder().canEncode(s) returns true, and the log output shows that the String constructor replaces each invalid byte with the replacement character U+FFFD (bytes ef bf bd in UTF-8):
bytes=[-2, -2, -1, -1]
s=����
s.length=4
s.bytes=[-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67]
I tried several variations on this, using other invalid UTF-8 byte arrays described in https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt, with similar results.
It seems as if the String class is robustly creating valid UTF-8 strings despite my efforts to supply invalid byte arrays.
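The substitution seen in the log can be reproduced in isolation. The following is a minimal sketch, assuming the bytes are decoded as UTF-8; the class name ReplacementDemo and its two helper methods are hypothetical and not part of my service. It contrasts the lenient String constructor, which swaps malformed bytes for U+FFFD, with a CharsetDecoder set to CodingErrorAction.REPORT, which rejects the same bytes:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {

    // Lenient decoding: each malformed byte is replaced with U+FFFD,
    // so the resulting String is always encodable as UTF-8.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict decoding: returns false when the bytes are not well-formed UTF-8.
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xfe, (byte) 0xfe, (byte) 0xff, (byte) 0xff};
        String s = decodeLenient(bytes);
        System.out.println("lenient length = " + s.length());
        System.out.println("all replaced   = " + s.equals("\uFFFD\uFFFD\uFFFD\uFFFD"));
        System.out.println("well-formed    = " + isWellFormedUtf8(bytes));
    }
}
```

This suggests that the invalid bytes are lost at decoding time, before the String ever reaches canEncode, so validating raw bytes and validating a String are two different checks.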
I tried Base64, as suggested in "How can I generate non-UTF-8 string / char in Java for testing purposes?":
final byte[] bytes = {(byte)0xfe, (byte)0xfe, (byte)0xff, (byte)0xff};
log.info("bytes={}", bytes);
final String s = new String(Base64.getEncoder().encode(bytes));
log.info("s={}", s);
log.info("s.length={}", s.length());
log.info("s.bytes={}", s.getBytes());
Base64.getEncoder().encode doesn't return a String; it returns a byte[]. I must therefore still call new String(byte[]), and because Base64 output consists only of ASCII characters, the result is a valid UTF-8 string. StandardCharsets.UTF_8.newEncoder().canEncode still returns true, and I get this log output:
bytes=[-2, -2, -1, -1]
s=/v7//w==
s.length=8
s.bytes=[47, 118, 55, 47, 47, 119, 61, 61]
Is it possible to create a Java String object whose contents can't be encoded as UTF-8? If not, does that mean my containsDisallowedChars method is unnecessary, since it can never return true? Or is there a different validation approach I should consider instead of StandardCharsets.UTF_8.newEncoder().canEncode?
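One candidate, offered as an assumption rather than something confirmed by the tests above: a Java String is a sequence of UTF-16 code units, so it can hold an unpaired surrogate such as U+D800, which has no UTF-8 encoding. A minimal sketch of test inputs built that way, reusing the validator shown earlier (the class name SurrogateCheck is hypothetical, and the method is made package-private here only so it can be called from outside the class):

```java
import java.nio.charset.StandardCharsets;

public class SurrogateCheck {

    // Same logic as the validator in the question.
    static boolean containsDisallowedChars(String toValidate) {
        return !StandardCharsets.UTF_8.newEncoder().canEncode(toValidate);
    }

    public static void main(String[] args) {
        // A lone high surrogate: not a valid Unicode scalar value on its own.
        String loneSurrogate = "\uD800";
        // A complete surrogate pair (an emoji), which UTF-8 can encode.
        String validPair = "\uD83D\uDE00";
        // Devanagari letters, as an example of Hindi input.
        String hindi = "\u0928\u092E";

        System.out.println("lone surrogate disallowed: " + containsDisallowedChars(loneSurrogate));
        System.out.println("valid pair disallowed:     " + containsDisallowedChars(validPair));
        System.out.println("hindi disallowed:          " + containsDisallowedChars(hindi));
    }
}
```

If this holds, the method can return true, but only for strings containing unpaired surrogates, not for strings built by decoding invalid byte arrays.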