I'm maintaining a back-end service in Java, and I have the following Java 8 method that validates the input to my service API:

private static boolean containsDisallowedChars(String toValidate) {
    return !StandardCharsets.US_ASCII.newEncoder().canEncode(toValidate);
}

I'm expanding it to support Hindi and other non-English characters, so I've changed it from ASCII to UTF-8, as follows:

private static boolean containsDisallowedChars(String toValidate) {
    return !StandardCharsets.UTF_8.newEncoder().canEncode(toValidate);
}

Now I'm trying to update the corresponding unit test to pass in a String toValidate that will cause this method to return true.

How can I make a Java String that contains contents that can't be encoded to UTF-8?

I tried this test setup:

// ref https://stackoverflow.com/questions/1301402/example-invalid-utf8-string
// test data byte values https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
// 3.5  Impossible bytes
// The following two bytes cannot appear in a correct UTF-8 string
// 3.5.1  fe = "�"
// 3.5.2  ff = "�"
// 3.5.3  fe fe ff ff = "����"
final byte[] bytes = {(byte)0xfe, (byte)0xfe, (byte)0xff, (byte)0xff};
log.info("bytes={}", bytes);
final String s = new String(bytes);
log.info("s={}", s);
log.info("s.length={}", s.length());
log.info("s.bytes={}", s.getBytes());

StandardCharsets.UTF_8.newEncoder().canEncode(s) returns true and the log output shows that the String class constructor is changing the byte array as follows:

bytes=[-2, -2, -1, -1]
s=����
s.length=4
s.bytes=[-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67]

I tried several variations on this with similar results using other invalid UTF-8 byte arrays described in https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

It seems as if the String class is robustly creating valid UTF-8 strings despite my efforts to supply invalid byte arrays.
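The behaviour in the log above can be reproduced without relying on the platform default charset. This is a minimal sketch, assuming the default charset in the original run was UTF-8; the class name ReplacementDemo and the helper decodeLeniently are hypothetical names used only for illustration. It shows that new String(byte[], Charset) substitutes U+FFFD (bytes EF BF BD, printed as -17, -65, -67) for each malformed byte instead of rejecting the input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplacementDemo {
    // new String(byte[], Charset) never fails on malformed input:
    // it replaces each malformed sequence with U+FFFD.
    static String decodeLeniently(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xfe, (byte) 0xfe, (byte) 0xff, (byte) 0xff};
        String s = decodeLeniently(bytes);

        // Print each char as a code unit to see what the String really holds.
        for (char c : s.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }

        // Re-encoding yields the UTF-8 form of U+FFFD, not the original bytes.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));

        // The resulting String is perfectly encodable.
        System.out.println(StandardCharsets.UTF_8.newEncoder().canEncode(s));
    }
}
```

Because the malformed bytes are gone by the time the String exists, a check on the String cannot detect them.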

I tried Base64, as suggested in "How can I generate non-UTF-8 string / char in Java for testing purposes?":

final byte[] bytes = {(byte)0xfe, (byte)0xfe, (byte)0xff, (byte)0xff};
log.info("bytes={}", bytes);
final String s = new String(Base64.getEncoder().encode(bytes));
log.info("s={}", s);
log.info("s.length={}", s.length());
log.info("s.bytes={}", s.getBytes());

Base64.getEncoder().encode doesn't return a String; it returns a byte[]. Therefore I must still call new String(byte[]), which changes the byte array to a valid UTF-8 byte array. StandardCharsets.UTF_8.newEncoder().canEncode still returns true, and I get this log output:

bytes=[-2, -2, -1, -1]
s=/v7//w==
s.length=8
s.bytes=[47, 118, 55, 47, 47, 119, 61, 61]

Is it possible to create a Java String object that contains a string that can't be encoded as UTF-8? If not, does it mean my containsDisallowedChars method is unnecessary since it can never return true? Or is there a different validation approach I should consider instead of StandardCharsets.UTF_8.newEncoder().canEncode?

Ris Misner
    Before you do anything else: Specify a charset in your conversions between bytes and String! Unless you are using a fairly recent version of Java, `new String(bytes)` and `s.getBytes()` will use the system’s default charset, which may or may not be UTF-8. Don’t take chances: be certain by using `new String(bytes, StandardCharsets.UTF_8)` and `s.getBytes(StandardCharsets.UTF_8)` respectively. To answer your question: You can write `"\ufffe\uffff"` in your code. They’re Unicode noncharacters, but they are valid 16-bit char values. – VGR Jan 10 '23 at 14:29

1 Answer

In your question, you noted:

It seems as if the String class is robustly creating valid UTF-8 strings despite my efforts to supply invalid byte arrays.

If you want to test a byte array to see if it is valid for a specific encoding, then you can use CharsetDecoder (not CharsetEncoder).

The CharsetDecoder can:

transform a sequence of bytes in a specific charset into a sequence of sixteen-bit Unicode characters.

If you pass the decode() method a ByteBuffer, you can use it as follows:

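import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Returns true only if the bytes form a well-formed UTF-8 sequence.
private static boolean testBytes(byte[] bytes) {
    try {
        // A fresh decoder reports malformed input by throwing instead of replacing it.
        StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
        return true;
    } catch (CharacterCodingException ex) {
        return false;
    }
}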

So, for example, the following will print false because 0xFF is not a valid UTF-8 byte sequence.

// HexFormat requires Java 17 or later; on Java 8, use: byte[] b = {(byte) 0xff};
byte[] b = HexFormat.of().parseHex("ff");
System.out.println(testBytes(b));

Passing your example {(byte)0xfe, (byte)0xfe, (byte)0xff, (byte)0xff} to testBytes will also return false.
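If the service receives raw bytes, the same decoder can also produce the String, so that validation and conversion happen in one step instead of calling new String(bytes) afterwards. The following is a sketch under that assumption, written to compile on Java 8; the class name StrictUtf8 and the method decodeStrict are hypothetical. The two onMalformedInput/onUnmappableCharacter calls restate the decoder's default behaviour explicitly.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Decodes bytes as UTF-8, throwing instead of silently inserting U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        // Hindi text written with Unicode escapes so the source file encoding does not matter.
        byte[] valid = "\u0928\u092E\u0938\u094D\u0924\u0947".getBytes(StandardCharsets.UTF_8);
        byte[] invalid = {(byte) 0xfe, (byte) 0xfe, (byte) 0xff, (byte) 0xff};

        for (byte[] input : new byte[][] {valid, invalid}) {
            try {
                System.out.println("decoded: " + decodeStrict(input));
            } catch (CharacterCodingException ex) {
                System.out.println("rejected: " + ex);
            }
        }
    }
}
```

A new decoder is created per call because CharsetDecoder instances are stateful and not safe for concurrent use.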


In your question, you asked:

Is it possible to create a Java String object that contains a string that can't be encoded as UTF-8?

By the time you have created a Java String, it's "too late" to check because, as you have seen, any unsupported byte sequences have already been replaced by the Unicode replacement character - which is itself a valid character in a Java string (the Java String object itself "represents a string in the UTF-16 format" - and both UTF-8 and UTF-16 cover all valid Unicode code points).
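There is one case, as far as I know, where an existing String still fails the check: a String is a sequence of 16-bit char values, and nothing stops it from holding an unpaired surrogate (a char in the range U+D800 to U+DFFF without its partner). That is not a valid Unicode scalar value, so the UTF-8 encoder's canEncode returns false for it. Such a String cannot come out of new String(bytes, UTF_8), but it may arise from char-level operations such as a substring that splits a surrogate pair. The sketch below reuses the containsDisallowedChars method from the question; the class name SurrogateCheck is hypothetical.

```java
import java.nio.charset.StandardCharsets;

class SurrogateCheck {
    // Same check as in the question.
    static boolean containsDisallowedChars(String toValidate) {
        return !StandardCharsets.UTF_8.newEncoder().canEncode(toValidate);
    }

    public static void main(String[] args) {
        // A lone high surrogate: legal as a Java char, not encodable as UTF-8.
        String loneHigh = String.valueOf((char) 0xD800);

        // A complete surrogate pair (U+1F600) is fine.
        String pair = new String(Character.toChars(0x1F600));

        // Splitting the pair with substring leaves an unpaired surrogate.
        String split = pair.substring(0, 1);

        System.out.println(containsDisallowedChars(loneHigh));
        System.out.println(containsDisallowedChars(pair));
        System.out.println(containsDisallowedChars(split));
    }
}
```

This gives the unit test an input for which the method returns true, and it means the check is not entirely redundant, although it only guards against malformed UTF-16 in the String and not against malformed bytes upstream.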

andrewJames