
For any given Java String s, I would like to know if the array of characters represented by s is guaranteed to be a valid UTF-16 string, e.g.:

final char[] ch = new char[s.length()];
for (int i = 0; i < ch.length; ++i) {
    ch[i] = s.charAt(i);
}
// Is ch guaranteed to be a valid UTF-16 encoded string?

If not, what are some simple Java-language test cases that produce invalid UTF-16?

EDIT: Somebody has flagged the question as a possible duplicate of *Is a Java char array always a valid UTF-16 (Big Endian) encoding?* All I can say is, there's a difference between a String and a char[], and a reason why the former might, at least theoretically, have guarantees as to its contents that the latter does not. I'm not asking a question about arrays; I'm asking a question about Strings.

0xbe5077ed
  • Answered in this related [link](http://stackoverflow.com/questions/7019504/in-what-encoding-is-a-java-char-stored-in) – Ferdinand Neman Aug 27 '15 at 03:56
  • @FerdinandNeman, I don't think the link answers the question. I understand that a `char` holds a UTF-16 code unit, and that a String is a sequence of `char`s. But not all sequences of UTF-16 code units are valid UTF-16. The question is whether there is any way, using valid Java code, to have a String represent a sequence of code units that is actually invalid UTF-16, or whether the language guarantees that this can't happen. – 0xbe5077ed Aug 27 '15 at 03:58
  • Excerpt from the chosen answer: "... casting char to int will always give you a UTF-16 value if the char actually contains a character from that charset. If you just poked some random value into the char, it **obviously won't necessarily be a valid UTF-16** character, and likewise if you read the character in using a bad encoding. The docs go on to discuss how the supplementary UTF-16 characters can only be represented by an int, since char doesn't have enough space to hold them, ...". I take that as **not guaranteed**. Just a comment, though... – Ferdinand Neman Aug 27 '15 at 04:04
  • @FerdinandNeman, my question is about Strings though, not `char`s: if you take all of the `char` values from a String and put them in a line, are they guaranteed to make a valid UTF-16 sequence? – 0xbe5077ed Aug 27 '15 at 04:06
  • I think it is still relevant. If you append to a StringBuffer a char that was created using a different encoding, the StringBuffer won't complain. The String will contain the sequence of chars, and when you get one back with charAt(), it is simply returned as a char: an unsigned 16-bit integer. You may assume it is UTF-16 (as the char spec says), but it might not actually be. – Ferdinand Neman Aug 27 '15 at 04:20
  • It is trivial to create a `String` instance that contains invalid UTF-16. See [this question/answer](http://stackoverflow.com/a/31622336/596219). – 一二三 Aug 27 '15 at 05:07
  • @一二三, why not then post that as the answer to this question? The problem I have with the link you provide isn't that you didn't answer my question (you did, very succinctly) but that you are *answering a different question*. – 0xbe5077ed Aug 28 '15 at 18:03

2 Answers


No, an instance of a Java String is not guaranteed to contain a valid sequence of UTF-16 code units (that is, of 16-bit values) at all points during a program's execution. It really has to work this way, too.

This is trivial to prove. Imagine you have a sequence of code points (which are 21-bit quantities, typically stored in 32-bit ints) that you wish to append to a String, one char unit at a time. If some of those code points lie above the Basic Multilingual Plane (that is, have values > 0xFFFF and so require more than 16 bits to hold them), then when adding 16-bit code units one at a time, there will be a point at which the String holds only a leading surrogate but not yet the required trailing surrogate.

In other words, it works more like a char-unit buffer — a buffer of 16-bit values — than it does a legal UTF-16 sequence. This really is a necessary aspect of the String type.

Only when converting this to a particular encoding would there be any trouble, since mismatched, flipped, or lone surrogates are not legal in any of the three UTF forms, and therefore the encoder would be unable to represent them.
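For illustration, here is a minimal sketch of the scenario described above (the variable names are mine; it uses only the standard Character surrogate helpers):

int codePoint = 0x1F600;  // a supplementary code point, above the BMP
char lead = Character.highSurrogate(codePoint);
char trail = Character.lowSurrogate(codePoint);

StringBuilder sb = new StringBuilder();
sb.append(lead);
String halfway = sb.toString();  // a perfectly legal Java String, but it ends
                                 // with a lone lead surrogate: not valid UTF-16
sb.append(trail);
String complete = sb.toString(); // now a well-formed surrogate pair

At the halfway point the String can be stored, printed, and compared without complaint; only an attempt to encode it to one of the UTF forms would reject it.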

tchrist
  • Now we're getting somewhere. The only thing this answer is missing for me to mark it as the correct answer is code! Per my question "If not, what are some simple Java-language test cases that produce not valid UTF-16?", I'd love to have a unit test that proves the assertion when run in Java... – 0xbe5077ed Aug 28 '15 at 18:06

No. A String is simply an unrestricted wrapper for a char[]:

char data[] = {'\uD800', 'b', 'c'};  // Unpaired lead surrogate
String str = new String(data);

To test a String or char[] for well-formed UTF-16 data, you can use CharsetEncoder:

CharsetEncoder encoder = Charset.forName("UTF-16LE").newEncoder();
ByteBuffer bytes = encoder.encode(CharBuffer.wrap(str)); // throws MalformedInputException
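
If you want a boolean check rather than an exception, a small sketch of a wrapper could look like this (the helper name isUtf16WellFormed is hypothetical, not part of any library):

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper: reports whether a CharSequence is well-formed UTF-16
// by attempting a strict encode. A fresh encoder's default action on
// malformed input is CodingErrorAction.REPORT, i.e. it throws.
static boolean isUtf16WellFormed(CharSequence s) {
    CharsetEncoder encoder = Charset.forName("UTF-16LE").newEncoder();
    try {
        encoder.encode(CharBuffer.wrap(s));
        return true;
    } catch (CharacterCodingException e) {  // MalformedInputException is a subclass
        return false;
    }
}

Passing the str built from the unpaired lead surrogate above should return false, while an ordinary literal such as "abc" returns true.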
一二三