
My understanding is that Java uses UTF-16 by default (for String and char and possibly other types) and that UTF-16 is a major superset of most character encodings on the planet (though, I could be wrong). But I need a way to protect my app for when it's reading files that were generated with encodings (I'm not sure if there are many, or none at all) that UTF-16 doesn't support.

So I ask:

  1. Is it safe to assume the file is UTF-16 prior to reading it, or, to maximize my chances of not getting NPEs or other malformed input exceptions, should I be using a character encoding detector like JUniversalCharDet or JCharDet or ICU4J to first detect the encoding?
  2. Then, when writing to a file, I need to be sure that a character/byte didn't make it into the in-memory object (the String, the OutputStream, whatever) that produces garbage text/characters when written to a string or file. Ideally, I'd like some way of catching such a garbage-producing character before it makes it into the file I am writing. How do I safeguard against this?

Thanks in advance.

IAmYourFaja
  • UTF16 **is** an encoding. It can encode **all** Unicode characters. Beware that surrogate pairs are difficult to work with. – SLaks Feb 26 '13 at 21:26
  • Detecting the encoding of a file without a BOM is not trivial. – SLaks Feb 26 '13 at 21:26
  • Thanks @SLaks (+1 on both) - I assume though that there are encodings/sets outside of Unicode? What if we're reading a file that was created with one of these non-Unicode encodings? That is really at the heart of my question. – IAmYourFaja Feb 26 '13 at 21:35
  • ...for example, what if we encounter something outside the [`StandardCharsets`](http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html)? – IAmYourFaja Feb 26 '13 at 21:40

2 Answers


Whenever a conversion between bytes and characters takes place, Java lets you specify the character encoding to use. If none is specified, a machine-dependent default encoding is used. In some encodings, the bit pattern representing a certain character bears no similarity to the bit pattern used for the same character in UTF-16.
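As a small JDK-only sketch of that point, the same character maps to entirely different byte patterns depending on which charset you pass to `getBytes`:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingBytes {
    public static void main(String[] args) {
        String s = "\u00E9"; // é, LATIN SMALL LETTER E WITH ACUTE
        // Same character, three different byte representations:
        byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);      // 0xC3 0xA9
        byte[] latin = s.getBytes(StandardCharsets.ISO_8859_1); // 0xE9
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE);   // 0x00 0xE9
        System.out.println(Arrays.toString(utf8));  // [-61, -87]
        System.out.println(Arrays.toString(latin)); // [-23]
        System.out.println(Arrays.toString(utf16)); // [0, -23]
    }
}
```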

To question 1 the answer is therefore "no", you cannot assume the file is encoded in UTF-16.

Which characters are representable depends on the encoding used.

Henry

Java normally uses UTF-16 for its internal representation of characters. In Java, char arrays are a sequence of UTF-16 encoded Unicode codepoints. By default, char values are considered big endian (as all Java primitive types are). However, you should not use char values directly to write strings to files or memory; use the character encoding/decoding facilities in the Java API instead (see below).
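A quick illustration of those UTF-16 code units: a supplementary character (one outside the Basic Multilingual Plane) occupies two `char` values even though it is a single code point:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D541 MATHEMATICAL DOUBLE-STRUCK CAPITAL J lies outside the BMP,
        // so UTF-16 encodes it as a surrogate pair: two char values.
        String s = new String(Character.toChars(0x1D541));
        System.out.println(s.length());                             // 2 code units
        System.out.println(s.codePointCount(0, s.length()));        // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    }
}
```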

UTF-16 is not a superset of most encodings. However, UTF-8 and UTF-16 can both encode any Unicode code point. In that sense, Unicode defines almost any character that you could possibly want to use in modern communication.

If you read a file from disk and assume UTF-16, you will quickly run into trouble. Most text files use ASCII or an extension of ASCII that uses all 8 bits of a byte. Examples of such extensions are UTF-8 (which can be used to read any ASCII text) and ISO 8859-1 (Latin-1). Then there are many encodings, e.g. those used by Windows, that extend those extensions. UTF-16 is not compatible with ASCII, so it should not be used as the default for most applications.
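A sketch of that incompatibility: two ASCII bytes read back unchanged when decoded as UTF-8, but the same two bytes collapse into one unrelated character when decoded as UTF-16:

```java
import java.nio.charset.StandardCharsets;

public class AsciiVsUtf16 {
    public static void main(String[] args) {
        byte[] ascii = "Hi".getBytes(StandardCharsets.US_ASCII);          // 0x48 0x69
        System.out.println(new String(ascii, StandardCharsets.UTF_8));    // Hi
        // UTF-16BE reads the two bytes as one 16-bit unit: U+4869, a CJK character
        System.out.println(new String(ascii, StandardCharsets.UTF_16BE));
    }
}
```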

So yes, please use some kind of detector if you want to read a lot of plain text files with unknown encoding. This should answer question #1.
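If you don't want to pull in a detector library, a minimal JDK-only fallback is to sniff the byte-order mark (BOM) yourself. This sketch is an assumption-laden simplification: it handles only UTF-8 and UTF-16, ignores the UTF-32 BOMs (which overlap with UTF-16LE's), and a file without a BOM still needs a heuristic detector or a configured default:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    /** Returns the charset implied by a leading BOM, or null if none is present. */
    static Charset fromBom(byte[] b) {
        if (b.length >= 3 && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB && b[2] == (byte) 0xBF)
            return StandardCharsets.UTF_8;
        if (b.length >= 2 && b[0] == (byte) 0xFE && b[1] == (byte) 0xFF)
            return StandardCharsets.UTF_16BE;
        if (b.length >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE)
            return StandardCharsets.UTF_16LE;
        return null; // no BOM: fall back to a detector library or a configured default
    }

    public static void main(String[] args) {
        byte[] header = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(fromBom(header)); // UTF-8
    }
}
```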

As for question #2, think of a file that is completely ASCII. Now you want to add a character that is not in ASCII, so you choose UTF-8 (a pretty safe bet). There is no way of knowing whether the program that later opens the file will correctly guess that it should use UTF-8. It may try Latin-1 or, even worse, assume 7-bit ASCII. In that case you get garbage. Unfortunately, there are no smart tricks to make sure this never happens.
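You can reproduce that garbage in two lines: encode a non-ASCII character as UTF-8 and decode the bytes with the wrong (Latin-1) guess:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8); // "café"
        // Wrong guess: each UTF-8 byte of é becomes a separate Latin-1 character
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong); // cafÃ©
    }
}
```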

Look into the CharsetEncoder and CharsetDecoder classes to see how Java handles encoding/decoding.
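A sketch of what those classes buy you: by default, `String` silently replaces unmappable or malformed input, but an encoder/decoder configured with `CodingErrorAction.REPORT` throws instead, and `CharsetEncoder.canEncode` lets you check individual characters up front:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictCodec {
    public static void main(String[] args) {
        // Encoder that throws instead of silently replacing unmappable characters
        CharsetEncoder enc = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        System.out.println(enc.canEncode('\u00E9')); // true: Latin-1 has é
        System.out.println(enc.canEncode('\u20AC')); // false: Latin-1 has no euro sign

        try {
            enc.encode(CharBuffer.wrap("price: \u20AC5"));
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e); // UnmappableCharacterException
        }

        // Decoder that rejects byte sequences that are not valid UTF-8
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT);
        try {
            dec.decode(ByteBuffer.wrap(new byte[]{(byte) 0xC3})); // truncated UTF-8 sequence
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e); // MalformedInputException
        }
    }
}
```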

Maarten Bodewes
  • The beginning is misleading, you say _[Java] uses Unicode code points_ (which is correct), but before that you say _Java does not use UTF-16_ (which is wrong). Each `char` represents a Unicode code point, encoded in UTF-16 – jlordo Feb 26 '13 at 21:49
  • Thanks @owlstead (+1) - so there's no way to check that each character (as I write it to file) isn't a valid character in the designated encoding? Nothing like `Character.verifyEncoding(someChar, "UTF-16");`??? – IAmYourFaja Feb 26 '13 at 21:51
  • Now it's worse. You say: "_Java does not use UTF-16 for its internal representation of characters_" whereas you should say "_Java **uses** UTF-16 for its internal representation of characters_" – jlordo Feb 26 '13 at 21:52
  • See the [Documentation](http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html) It clearly states: **The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes.** – jlordo Feb 26 '13 at 21:54
  • @jlordo OK, fixed it best I could without making this into a whole chapter on string encoding, basically as user you just see char's though, which makes the whole point a bit moot. When encoding you want to know how the text looks as bytes, and Java may not represent the characters in memory as you would expect. Not that it matters because you cannot get to the memory representation in the first place. – Maarten Bodewes Feb 26 '13 at 21:58
  • That's great guys, I'm glad you both worked it out. But what about my suggestion for verifying each character against a specific encoding? – IAmYourFaja Feb 26 '13 at 21:59
  • @DirtyMikeAndTheBoys Check the `CharsetDecoder` class first, and see if you understand how it works. That might answer your question. – Maarten Bodewes Feb 26 '13 at 22:00
  • So, you mean define event handlers for malformed inputs (for a specific charset); and then, if such an event happens, I know that a bad (unsupported) char got in there somehow? Is that what you mean? Thanks again! – IAmYourFaja Feb 26 '13 at 22:01
  • @DirtyMikeAndTheBoys Yup, that's basically it. Beware that the methods defined for `String` accept invalid characters without so much as a warning, you *should* use `CharsetEncoder` and `CharsetDecoder` for applications that have requirements like your application has. – Maarten Bodewes Feb 26 '13 at 22:05
  • Minor note: as UTF-16 uses "units" (may have that term wrong) of 16 bits you can encode it using little endian (UTF-16LE) and big endian. Then you may use a BOM header at the start of the file or not. So just choosing UTF-16 still leaves you with a choice or two. – Maarten Bodewes Feb 26 '13 at 22:10
  • @jlordo it was still not up to standards (at least, my standards) so I rewrote the entire first paragraph, if you have remarks please share, thanks for the criticism up to now. – Maarten Bodewes Feb 26 '13 at 22:18
  • @owlstead: It has no more wrong information. I don't understand the sentence _char arrays are semantically speaking a form of UTF-16 encoding_. I'd simply put _In Java char arrays are a sequence of UTF-16 encoded Unicode codepoints._ – jlordo Feb 26 '13 at 22:34
  • @jlordo OK, I'll change it to that. I still have the urge to make clear that the internal memory representation may *not* be UTF-16 as it is an implementation detail. – Maarten Bodewes Feb 26 '13 at 22:45
  • @owlstead: Apart from memory optimizations in newer Versions of Java and serialized Strings it is UTF-16 internally (in Oracle Java and all other versions I know of). Do you know any exception to that rule? – jlordo Feb 26 '13 at 22:53
  • @jlordo no, but I would not be surprised if there were a few (possibly in the future). That said, it would probably be a nightmare to rewrite the String if you want to use it with a native library. I'm currently programming Java Card classic, and it certainly encodes strings differently (it doesn't :P) For the asker it doesn't matter much, the internal representation should be out of reach. – Maarten Bodewes Feb 26 '13 at 22:58