1
// non-UTF source file encoding
char ch = 'й';                 // some number within 0..65535 is stored in the char
System.out.println(ch);        // the same number is output to the console
System.out.println((int) ch);  // and printed as a plain number

"java internal encoding is UTF16". Where does it meanfully come to play in that?

Besides, I can perfectly well put a single UTF-16 code unit from the surrogate range (say '\uD800') into a char, making that char invalid Unicode on its own. And let us stay within the BMP, to avoid thinking about the case where a supplementary symbol takes 2 chars (code units); thinking about that case makes "char internally uses UTF-16" sound like complete nonsense to me. But maybe "char internally uses UTF-16" makes sense within the BMP?

I could understand it if it worked like this: my source file is in windows-1251 encoding, the char literal is converted to a number according to windows-1251 (which really happens), then that number is automatically converted to another number, from the windows-1251 value to the UTF-16 value, which is NOT taking place (am I right?!). That conversion is what I could have understood as "internally uses UTF-16". Then the stored number is written to the output (in reality it is written as given, i.e. still the windows-1251 value; no imaginary conversion from "internal UTF-16" to the output/console encoding takes place), and the console turns the number into a glyph using the console encoding (which really happens).

So this "UTF16 encoding used internally" is NEVER USED ANYHOW ??? char just stores any number (in [0..65535]), and besides specific range and being "unsigned" has NO DIFFERENCE FROM int (in scope of my example of course)???

P.S. Experimentally, the code above with UTF-8 encoding for both the source file and the console outputs

й
1081

with win-1251 encoding of the source file and UTF-8 in the console, it outputs

�
65533

The same output appears if we use String instead of char:

String s = "й";
System.out.println(s);

In the API, methods that take a char as an argument usually never take an encoding as an argument, while methods that take a byte[] often take an encoding as another argument. This implies that with a char we don't need an encoding (i.e. that we know the encoding for sure). But **how on earth do we know what encoding something was put into a char with? If char is just storage for a number, don't we need to know what encoding that number originally came from?**
So char vs byte is just that a char holds two bytes of something in an UNKNOWN encoding (instead of one byte of something in an UNKNOWN encoding for a byte)? Given some initialized char variable, we don't know what encoding to use to display it correctly (i.e. which console encoding to choose for output), and we cannot tell what the encoding of the source file was where it was initialized from a char literal (not counting cases where the various encodings happen to be compatible).
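For example, this is the contrast I mean (just a minimal illustration; the two bytes below are the UTF-8 encoding of 'й'):

import java.nio.charset.StandardCharsets;

public class CharVsBytes {
    public static void main(String[] args) {
        // bytes -> text: a charset MUST be supplied, because raw bytes are ambiguous
        byte[] raw = {(byte) 0xD0, (byte) 0xB9};                      // UTF-8 bytes of 'й'
        System.out.println(new String(raw, StandardCharsets.UTF_8));  // й

        // char-based APIs never ask for a charset
        char ch = 'й';
        System.out.println(Character.isLetter(ch));                   // true
        System.out.println((int) ch);                                 // 1081
    }
}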

Am I right, or am I a big idiot? Sorry for asking in the latter case :)))

My research on SO shows no direct answer to my question.

LrnBoy
    "Given some initialized char variable, we don't know what encoding to use to correctly display it (to choose correct console encoding for output)" you've got it the wrong way round: given *some console*, you need to know the character encoding *of that console* in order to display a character correctly (if it's even possible). A char is unambiguously in UTF-16. – Andy Turner Jun 13 '17 at 11:35
  • Does that mean that in char ch = 'й'; literal 'й' is always converted to utf16 from whatever encoding source file was in? – LrnBoy Jun 13 '17 at 11:48
  • 2
    @LrnBoy yes, exactly. – Henry Jun 13 '17 at 11:49

2 Answers


In most cases it is best to think of a char just as a certain character (independent of any encoding), e.g. the character 'A', and not as a 16-bit value in some encoding. Only when you convert between char or a String and a sequence of bytes does the encoding play a role.
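For example, the same String produces different bytes depending on the charset you ask for (a small sketch; it assumes the windows-1251 charset is available in your JDK, which it is in typical desktop installs):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingAtTheBoundary {
    public static void main(String[] args) {
        String s = "й";                                               // characters, no encoding attached
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);             // [-48, -71] = 0xD0 0xB9
        byte[] cp1251 = s.getBytes(Charset.forName("windows-1251"));  // [-23]      = 0xE9
        System.out.println(Arrays.toString(utf8));
        System.out.println(Arrays.toString(cp1251));
        // Coming back from bytes needs the charset again:
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // й
    }
}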

The fact that a char is internally encoded as UTF-16 is only important if you have to deal with its numeric value.

Surrogate pairs are only meaningful in a character sequence. A single char cannot hold a character value outside the BMP. This is where the character abstraction breaks down.
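A quick way to see that limit (a small sketch using the standard Character helpers):

public class OutsideTheBmp {
    public static void main(String[] args) {
        int smiley = 0x1F600;                      // a code point outside the BMP
        char[] units = Character.toChars(smiley);  // its UTF-16 form: a surrogate pair
        System.out.println(units.length);                        // 2 (one char is not enough)
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true
    }
}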

Henry

Unicode is a system for expressing textual data as code points. These are typically characters, but not always. A Unicode code point is always represented in some encoding. The common ones are UTF-8, UTF-16 and UTF-32, where the number indicates the number of bits in a code unit. (For example, UTF-8 is encoded as 8-bit bytes, and UTF-16 is encoded as 16-bit words.)
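For instance, the single code point U+00E9 ('é') occupies a different number of code units in each of these encodings (a small sketch; the BE variants are used only to avoid a byte-order mark, and it assumes the UTF-32BE charset is available, as it is in OpenJDK):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodeUnitSizes {
    public static void main(String[] args) {
        String s = "é"; // the single code point U+00E9
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 2 (two 8-bit code units)
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);   // 2 (one 16-bit code unit)
        System.out.println(s.getBytes(Charset.forName("UTF-32BE")).length); // 4 (one 32-bit code unit)
    }
}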

While the first version of Unicode only allowed code points in the range U+0000 to U+FFFF, Unicode 2.0 changed the range to U+0000 to U+10FFFF.

So, clearly, a Java (16 bit) char is no longer big enough to represent every Unicode code point.

This brings us back to UTF-16. A Java char can represent Unicode code points that are less than or equal to U+FFFF. For larger code points, the UTF-16 representation consists of two 16-bit values, a so-called surrogate pair, and that fits into two Java chars. So, in fact, the standard representation of a Java String is a sequence of char values that constitute the UTF-16 representation of its Unicode code points.

If you are working with most modern languages (including CJK with simplified characters), the Unicode code points of interest are all found in code plane zero (U+0000 through U+FFFF). If you can make that assumption, then it is possible to treat a char as a Unicode code point. However, increasingly we are seeing code points in higher planes. A common case is the code points for emojis.

If you look at the javadoc for the String class, you will see a bunch of methods like codePointAt, codePointCount and so on. These allow you to handle text data properly; that is, to deal with the surrogate-pair cases.
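For example (a small sketch; the supplementary code point U+1F600 is built with Character.toChars so the source stays ASCII):

public class CodePointCounting {
    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1F600)) + "b"; // 'a', U+1F600, 'b'
        System.out.println(s.length());                       // 4 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));  // 3 (Unicode code points)
        // Walk the text code point by code point:
        s.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp))); // 61, 1f600, 62
    }
}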


So how does this relate to UTF-8, windows-1251 and so on?

Well, these are 8-bit character encodings that are used at the OS level in text files and so on. When you read a file using a Java Reader, your text is effectively transcoded from UTF-8 (or windows-1251) into UTF-16. When you write characters out (using a Writer), you transcode in the other direction.
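A minimal sketch of that transcoding (the file names are hypothetical, and it assumes the input really is windows-1251 text):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {
    public static void main(String[] args) throws IOException {
        try (Reader in = new InputStreamReader(
                 new FileInputStream("in.txt"), Charset.forName("windows-1251"));
             Writer out = new OutputStreamWriter(
                 new FileOutputStream("out.txt"), StandardCharsets.UTF_8)) {
            // The Reader decodes windows-1251 bytes into UTF-16 chars;
            // the Writer encodes those chars back out as UTF-8.
            char[] buf = new char[4096];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
        }
    }
}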

This doesn't always work.

  • Many character encodings such as windows-1251 are not capable of representing the full range of Unicode code points. So, if you attempt to write (say) a CJK character via a Writer configured with windows-1251, you will get ? characters instead.

  • If you read an encoded file using the wrong character encoding (for example, if you attempt to read a UTF-8 file as windows-1251, or vice versa), then the transcoding is liable to give garbage. This phenomenon is so common that it has a name: Mojibake. Both cases are illustrated in the sketch after this list.
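Here is a small sketch of both failure modes (it assumes the windows-1251 charset is available in the JDK; the characters are just examples):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyAndMojibake {
    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");

        // 1. windows-1251 cannot represent a CJK character: the encoder substitutes '?'
        byte[] lossy = "漢".getBytes(cp1251);
        System.out.println(new String(lossy, cp1251));        // ?

        // 2. Decoding UTF-8 bytes with the wrong charset gives garbage (mojibake)
        byte[] utf8 = "й".getBytes(StandardCharsets.UTF_8);   // 0xD0 0xB9
        System.out.println(new String(utf8, cp1251));         // Р№
    }
}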


You asked:

Does that mean that in char ch = 'й'; the literal 'й' is always converted to UTF-16 from whatever encoding the source file was in?

Now we are (presumably) talking about Java source code. The answer is that it depends. Basically, you need to make sure that the Java compiler uses the correct encoding to read the source file. This is typically specified using the -encoding command line option. (If you don't specify the -encoding then the "platform default converter" is used; see the javac manual entry.)
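For example, if Main.java is a (hypothetical) source file saved as windows-1251, you would tell the compiler so explicitly:

javac -encoding windows-1251 Main.java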

Assuming that you compile your source code with the correct encoding (i.e. matching the actual representation in the source file), the Java compiler will emit code containing the correct UTF-16 representation of any String literals.

However, note that this is independent of the character encoding that your application uses to read and write files at runtime. That encoding is determined by what your application selects or the execution platform's default encoding.
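If you want to see what the execution platform's default is, a quick check (just a sketch; Charset.defaultCharset() is what the runtime falls back to whenever your code does not name a charset explicitly):

import java.nio.charset.Charset;

public class ShowDefaultCharset {
    public static void main(String[] args) {
        // The charset used when no charset is given explicitly
        System.out.println(Charset.defaultCharset());
        // The system property it is usually derived from
        System.out.println(System.getProperty("file.encoding"));
    }
}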

Stephen C