Java - UTF8/16 is a Charset Name or Character Encoding?

Question

The application I am developing will be used by folks in Western & Eastern Europe as well as in the US. I am encoding my input and decoding my output with the UTF-8 character set.

My confusion is because when I use the constructor String(byte[] bytes, String charsetName), I provide UTF-8 as the charsetName when it really is a character encoding. And my default encoding is set in Eclipse as Cp1252.

Does this mean that if, in the US, my Java application creates an output text file using Cp1252 as my charset encoding and UTF-8 as my charset name, the folks in Europe will be able to read this file in my Java application, and vice versa?

How would you "create an Output text file using Cp1252 as my charset encoding and UTF-8 as my charset name"? — Jon Skeet, Mar 11 '13 at 20:51
@ Jon Skeet...so how do the files get encoded? I thought it uses the OS default character encoding....correct? — user547453, Mar 11 '13 at 20:55
I can't answer that without seeing some code. I'd normally use FileOutputStream wrapped in an OutputStreamWriter, so it'll use whatever encoding I specify :) — Jon Skeet, Mar 11 '13 at 20:57
@JonSkeet...even OutputStreamWriter has the same kind of constructor, OutputStreamWriter(OutputStream out, Charset cs), where the Charset is something like UTF-8/16. Where do we specify an encoding like 'Cp1252' or 'ISO-8859-1'? — user547453, Mar 11 '13 at 21:03

Jon Skeet · Accepted Answer · 2013-03-11T20:56:47.350

They're encodings. It's a pity that Java uses "charset" all over the place when it really means "encoding", but that's hard to fix now :( Annoyingly, IANA made the same mistake.

Actually, by Unicode terminology they're probably most accurately character encoding schemes:

A character encoding form plus byte serialization. There are seven character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.

Where a character encoding form is:

Mapping from a character set definition to the actual code units used to represent the data.

Yes, the fact that Unicode only defines seven character encoding schemes makes this even more confusing. Fundamentally, all most developers need to know is that a "charset" in Java terminology is a mapping between text data (String, char[]) and binary data (byte[]).
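
To make the "mapping between text data and binary data" point concrete, here is a minimal sketch. The class and method names (`CharsetMappingSketch`, `encode`, `decode`) are made up for illustration; only `String.getBytes(Charset)`, `new String(byte[], Charset)` and `StandardCharsets` come from the JDK. It encodes the same string with two different charsets and shows that the resulting byte arrays differ, while each one decodes back to the original text with the matching charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMappingSketch {
    // Text -> binary: the charset decides which bytes represent the characters.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Binary -> text: the same charset must be used to get the original text back.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e

        byte[] utf8 = encode(text, StandardCharsets.UTF_8);        // e-acute takes 2 bytes
        byte[] latin1 = encode(text, StandardCharsets.ISO_8859_1); // e-acute takes 1 byte

        System.out.println("UTF-8 length:      " + utf8.length);
        System.out.println("ISO-8859-1 length: " + latin1.length);
        System.out.println("UTF-8 round trip ok:      "
                + decode(utf8, StandardCharsets.UTF_8).equals(text));
        System.out.println("ISO-8859-1 round trip ok: "
                + decode(latin1, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```

Whatever the name says, a `Charset` here behaves as an encoding: it is the rule that turns a `String` into a `byte[]` and back.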

edited Mar 11 '13 at 20:56

answered Mar 11 '13 at 20:51

Jon Skeet


+1. [The Javadoc for `java.nio.charset.Charset`](http://docs.oracle.com/javase/1.5.0/docs/api/java/nio/charset/Charset.html) explains that by "charset" the JDK means "A named mapping between sequences of sixteen-bit Unicode code units and sequences of bytes". – ruakh Mar 11 '13 at 20:54
By the way, [RFC 2978](http://www.ietf.org/rfc/rfc2978.txt) explains some of the rationale behind this nomenclature. It's not a Java-ism, but rather a standards-ism. (Though Java made it a bit worse by applying charsets to *code units* rather than to *characters*.) – ruakh Mar 11 '13 at 21:05

score 1 · Answer 2 · answered Mar 11 '13 at 21:02

I think those two things are not directly related.

The Eclipse setting decides how your Eclipse editor saves the text files (typically source code) you create or edit. You can use other editors, and then the file may be saved in some other encoding scheme. As long as your Java compiler has no problem compiling your source code, you're safe.

The Java constructor String(byte[] bytes, String charsetName) is part of your own application logic: it determines how you interpret data that you read from a file or the network. A different charsetName (essentially a different character encoding scheme) may interpret the same byte array differently.
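
As a rough illustration of "different interpretation of the byte array", the sketch below decodes the same bytes twice. The class name `DecodeMismatchSketch` and its `decode` helper are hypothetical. ISO-8859-1 is used as the "wrong" charset because the JDK guarantees it is available; for the two byte values in this example, Cp1252 maps to the same characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeMismatchSketch {
    // Interpret raw bytes as text under the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) encoded as UTF-8 is the two bytes 0xC3 0xA9.
        byte[] bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = decode(bytes, StandardCharsets.UTF_8);      // one char: U+00E9
        String wrong = decode(bytes, StandardCharsets.ISO_8859_1); // two chars: U+00C3 U+00A9

        System.out.println("decoded as UTF-8:      length " + right.length());
        System.out.println("decoded as ISO-8859-1: length " + wrong.length());
        System.out.println("same text? " + right.equals(wrong));
    }
}
```

The bytes never change; only the charset passed to the constructor decides which characters come out. That is why the reader has to use the encoding the writer used, regardless of either machine's default.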

answered Mar 11 '13 at 21:02

Shaohong Li


@Shaohong...so if a file is created in Eastern European language how do I read it? My default charset encoding is Cp1252. Which method should I use? – user547453 Mar 11 '13 at 21:06
It depends on what kind of file you're dealing with. You have to know the character encoding (or assume some default). If it is a formatted file, the encoding should be given somewhere in the file header. If it is pure text, then you have to rely on some "out of band" way of knowing the encoding. I think these two links have some related information: http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file and http://en.wikipedia.org/wiki/Text_file – Shaohong Li Mar 11 '13 at 21:09
@user547453 Forget your default charset encoding. I think this link http://docs.oracle.com/javase/tutorial/i18n/text/stream.html is your friend for understanding how to read/write a text file in a given character encoding scheme. – Shaohong Li Mar 11 '13 at 21:33
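
For what it's worth, a minimal sketch of reading and writing a text file with an explicitly chosen encoding, so the platform default (Cp1252 or otherwise) never comes into play. The class and method names are hypothetical, and the temporary file exists only for the demonstration; the JDK calls used are `Files.newBufferedWriter`, `Files.readAllBytes` and `Files.createTempFile`.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitEncodingFileSketch {
    // Write text using the charset passed in, not the platform default.
    static void writeText(Path file, String text, Charset charset) {
        try (BufferedWriter writer = Files.newBufferedWriter(file, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the whole file and decode it with the charset passed in.
    static String readText(Path file, Charset charset) {
        try {
            return new String(Files.readAllBytes(file), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Write to a temporary file and read it back with the same charset.
    static String roundTrip(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("encoding-sketch", ".txt");
            try {
                writeText(tmp, text, charset);
                return readText(tmp, charset);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String polish = "\u0141\u00f3d\u017a"; // the city name Lodz with Polish diacritics
        String back = roundTrip(polish, StandardCharsets.UTF_8);
        System.out.println("round trip preserved text: " + polish.equals(back));
    }
}
```

If both the writing and the reading side pass the same charset, for example UTF-8, the file should read back identically on a US or a European machine.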

score 1 · Answer 3 · answered Mar 11 '13 at 22:00

A "charset" does imply the set of characters that the text uses. For UTF-8/16, the character set happens to be "all" characters. For others, not necessarily. Back in the day, everybody was inventing their own character sets and encoding schemes, and the two mapped almost one-to-one, so a single name could refer to both the character set and the encoding scheme.