
What Unicode character encoding does a char object correspond to in:

  • C#

  • Java

  • JavaScript (I know there is not actually a char type but I am assuming that the String type is still implemented as an array of Unicode characters)

In general, is there a common convention among programming languages to use a specific character encoding?

Update

  1. I have tried to clarify my question. The changes I made are discussed in the comments below.
  2. Re: "What problem are you trying to solve?", I am interested in code generation from language-independent expressions, and the particular encoding of the file is relevant.
smartcaveman
  • Don't confuse character sets with character encodings. There is one Unicode character set (ignoring revisions to the Unicode standard), meaning one defined set of characters, but many encodings like UTF-16 and UTF-8. – gatkin Jul 01 '11 at 14:24
  • What problem are you trying to solve, exchanging data between software using these languages? – Kwebble Jul 01 '11 at 14:31
  • This question is broken. There is no choice of "unicode character sets", there is just **the** Unicode character set, and it's an abstract collection of numbers with defined semantics. – Kerrek SB Jul 01 '11 at 15:05
  • What is your actual question or intent? This will help provide a matching answer that actually helps solve your problem. – eFloh Jul 01 '11 at 15:10
  • @kerreksb, @gatkin - I have updated my question. I was using "character set" to refer to "character encoding". I was confused because of the HTTP Content Type `charset` parameter. Is this formulation more appropriate/accurate? – smartcaveman Jul 01 '11 at 17:04
  • @smartcave: No, that's even worse. There's no such thing as a "unicode encoding version". You should read up some of the basics perhaps. A sensible question would be, "how wide is the `char` type in these languages...", and "Are string functions in these languages encoding-aware", or something like that. Or "how can I best handle unicode strings in these languages". – Kerrek SB Jul 01 '11 at 17:06
  • @kerreksb, I am looking for the term to specify a member of the set including UTF7, UTF8, UTF16 and UTF32. What is the correct terminology? – smartcaveman Jul 01 '11 at 17:10
  • @kwebble, @efloh, Please see my updates. – smartcaveman Jul 01 '11 at 17:15
  • @smartcave: Those are "encodings". Generally, the basic types of most low-level programming languages are entirely encoding-agnostic, since an encoding establishes a higher-level semantics of a data string, namely a *textual* meaning. This is usually outside the scope of a core language and should be left to a library, since it's a very intricate and subtle topic with lots of complexity that you do not usually want to force onto the lowest-level data types. – Kerrek SB Jul 01 '11 at 17:18

3 Answers


In C# and Java it's UTF-16.
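
To make that concrete, here is a minimal Java sketch (the class name is just for illustration); C#'s `System.Char` behaves analogously: a `char` is one 16-bit UTF-16 code unit, so a codepoint outside the Basic Multilingual Plane needs two of them.

```java
public class CharWidthDemo {
    public static void main(String[] args) {
        // A Java char is a 16-bit UTF-16 code unit.
        System.out.println(Character.SIZE);    // 16

        // A BMP codepoint such as U+20AC (EURO SIGN) fits in one char...
        String euro = "\u20AC";
        System.out.println(euro.length());     // 1

        // ...but a supplementary codepoint such as U+1F600 becomes a
        // surrogate pair, i.e. two chars.
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(emoji.length());    // 2
    }
}
```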

LukeH
  • For Java, you can find info about this in the API documentation of class `java.lang.Character`. – Jesper Jul 01 '11 at 14:18
  • @leonbloy: True, although it's not really clear exactly what the OP wants to know. Are they talking about character sets, encodings, planes, or something else altogether? – LukeH Jul 01 '11 at 14:51
  • This isn't really an answer to the question, but the question was already broken. – Kerrek SB Jul 01 '11 at 15:04
  • I would argue against the directive "store your strings as raw Unicode". Unicode codepoints can still contain "indirections", like character composition, so a specialized library will still be needed. Larger strings (for the majority of languages) will take longer to process, because more memory* will be needed to store them (4 times the string size will need 4 times the time to load from RAM to the processor). *for a Euro-American language – andowero Sep 05 '21 at 16:20

I'm not sure that I am answering your question, but let me make a few remarks that hopefully shed some light.

At the core, general-purpose programming languages like the ones we are talking about (C, C++, C#, Java, PHP) do not have a notion of "text", merely of "data". Data consists of sequences of integral values (i.e. numbers). There is no inherent meaning behind those numbers.

The process of turning a stream of numbers into a text is one of semantics, and it is usually left to the consumer to assign the relevant semantics to a data stream.

Warning: I will now use the word "encoding", which unfortunately has multiple inequivalent meanings. The first meaning of "encoding" is the assignment of meaning to a number. The semantic interpretation of a number is also called a "character". For example, in the ASCII encoding, 32 means "space" and 65 means "capital A". ASCII only assigns meanings to 128 numbers, so every ASCII character can be conveniently represented by a single 8-bit byte (with the top bit always 0). There are many encodings which assign characters to 256 numbers, thus all using one byte per character. In these fixed-width encodings, a text string has as many characters as it takes bytes to represent. There are also other encodings in which characters take a variable number of bytes to represent.
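
As a rough Java illustration of this first sense of "encoding" (a sketch with an arbitrary class name, not part of the original answer): an encoding assigns meanings to numbers, and fixed-width vs. variable-width determines how many bytes a character needs.

```java
import java.nio.charset.StandardCharsets;

public class ByteMeaningDemo {
    public static void main(String[] args) {
        // In ASCII, the number 65 means "capital A" and 32 means "space".
        byte[] data = { 65, 32, 66 };
        System.out.println(new String(data, StandardCharsets.US_ASCII));      // "A B"

        // ISO-8859-1 is a fixed-width one-byte encoding: 0xE9 means "é".
        byte[] latin1 = { (byte) 0xE9 };
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));  // "é"

        // UTF-8 is variable-width: the very same "é" takes two bytes there.
        System.out.println("é".getBytes(StandardCharsets.UTF_8).length);      // 2
    }
}
```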

Now, Unicode is also an encoding, i.e. an assignment of meaning to numbers. On the first 128 numbers it is the same as ASCII, but it assigns meanings to (theoretically) 2^21 numbers. Because there are lots of meanings which aren't strictly "characters" in the sense of writing (such as zero-width joiners or diacritic modifiers), the term "codepoint" is preferred over "character". Nonetheless, any integral data type that is at least 21 bits wide can represent one codepoint. Typically one picks a 32-bit type, and this encoding, in which every element stands for one codepoint, is called UTF-32 or UCS-4.
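
In Java terms (a sketch, with names of my choosing), a codepoint is simply an `int`, and a sequence of 32-bit values holding one codepoint each is effectively UTF-32/UCS-4 text:

```java
public class CodepointDemo {
    public static void main(String[] args) {
        // Unicode assigns the "grinning face" meaning to the number 0x1F600.
        int codepoint = 0x1F600;
        System.out.println(Character.isValidCodePoint(codepoint)); // true
        System.out.println(Character.charCount(codepoint));        // 2 (UTF-16 code units needed)

        // An int[] of codepoints is, in effect, a UTF-32/UCS-4 string.
        int[] codepoints = { 0x48, 0x69, 0x1F600 };                 // "Hi" plus the emoji
        System.out.println(new String(codepoints, 0, codepoints.length));
    }
}
```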

Now we have a second meaning of "encoding": I can take a string of Unicode codepoints and transform it into a string of 8-bit or 16-bit values, thus further "encoding" the information. In this new, transformed form (called "unicode transformation format", or "UTF"), we now have strings of 8-bit or 16-bit values (called "code units"), but each individual value does not in general correspond to anything meaningful -- it first has to be decoded into a sequence of Unicode codepoints.
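
A short Java sketch of this second sense of "encoding" (again with arbitrary names): the same two codepoints become a different number of code units depending on the transformation format, and only decoding recovers the codepoints.

```java
import java.nio.charset.StandardCharsets;

public class UtfDemo {
    public static void main(String[] args) {
        String s = "é€";  // two codepoints: U+00E9 and U+20AC

        // UTF-8: 8-bit code units, a variable number per codepoint.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);  // 5 (2 bytes for é, 3 bytes for €)

        // UTF-16: 16-bit code units; a Java String is already a sequence of them.
        System.out.println(s.length());   // 2 (both codepoints are in the BMP)

        // Decoding turns the code units back into codepoints.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(s)); // true
    }
}
```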

Thus, from a programming perspective, if you want to modify text (not bytes), then you should store your text as a sequence of Unicode codepoints. Practically that means that you need a 32-bit data type. The char data type in C and C++ is usually 8 bits wide (though that's only a minimum), while in C# and Java it is always 16 bits wide. An 8-bit char could conceivably be used to store a transformed UTF-8 string, and a 16-bit char could store a transformed UTF-16 string, but in order to get at the raw, meaningful Unicode codepoints (and in particular at the length of the string in codepoints) you will always have to perform decoding.
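
For example, in Java (a sketch; the string content is arbitrary), the answer to "how long is this string?" depends on whether you count code units or decoded codepoints:

```java
public class LengthDemo {
    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so Java's
        // 16-bit chars store it as a surrogate pair.
        String s = "G\uD834\uDD1E";

        System.out.println(s.length());                        // 3 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));   // 2 Unicode codepoints

        // Iterating by codepoint requires decoding the surrogate pairs.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp)); // U+0047, U+1D11E
    }
}
```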

Typically your text processing libraries will be able to do the decoding and encoding for you, so they will happily accept UTF-8 and UTF-16 strings (but at a price), but if you want to spare yourself this extra indirection, store your strings as raw Unicode codepoints in a sufficiently wide type.

Kerrek SB

In Java:

The char data type is a single 16-bit Unicode character.

Taken from http://download.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html

In C#:

A single Unicode character

Taken from http://msdn.microsoft.com/en-us/library/ms228360(v=vs.80).aspx
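
A brief Java sketch (class name arbitrary) of the caveat the comments below raise: a `char` really holds one UTF-16 code unit, so it may legally contain an unpaired surrogate, and only a high/low pair together encodes a supplementary codepoint.

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // A char can hold half of a surrogate pair on its own.
        char high = '\uD834';
        System.out.println(Character.isHighSurrogate(high));  // true

        // Only the pair together denotes an actual supplementary codepoint.
        int cp = Character.toCodePoint('\uD834', '\uDD1E');
        System.out.printf("U+%X%n", cp);                       // U+1D11E
    }
}
```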

Marcelo
  • At least the Java sentence here seems to be from the time when Unicode still was a 16-bit character set. Now it is **one UTF-16 code unit**, which is either a character in the Basic multilingual plane or a surrogate (e.g. half a surrogate pair). (I suppose in C# it is similar.) – Paŭlo Ebermann Jul 01 '11 at 15:41
  • The phrase "single 16-bit unicode character" is a contradiction in itself. First off, it should be "unicode codepoint". Second, not every codepoint fits into 16 bits, and to represent non-BMP codepoints in 16-bit code *units* you have to use a surrogate pair. However, it's perfectly possible to store a single surrogate widow in a Java char, and it's possible to store a surrogate pair that codes a value that has not been assigned to any code point. **Basic data types only store numbers without implicit semantics.** – Kerrek SB Jul 01 '11 at 17:10