
In C, I specify a Unicode character with the form:

"\uCODEPOINT"

However, I can't find any details on how that is stored. Is it UTF-8, UTF-16, or UTF-32? Is there a notation to specify UTF-8 encoding, or do I have to write each byte in hexadecimal?

Marco Scannadinari
  • It's probably UTF-8 variable width, but I'd like to know as well. The only info I can find pertains to C++, at http://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c11. – A Person Jan 15 '14 at 20:43
  • It's well documented that \u is followed by 4 hex digits. – Jim Balter Jan 15 '14 at 20:48
  • I think you need to review Unicode concepts. Start at http://www.unicode.org/. You will find that "Is there a notation for UTF-8 characters" is incoherent -- UTF-8 is an encoding for Unicode code points. – Jim Balter Jan 15 '14 at 20:51

2 Answers


\uXXXX is a (short form) universal character name; the long form is \UXXXXXXXX. You can use, say, \u00E9 anywhere in your program in place of é: in the source text, e.g., as part of an identifier, or in a character or string literal. If you use it in a literal, it will be exactly the same as if you used é in that literal. (C99 and C11 do not permit universal character names for most code points below U+00A0, so \u0041 cannot stand in for A.) The same applies to characters with encodings longer than 8 bits: you can use the universal character name, or you can enter the character directly if you have an input method that allows you to.

How the character is encoded in memory is implementation-dependent. It depends on whether the character appears in a "" or L"" literal, and on whether the character is a member of the execution character set. Note this from the C standard:

Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.

In an implementation that uses UTF-8 to represent non-wide strings, a \uXXXX appearing in a non-wide string literal will of course be encoded in UTF-8, along with all the other characters in the literal. If the \uXXXX occurs in a wide string literal, it will be encoded as a wide character with value 0xXXXX, at least on implementations whose wchar_t values are Unicode code points (those that define __STDC_ISO_10646__).
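Since the result depends on the implementation, the simplest check is to look at the bytes your own compiler produced. Here is a minimal sketch of that; hex_bytes and report_encoding are names I made up for illustration, and U+00E9 is just an arbitrary test character.

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: write the bytes of s into out as space-separated hex. */
void hex_bytes(const char *s, char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return;
    out[0] = '\0';
    /* Each byte needs at most 3 characters plus the terminating NUL. */
    for (; *s != '\0' && used + 4 <= outsize; s++)
        used += (size_t)snprintf(out + used, outsize - used, "%s%02X",
                                 used ? " " : "", (unsigned char)*s);
}

/* U+00E9 in a narrow and in a wide string literal. */
const char narrow[] = "\u00e9";
const wchar_t wide[] = L"\u00e9";

/* Print how this particular implementation stored the two literals. */
void report_encoding(void)
{
    char buf[64];

    hex_bytes(narrow, buf, sizeof buf);
    printf("narrow: %s (%zu bytes)\n", buf, strlen(narrow));
    printf("wide:   0x%lX (wchar_t is %zu bytes)\n",
           (unsigned long)wide[0], sizeof(wchar_t));
}
```

Call report_encoding() from main. Where narrow strings are UTF-8, the narrow line should show the two bytes C3 A9; other execution character sets will give something else, which is the whole point of checking.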

Jim Balter
  • I believe I can prepend the literal with "u8" to specify UTF-8 encoding in C11, but how would I have done that in earlier standards? – Marco Scannadinari Jan 16 '14 at 16:25
  • To portably create UTF-8 literals in earlier versions of C you would have to manually enter the hex values, as you mention in your (edited) question. However, some implementations, such as some versions of gcc and Visual Studio, support UTF-8 encoding of "narrow" strings. Check the documentation of your implementations ... and you can of course print out the bytes of the strings or just look at the generated assembly code to see how characters are being encoded. – Jim Balter Jan 17 '14 at 00:48
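To illustrate the "manually enter the hex values" route from the comment above, here is a small sketch. The strings and the dump_bytes helper are my own examples, not anything from the standard. One thing to watch for: a hex escape swallows every hex digit that follows it, so the literal has to be split when the next character happens to be one of 0-9 or a-f.

```c
#include <stdio.h>
#include <string.h>

/* "café" with U+00E9 spelled out as its two UTF-8 bytes, C3 A9. */
const char cafe[] = "caf\xC3\xA9";

/*
 * "école": the letter c is a hex digit, so in "\xC3\xA9cole" the second
 * escape would be read as \xA9c, which is out of range. Splitting the
 * literal ends the escape; adjacent string literals are concatenated
 * afterwards.
 */
const char ecole[] = "\xC3\xA9" "cole";

/* Print each byte of s in hex so the encoding can be checked by eye. */
void dump_bytes(const char *s)
{
    for (; *s != '\0'; s++)
        printf("%02X ", (unsigned char)*s);
    putchar('\n');
}
```

This works with any C compiler, since the bytes are given explicitly and no character set conversion is involved. Whether they display as é still depends on the terminal or program reading them.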

However, I can't find any details on how that is stored.

The execution character set is implementation-dependent. However, some compilers have options to change it if the default is not what you want (GCC, for example, has -fexec-charset). The C11 standard adds string literal prefixes that pin down the element type and encoding: u8"Hello" for UTF-8, plus u"Hello" and U"Hello" for char16_t and char32_t strings.
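As a rough sketch of the C11 route, assuming a compiler with C11 support: the array name and the checking function below are made up for the example.

```c
/*
 * u8 string literals (C11 and later) are encoded as UTF-8 regardless of
 * the execution character set. The array is declared unsigned char so
 * the same line also compiles where u8 literals have an unsigned
 * element type.
 */
const unsigned char e_acute[] = u8"\u00e9";

/* Return 1 if e_acute holds exactly the UTF-8 bytes C3 A9 plus NUL. */
int e_acute_is_utf8(void)
{
    return sizeof e_acute == 3
        && e_acute[0] == 0xC3
        && e_acute[1] == 0xA9
        && e_acute[2] == 0x00;
}
```

Unlike a plain "" literal, this should give the same bytes whatever the compiler's default execution character set is.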

ldav1s