Using Unicode in a C++ source file

Question

I'm working with a C++ source file in which I would like to have a quoted string that contains Asian Unicode characters.

I'm working with Qt on Windows, and the Qt Creator development environment has no problem displaying the Unicode. QStrings also have no problem storing Unicode. When I paste in my Unicode, it displays fine, something like:

#define MY_STRING 鸟

However, when I save, my lovely Unicode characters all become ? marks.

I tried opening the source file and resaving it as Unicode encoded. It then displays and saves correctly in Qt Creator. However, on compile, the compiler seems to have no idea what to do with it, and throws a ton of misguided errors and warnings, such as "stray \255 in program" and "null character(s) ignored".

What's the correct way to include Unicode in C++ source files?

What compiler are you using? Many compilers (especially older compilers) do not support unicode source (most recent compilers will support universal character names, though). — James McNellis, Jul 24 '10 at 20:47
http://stackoverflow.com/questions/331690/c-source-in-unicode — Bertrand Marron, Jul 24 '10 at 20:53

score 9 · Accepted Answer · edited Apr 14 '15 at 15:41

Personally, I don't use any non-ASCII characters in source code. The reason is that if you use arbitrary Unicode characters in your source files, you have to worry about the encoding that the compiler considers the source file to be in, what execution character set it will use, and how it's going to do the source to execution character set conversion.

I think it's a much better idea to keep Unicode data in some sort of resource file, which could be compiled to static data at compile time or loaded at runtime for maximum flexibility. That way you can control how the encoding occurs and not worry about how the compiler behaves, which may be influenced by the locale settings at compile time.

It does require a bit more infrastructure, but if you have to internationalize, it's well worth spending the time to choose or develop a flexible and robust strategy.
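As a rough sketch of the runtime-loading option, assuming the translated text lives in a UTF-8 encoded file: the helper name `read_utf8_file` is hypothetical, and the sketch does no validation or BOM stripping.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read a whole file as raw bytes. The caller is
// assumed to know the file is UTF-8; nothing here checks or converts it.
std::string read_utf8_file(const std::string& path) {
    // Binary mode so the runtime does not translate line endings.
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) {
        return std::string();  // unreadable file: empty string in this sketch
    }
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

In a Qt program the returned bytes could then be handed to `QString::fromUtf8`, so the compiler never has to interpret non-ASCII source text.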

While it's possible to use universal character escapes (L'\uXXXX') or explicitly encoded byte sequences ("\xXX\xYY\xZZ") in source code, this makes Unicode strings virtually unreadable for humans. If you're having translations made, it's easier for most people involved in the process to deal with text in an agreed universal character encoding scheme.
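To illustrate the readability cost, here is a minimal sketch of both escape styles for U+9E1F (the 鸟 character from the question). The constant names are made up for the example, and the wide literal's value assumes a compiler whose wide execution character set is Unicode, which is the case for common GCC, Clang, and MSVC configurations.

```cpp
// Universal character name in a wide literal: the compiler converts
// U+9E1F to the wide execution character set.
const wchar_t kBirdWide[] = L"\u9e1f";

// The same character as explicitly encoded UTF-8 bytes (E9 B8 9F).
// The source file stays pure ASCII, but the text is opaque to a reader.
const char kBirdUtf8[] = "\xE9\xB8\x9F";

// Three UTF-8 bytes plus the terminating null.
static_assert(sizeof(kBirdUtf8) == 4, "expected 3 bytes and a terminator");
```

Neither form tells a translator which character is meant without a lookup, which is the point being made above.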

score 5 · Answer 2 · answered Jul 24 '10 at 20:55

Using the L prefix and \u or \U notation for escaping Unicode characters:

Section 6.4.3 of the C99 specification defines the \u escape sequences.

Example:

 #define MY_STRING L"A \u8801 B"   
 /* A congruent-to B */

answered Jul 24 '10 at 20:55 by Heath Hunnicutt

**`U+8801`** is [Unicode Han Character 'larvae, grubs'](http://www.fileformat.info/info/unicode/char/8801/index.htm). In your example, did you instead intend to use a character from [Unicode Characters in the 'Symbol, Math' Category](http://www.fileformat.info/info/unicode/category/Sm/list.htm)? – DavidRR Apr 14 '15 at 16:00

@DavidRR: That [makes much more sense](http://www.fileformat.info/info/unicode/char/2261/index.htm). Unicode notation is in hex. Perhaps Heath was confusing this notation with HTML's, which is decimal by default. – Jongware Apr 14 '15 at 17:30
@Jongware: Yes, good catch. Heath probably intended `\u2261` (IDENTICAL TO). – DavidRR Apr 14 '15 at 17:56
@DavidRR: See OP, "I would like to have a quoted string that contains Asian Unicode characters." I picked character 8801 because 8 is a lucky number, and the character is "Asian Unicode." I have no idea why you would think I wanted a math character instead of "larvae, grubs", given OP's context. – Heath Hunnicutt Apr 14 '15 at 19:50
@HeathHunnicutt: Your comment reads `A congruent-to B`. This led to my interpretation that you intended the Unicode character **U+8801** to represent **congruent to**. However, as Jongware discovered, `8801` decimal is `2261` hex, and **U+2261** represents **identical to**, which arguably could carry the same semantics as **congruent to**. (And I do realize your answer is nearly 5 years old, so how you arrived at it is likely not fresh in your mind.) – DavidRR Apr 14 '15 at 20:10
BTW, **鸟**, which is the Asian character that appears in the OP's question, is **U+9E1F** [Unicode Han Character 'bird; KangXi radical 196'](http://www.fileformat.info/info/unicode/char/9e1f/index.htm). So, the OP could instead use `#define MY_STRING L"\u9e1f"`. – DavidRR Apr 14 '15 at 20:30
@DavidRR given my comment, maybe that _is_ what I meant years ago. I'm just glad my character turned out to be worms for the bird. – Heath Hunnicutt Apr 14 '15 at 20:31

score 3 · Answer 3 · edited Apr 14 '15 at 16:06

Are you using a wchar_t interface? If so, you want L"\u1234" for a wide string containing Unicode character U+1234 (hex 0x1234). (Looking at the QString header file I think this is what you need.)

If not, and your interface is UTF-8, then you'll need to encode your character in UTF-8 first and then create a narrow string containing those bytes, e.g. "\xE1\x88\xB4" for U+1234.
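For working out those bytes, a minimal sketch of the standard UTF-8 encoding rules follows. The function name `encode_utf8` is hypothetical (it is not a Qt or standard library call), and invalid input is only caught by an assertion.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
std::string encode_utf8(std::uint32_t cp) {
    // Surrogates and values above U+10FFFF are not valid code points.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, `encode_utf8(0x9E1F)` yields the three bytes E9 B8 9F, which could be written in source as "\xE9\xB8\x9F".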

edited Apr 14 '15 at 16:06 by DavidRR

answered Jul 24 '10 at 20:54 by Rup
