I have a C program on Linux (Ubuntu 13.04).

#include<stdio.h>

int main()
{
    char* cp = "ӐҖ";
    printf("%s\n",cp);
    printf("%d\n",sizeof(*cp));
    printf("%d\n",(unsigned int)*cp);
    return 0;
}

The first and second printf calls output:

ӐҖ
1

respectively.

1.) My first concern: in the 3rd printf, I tried to cast the character to unsigned int in an attempt to see the Unicode code point of the first character, but I am getting -45. What is the best approach to see the Unicode code point of a single Unicode character that is stored in a 1-byte "char" data type?

2.) Second concern: when I port this code to Windows 7, [char* cp = "ӐҖ";] results in the compiler warning "warning C4566: character represented by universal-character name '\uFFE6' cannot be represented in the current code page (932)". When I run it, the output is:

??
1

Does Windows not support Unicode in the "char" data type? Then what character data type should I use to make my code portable from Linux to Windows?


1 Answer


C doesn't support Unicode. Neither does C++. There are libraries for that if you're interested, or you can hand-roll your own routines if you need.

char in C is not a "character" type; it's a byte type. I'm assuming you wrote your source code in UTF-8.

GCC interprets the bytes in string literals literally. You have defined a sequence of 5 bytes: d3 90 d2 96 00. (d3 interpreted as a signed char is -45, which is where your third printf's output comes from.) You can try strlen; it should return 4. Most Unix and C APIs are byte-oriented, so when you print out those bytes, what shows up on the screen depends on the encoding used by your terminal emulator. Usually it's UTF-8, so everything works.
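
For question 1, the usual approach is to look at each byte through unsigned char (so d3 doesn't sign-extend to -45) and, if you want the code point itself, decode the UTF-8 sequence. A minimal sketch, assuming the same UTF-8 source as above:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *cp = "ӐҖ";               /* 5 bytes in UTF-8: d3 90 d2 96 00 */

    printf("strlen: %zu\n", strlen(cp));  /* 4, counts bytes, not characters */

    /* Print each byte via unsigned char so 0xd3 doesn't come out as -45 */
    for (size_t i = 0; cp[i] != '\0'; i++)
        printf("byte %zu: 0x%02x\n", i, (unsigned char)cp[i]);

    /* Hand-decode the first code point (a 2-byte sequence: 110xxxxx 10xxxxxx) */
    unsigned int first = ((unsigned char)cp[0] & 0x1Fu) << 6
                       | ((unsigned char)cp[1] & 0x3Fu);
    printf("U+%04X\n", first);            /* U+04D0 for Ӑ */
    return 0;
}

For anything beyond a quick experiment, use a proper UTF-8 decoding routine or library rather than hand-decoding like this.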

If the source is in UTF-8, MSVC treats string and char literals as what you want to be displayed – that is, as text – and then encodes them in your system's default code page. So if you write "à", it will be re-encoded to e0 00 if you use CP-1252. If you use an encoding that has no à (for example CP-1250, which has ŕ at e0), you will get a question mark.

But how does MSVC know what text was in the file? It looks for a UTF-8 BOM. If your text file doesn't start with a BOM, MSVC assumes the encoding of the file is the default system encoding and doesn't try converting anything – it leaves the bytes as it saw them, just like GCC.

(Note: I see you use ShiftJIS; it may cause problems as it's not ASCII-compatible and I don't know how MSVC handles it. Proceed with caution.)

If you need to handle Unicode text and use MSVC, you can also use wide string literals. GCC supports them as well, although it lacks many library functions that can use them. But I'm a strong supporter of the UTF-8 manifesto and I recommend using UTF-8 strings as often as possible.

Note that if you remove the BOM, you can no longer use non-ASCII characters in wide string literals in MSVC.
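
If you do go the wide-literal route, a minimal sketch looks like the following (assuming the UTF-8-with-BOM setup described above for MSVC); keep in mind that wchar_t is 16-bit UTF-16 on Windows but 32-bit UTF-32 on typical Linux systems, so it is not a portable in-memory representation either, and whether the console actually displays the characters still depends on its code page and font:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    const wchar_t *ws = L"ӐҖ";    /* MSVC stores this as UTF-16, GCC/glibc as UTF-32 */

    setlocale(LC_ALL, "");        /* let wprintf convert to the terminal's encoding */
    wprintf(L"%ls\n", ws);

    /* Each element is now a whole code point (for characters inside the BMP) */
    wprintf(L"U+%04X U+%04X\n", (unsigned int)ws[0], (unsigned int)ws[1]);
    return 0;
}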

EDIT: See here for more discussion and the experiences of Asian developers with MSVC: How to create a UTF-8 string literal in Visual C++ 2008. Long story short: it's not pretty.

    C++11 adds more support for Unicode to the core C++ language beyond just *wide literals* (via the `L` literal prefix). It introduces new data types for 16/32-bit characters, new prefixes for UTF-8/16/32 literals, new `basic_string` typedefs for UTF-16/32 strings, and new `std::codecvt` types for UTF data conversions. – Remy Lebeau Jun 02 '14 at 21:26
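
For completeness on the C side: C11 added similar core-language pieces (char16_t/char32_t in <uchar.h> and the u8/u/U literal prefixes), though compiler support arrived later and varies. A rough sketch, assuming a C11 compiler:

#include <stdio.h>
#include <uchar.h>   /* char16_t, char32_t (C11) */

int main(void)
{
    const char     *u8s = u8"ӐҖ";  /* UTF-8 execution encoding, independent of the system code page */
    const char16_t *u16 = u"ӐҖ";   /* UTF-16 code units */
    const char32_t *u32 = U"ӐҖ";   /* UTF-32: one element per code point */

    /* With UTF-32 the cast from question 1 does what was intended */
    printf("U+%04X U+%04X\n", (unsigned int)u32[0], (unsigned int)u32[1]);
    printf("%s\n", u8s);           /* displays correctly on a UTF-8 terminal */
    (void)u16;                     /* unused in this sketch */
    return 0;
}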