2

I'm doing a rewrite of this question.

I want to create a string with a unicode escaped character such as "\u03B1" using an integer constant. For example, this string is the greek letter alpha.

const char *alpha = "\u03B1";

I want to construct the same string with a call to printf, starting from the integer value 0x03B1. For this example it can be done as shown below, but I'm not sure how to get those two numbers from 0x03B1.

printf("%c%c", 206, 177);

This link explains what to do but I'm not sure how to do it. http://www.fileformat.info/info/unicode/utf8.htm

For characters equal to or below 2047 (hex 0x07FF), the UTF-8 representation is spread across two bytes. The first byte will have the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The second byte will have the top bit set and the second bit clear (i.e. 0x80 to 0xBF).

NOTE: I do not want to create the string "\\u03B1" with a backslash. This is different than "\u03B1" which is an escaped unicode character.

Berry Blue
  • C or C++, pick one because answers will vary wildly. – Borgleader Nov 06 '14 at 20:06
  • `printf("\\u%04x", 1234);` – rslemos Nov 06 '14 at 20:24
  • 2
    Clarify whether you want to end up with the string `'\', 'u', '1', '2', '3', '4', '\0'`, or whether you are trying to build a single character of code point U+1234 – M.M Nov 06 '14 at 20:56
  • 1
    Also, does your console directly support wide Unicode characters, or do you need to output UTF-8? – Jongware Nov 06 '14 at 21:28
  • I removed the c++ tag. @rslemos, close but I want the string to be "\u1234" not "\\u1234". – Berry Blue Nov 07 '14 at 01:09
  • 3
    If you want a backslash, you need to escape it, so type two of them. Does `printf("\\u%04x", 0x1234)` do what you want? – yellowantphil Nov 07 '14 at 01:23
  • @BerryBlue go on and try it. – rslemos Nov 07 '14 at 01:39
  • After investigating further, it looks like the `\u` syntax is C99. Unfortunately, I haven't found a way to do what you're asking. I don't know of a good C99 reference, off the top of my head. Sorry. What you are describing won't work at all in C89. – yellowantphil Nov 07 '14 at 02:37
  • Maybe you could add tags for `unicode` and `c99` to get more people who might be able to answer your question. Or maybe change `c` to `c++`. – yellowantphil Nov 07 '14 at 02:39

2 Answers

3

It appears that even the most recent C and C++ standards are a bit disappointing in their handling of Unicode.

For those who are confused about the example in the question, like I was:

const char *alpha = "\u03B1";

In C99, this will store a pointer to the string "α" (U+03B1) in alpha. In C89, this is invalid syntax.
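To see how that literal relates to the numbers 206 and 177 in the question, here is a minimal sketch that writes out the byte values of a narrow string. The helper name format_bytes is hypothetical, and the result assumes the compiler's execution character set is UTF-8 (which, as far as I know, is the default for GCC and Clang).

```c
#include <stdio.h>

/* Write the bytes of s as space-separated decimal values into out.
   format_bytes is a hypothetical helper, for illustration only. */
void format_bytes(const char *s, char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return;
    out[0] = '\0';

    for (size_t i = 0; s[i] != '\0'; i++) {
        /* Cast through unsigned char so bytes above 127 are not negative. */
        int n = snprintf(out + used, outsize - used, "%s%u",
                         i ? " " : "", (unsigned)(unsigned char)s[i]);
        if (n < 0 || (size_t)n >= outsize - used)
            break;  /* stop on error or when the buffer is full */
        used += (size_t)n;
    }
}
```

Under the UTF-8 assumption, calling format_bytes("\u03B1", buf, sizeof buf) with a char buf[32] should leave "206 177" in buf, which are the two values passed to printf in the question.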

I could not find a way to use the \u syntax with a variable or integer constant, like what the question was requesting. You may be better off using a library to add better Unicode support to your program. I have not used the ICU library, but it sounds promising.

yellowantphil
  • Sorry, please see my edited question above. I want to create an escaped unicode character not a string with a backslash. – Berry Blue Nov 07 '14 at 01:49
1

I figured it out.

For a code point in the two-byte range (0x80 to 0x7FF), the first byte carries the upper 5 bits of the unicode value (mask 0x7c0, which is 11111000000, shifted right by 6) and the second byte carries the lower 6 bits (mask 0x3f, which is 00000111111).

The first byte is ORed with 0xc0 (11000000) to set its two high bits, and the second byte is ORed with 0x80 (10000000) to set its high bit.

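int alpha = 0x03B1; // 945
// unsigned char, because 206 and 177 do not fit in a signed char
unsigned char byte1 = 0xc0 | ((alpha & 0x7c0) >> 6); // 206: upper 5 bits
unsigned char byte2 = 0x80 | (alpha & 0x3f); // 177: lower 6 bits
printf("%c%c", byte1, byte2); // only valid for values 0x80 to 0x7FF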
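The same masking idea extends to the other UTF-8 lengths. Below is a rough sketch of a general encoder, not a definitive implementation: the function name utf8_encode and its buffer convention (caller supplies at least 5 bytes) are my own assumptions, not part of any standard library.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out, which must hold at
   least 5 bytes (up to 4 encoded bytes plus a terminating 0).
   Returns the number of encoded bytes, or 0 if cp is not encodable.
   utf8_encode is a hypothetical helper name. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    size_t n;

    out[0] = '\0';
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        n = 1;
    } else if (cp < 0x800) {               /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {             /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {           /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {
        return 0;                          /* beyond U+10FFFF */
    }
    out[n] = '\0';
    return n;
}
```

With unsigned char buf[5], calling utf8_encode(0x03B1, buf) returns 2 and leaves the bytes 206 and 177 in buf. Passing (const char *)buf to printf with %s then prints the alpha on a terminal that expects UTF-8.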
Berry Blue