I am trying to store a Unicode code point in a variable in C. I tried wchar_t, but the code point I want to store is U+1F319, which does not fit in a wchar_t on my machine. How can I get around this? I'm using a Windows computer.

#include <locale.h>
#include <wchar.h>
#include <stdio.h>
#include <stdlib.h>

int main(void){

    setlocale(LC_ALL,"en_US.UTF-8");

    unsigned long long x = 0x1F319;
    wchar_t wc =L'\U0001f319';
    wprintf(L"%lc",wc);

    return EXIT_SUCCESS;
}

The code above produces this warning:

Unicode.c:12:14: warning: character constant too long for its type
wchar_t wc =L'\U0001f319';

Remy Lebeau
coderhk
  • The Unicode codepoint you are trying to store would not fit in a single `wchar_t` if `sizeof(wchar_t) < 4`, which is the case on Windows (where `sizeof(wchar_t) == 2`). In which case, you would need to encode the codepoint using [UTF-16](https://en.wikipedia.org/wiki/UTF-16) and store the resulting values in 2 `wchar_t`s acting together. The UTF-16 encoded form of `U+1F319` is `0xD83C 0xDF19`. For instance: `const wchar_t *wc = L"\uD83C\uDF19"; wprintf(L"%s", wc);` – Remy Lebeau Dec 15 '18 at 01:20
  • If your compiler supports `char32_t` from C11, that would be ideal; otherwise use an int. Also look into Unicode libraries that can convert between encodings, such as UTF-32 to UTF-16. – Shawn Dec 15 '18 at 01:21
  • You can't do that; take a look at this answer: https://stackoverflow.com/questions/2259544/is-wchar-t-needed-for-unicode-support – Ayoub Benayache Dec 15 '18 at 01:22
  • `wchar_t` can only hold 16 bits, but you're missing the `\u` after the first 8 bits; you might want to try `\uD83C\uDF19` – Miguel Ruivo Dec 15 '18 at 01:23
  • @RemyLebeau I changed the en_US.UTF-8 to en_US.UTF-16 and then implemented your suggestion but got this error: \uD83C is not a valid universal character – coderhk Dec 15 '18 at 01:44
  • @coderhk `setlocale(LC_ALL,"en_US.UTF-8");` should be `setlocale(LC_ALL,"");` or at least `setlocale(LC_CTYPE,"");`, but that has no effect on the compiler. `L"\uD83C\uDF19"` is perfectly valid code for a `wchar_t` string literal on Windows. Which compiler are you using? – Remy Lebeau Dec 15 '18 at 04:05
  • @RemyLebeau I am using gcc. Would it be possible to bypass this problem by allocating memory from the heap instead? – coderhk Dec 15 '18 at 05:17
  • Storing a Unicode code point is not a problem. You just have to decide how you want to represent it, which will probably depend on what you are planning to do with it. The most popular choices are UTF-8, UTF-16, and UTF-32. If you choose UTF-8, you will need 1-4 8-bit integers to store the code point. If you choose UTF-16, you will need 1-2 unsigned 16-bit integers, and if you choose UTF-32, you will need one 32-bit integer. You also need to know that Windows functions usually expect Unicode text passed to them to be represented as UTF-16. – Stuart Dec 15 '18 at 06:28
  • Read http://utf8everywhere.org/ and perhaps use [libunistring](https://www.gnu.org/software/libunistring/) – Basile Starynkevitch Dec 15 '18 at 06:38
  • @coderhk memory allocation is not an issue. The stack can also be used. For example: `wchar_t wc[] = { 0xD83C, 0xDF19, 0x0000}; wprintf(L"%s", wc);` – Remy Lebeau Dec 15 '18 at 07:06
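
The surrogate pair quoted in the comments above (`0xD83C 0xDF19`) does not have to be worked out by hand. Below is a minimal sketch of the UTF-16 encoding step, assuming the code point arrives as a plain 32-bit integer; the name `utf16_encode` is hypothetical and not part of any standard library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written,
   or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                                /* out of range or a lone surrogate */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                   /* fits in a single code unit */
        return 1;
    }

    cp -= 0x10000;                               /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

Calling `utf16_encode(0x1F319, units)` fills `units` with `0xD83C` and `0xDF19`; on Windows those two values could then be copied into a `wchar_t` array followed by a terminating zero.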

1 Answer

How can I store a unicode in C?

Since C11, "to store a Unicode codepoint", use char32_t, as @Shawn suggested in the comments.

#include <uchar.h>

char32_t ch1 = 0x1F319;
char32_t ch2 = U'\U0001f319';

Works on my Windows computer. The standard describes the type as:

char32_t

which is an unsigned integer type used for 32-bit characters and is the same type as uint_least32_t... C11 §7.27 2
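
Storing the code point in a char32_t is only half of the job if it later has to be written to a file or a terminal, which usually expects an encoded form. As a rough sketch, assuming UTF-8 output is wanted, the value can be expanded into 1 to 4 bytes as shown below; the name `utf8_encode` is hypothetical and not something `<uchar.h>` provides.

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: encode one code point as UTF-8.
   Writes 1 to 4 bytes to out and returns the byte count,
   or 0 if cp is not a valid Unicode scalar value. */
size_t utf8_encode(char32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                             /* 7 bits: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                            /* 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                          /* 16 bits: three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                            /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                        /* 21 bits: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                    /* beyond U+10FFFF */
}
```

For `U'\U0001f319'` this yields the four bytes `F0 9F 8C 99`, which can be written with `fwrite` to a stream that expects UTF-8.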

Community
chux - Reinstate Monica