How can I store a unicode in C?

Question

I am trying to store a Unicode code point in a variable in C. I tried using wchar_t; however, the code point I am trying to store is U+1F319, which doesn't fit in wchar_t. How can I get around this? I'm using a Windows computer.

#include <locale.h>
#include <wchar.h>
#include <stdio.h>
#include <stdlib.h>

int main(void){

    setlocale(LC_ALL,"en_US.UTF-8");

    unsigned long long x = 0x1F319;
    wchar_t wc =L'\U0001f319';
    wprintf(L"%lc",wc);

    return EXIT_SUCCESS;
}

The code above gives this warning:

Unicode.c:12:14: warning: character constant too long for its type
wchar_t wc =L'\U0001f319';

The Unicode codepoint you are trying to store would not fit in a single `wchar_t` if `sizeof(wchar_t) < 4`, which is the case on Windows (where `sizeof(wchar_t) == 2`). In which case, you would need to encode the codepoint using [UTF-16](https://en.wikipedia.org/wiki/UTF-16) and store the resulting values in 2 `wchar_t`s acting together. The UTF-16 encoded form of `U+1F319` is `0xD83C 0xDF19`. For instance: `const wchar_t *wc = L"\uD83C\uDF19"; wprintf(L"%s", wc);` — Remy Lebeau, Dec 15 '18 at 01:20
If your compiler supports `char32_t` from C11 that'd be ideal, but otherwise an int. And look into Unicode libraries that can convert between different encodings like utf-32 to utf-16. — Shawn, Dec 15 '18 at 01:21
You can't do that; take a look at this answer: https://stackoverflow.com/questions/2259544/is-wchar-t-needed-for-unicode-support — Ayoub Benayache, Dec 15 '18 at 01:22
`wchar_t` can only hold 16 bits, but you're missing the `\u` after the first 8 bits; you might want to try `\uD83C\uDF19` instead — Miguel Ruivo, Dec 15 '18 at 01:23
@RemyLebeau I changed the en_US.UTF-8 to en_US.UTF-16 and then implemented your suggestion but got this error: \uD83C is not a valid universal character — coderhk, Dec 15 '18 at 01:44
@coderhk `setlocale(LC_ALL,"en_US.UTF-8");` should be `setlocale(LC_ALL,"");` or at least `setlocale(LC_CTYPE,"");`, but that has no effect on the compiler. `L"\uD83C\uDF19"` is perfectly valid code for a `wchar_t` string literal on Windows. Which compiler are you using? — Remy Lebeau, Dec 15 '18 at 04:05
@RemyLebeau I am using gcc. Would it be possible to bypass this problem by allocating memory from the heap instead? — coderhk, Dec 15 '18 at 05:17
Storing a Unicode code point is not a problem. You just have to decide how you want to represent it, which will probably depend on what you are planning to do with it. The most popular choices are UTF-8, UTF-16, and UTF-32. If you choose UTF-8, you will need 1-4 8-bit integers to store the code point. If you choose UTF-16, you will need 1-2 unsigned 16-bit integers, and if you choose UTF-32, you will need one 32-bit integer. You also need to know that Windows functions usually expect Unicode text passed to them to be represented as UTF-16. — Stuart, Dec 15 '18 at 06:28
Read http://utf8everywhere.org/ and perhaps use [libunistring](https://www.gnu.org/software/libunistring/) — Basile Starynkevitch, Dec 15 '18 at 06:38
@coderhk memory allocation is not an issue. The stack can also be used. For example: `wchar_t wc[] = { 0xD83C, 0xDF19, 0x0000}; wprintf(L"%s", wc);` — Remy Lebeau, Dec 15 '18 at 07:06
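
The surrogate pair `0xD83C 0xDF19` quoted in the comments can be derived from the code point rather than looked up. The following is a minimal sketch of that arithmetic, not code from any of the commenters; the function name `utf16_encode` and its return convention are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of any standard library):
 * encode one Unicode code point as UTF-16.
 * Writes 1 or 2 code units to out and returns how many were written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                                /* not a valid scalar value */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                   /* BMP: a single code unit */
        return 1;
    }

    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For `0x1F319` this yields the two units `0xD83C` and `0xDF19`. On a platform where `wchar_t` is 16 bits wide, such as Windows, those units could be copied into a `wchar_t` array with a terminating `0` and printed the way the last comment shows.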

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

How can I store a unicode in C?

Since C11, to "store a Unicode codepoint", use char32_t, as @Shawn suggested in the comments.

#include <uchar.h>

char32_t ch1 = 0x1F319;
char32_t ch2 = U'\U0001f319';

Works on my Windows computer.

char32_t, which is an unsigned integer type used for 32-bit characters and is the same type as uint_least32_t... (C11 §7.28 2)
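
Storing the code point in a char32_t does not by itself make it printable; it usually has to be converted to whatever encoding the output expects. As a rough sketch, assuming the output should be UTF-8, the conversion can be done by hand. The function name `utf8_encode` is hypothetical and is not part of <uchar.h>.

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper (not a standard function):
 * encode one code point as UTF-8 into out, which must hold 4 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * or lies above U+10FFFF. */
size_t utf8_encode(char32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not code points to encode */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

Calling `utf8_encode(U'\U0001f319', buf)` fills `buf` with the four bytes `F0 9F 8C 99`, which can then be written with `fwrite` to a stream that expects UTF-8. Whether a Windows console displays them correctly depends on its code page and font.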

edited Jun 20 '20 at 09:12

Community


answered Dec 15 '18 at 06:15

chux - Reinstate Monica

