Understanding and writing wchar_t in C

Question

I'm currently rewriting (a part of) the printf() function for a school project. Overall, we were required to reproduce the behaviour of the function with several flags, conversions, length modifiers ...

The only thing I have left to do, and the part I'm stuck on, is the conversions %C / %S (or %lc / %ls).

So far, I've gathered that wchar_t is a type that can store characters on more than one byte, in order to accept more characters or symbols and therefore be compatible with pretty much every language, regardless of their alphabet and special characters.

However, I wasn't able to find any concrete information on what a wchar_t looks like to the machine, its actual length (which apparently varies based on several factors including the compiler, the OS ...) or how to actually write one.

Thank you in advance

Note that we are limited in the functions we are allowed to use. The only allowed functions are write(), malloc(), free(), and exit(). We must be able to code any other required function ourselves.

To sum this up, what I'm asking for here is some information on how to interpret and write "manually" any wchar_t character, with as little code as possible so that I can try to understand the whole process and code it myself.

I would start by narrowing down what `wchar_t` can mean in your situation. On most *nix systems this would mean UTF-32. On Windows it means UTF-16. After that you need to decide what your narrow `char` is going to be. On most *nix systems it means UTF-8. The good news is that converting between Unicode representations is very well defined. — Mgetz, Dec 10 '14 at 12:48
@Mgetz - It appears to be UTF-32 (MAC OSX at school. I'll try on debian at home). So if I got your answer right, my goal is to try to convert a UTF-32 char into a UTF-8 one, is that correct ? — kRYOoX, Dec 10 '14 at 14:26
@kRYOoX my comment was to provide guidance, not do your homework for you. — Mgetz, Dec 10 '14 at 14:27

score 16 · Accepted Answer · edited Oct 19 '20 at 06:05


A wchar_t is similar to a char in the sense that it is a number, but when displaying a char or wchar_t we don't want to see the number; we want the drawn character corresponding to it. The mapping from numbers to characters isn't defined by either char or wchar_t; it depends on the system. So in the end there is no difference in usage between char and wchar_t except for their sizes.

Given the above, the most trivial implementation of printf("%ls") is one where you know which encodings the system uses for char and wchar_t. For example, on my system, char is 8 bits and holds UTF-8, while wchar_t is 32 bits and holds UTF-32. So the printf implementation just converts from UTF-32 to UTF-8 and outputs the result.
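As a rough illustration of that UTF-32 to UTF-8 case, here is a minimal sketch that relies only on write(). It assumes wchar_t holds a Unicode code point (UTF-32) and that the output is expected to be UTF-8; neither is guaranteed on every system. The names utf8_encode, put_wchar_fd and put_wstr_fd are made up for this example and do not come from any library.

```c
#include <stddef.h>
#include <unistd.h>
#include <wchar.h>

/* Encode one Unicode code point as UTF-8 into out.
   Returns the number of bytes produced (1 to 4), or 0 if cp is not a
   valid code point (a surrogate, or a value above U+10FFFF). */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Assumption: wc is a UTF-32 code point. Writes its UTF-8 form to fd.
   Returns the number of bytes written, or -1 on error. */
int put_wchar_fd(int fd, wchar_t wc)
{
    unsigned char buf[4];
    size_t n = utf8_encode((unsigned long)wc, buf);

    if (n == 0)
        return -1;
    return (int)write(fd, buf, n);
}

/* Writes a null-terminated wide string to fd, one character at a time.
   Returns the total number of bytes written, or -1 on error. */
long put_wstr_fd(int fd, const wchar_t *ws)
{
    long total = 0;

    for (; *ws != L'\0'; ws++) {
        int n = put_wchar_fd(fd, *ws);
        if (n < 0)
            return -1;
        total += n;
    }
    return total;
}
```

For instance, put_wstr_fd(1, L"h\u00e9llo") would be expected to send the six bytes 68 C3 A9 6C 6C 6F to standard output. A %lc handler could call put_wchar_fd and a %ls handler put_wstr_fd; buffering the bytes before a single write() would be a natural refinement.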

A more general implementation must support different, configurable encodings and may need to inspect what the current encoding is. In that case, functions like wcsnrtombs() or iconv() must be used.
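For comparison, a locale-aware variant might look like the sketch below. It uses the standard wcrtomb() (a per-character relative of wcsnrtombs()), so it would not be allowed under the question's restrictions; it is only meant to show the idea. The function name put_wstr_locale is hypothetical, and the sketch assumes the caller has already selected a locale, for example with setlocale(LC_ALL, "").

```c
#include <limits.h>
#include <string.h>
#include <unistd.h>
#include <wchar.h>

/* Converts ws to the multibyte encoding of the current locale and
   writes it to fd. Returns the number of bytes written, or -1 if a
   character cannot be represented or write() fails.
   Note: for stateful encodings a final wcrtomb(buf, L'\0', &st) would
   be needed to emit a closing shift sequence; omitted in this sketch. */
long put_wstr_locale(int fd, const wchar_t *ws)
{
    mbstate_t st;
    char buf[MB_LEN_MAX];
    long total = 0;

    memset(&st, 0, sizeof st);            /* initial conversion state */
    for (; *ws != L'\0'; ws++) {
        size_t n = wcrtomb(buf, *ws, &st);
        if (n == (size_t)-1)
            return -1;                    /* not representable in this locale */
        if (write(fd, buf, n) != (ssize_t)n)
            return -1;
        total += (long)n;
    }
    return total;
}
```

Without a prior setlocale() call the program runs in the "C" locale, where non-ASCII characters typically fail to convert, so the result depends on the environment rather than on a fixed UTF-8 assumption.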

answered Dec 13 '14 at 20:21 by hdante
edited Oct 19 '20 at 06:05 by lavalade

Actually, if `__STDC_ISO_10646__` is defined, `wchar_t` should store Unicode codepoint values, as of the date specified in that macro. See ISO C 6.10.8.2 – ninjalj Dec 13 '14 at 21:42
And if __STDC_ISO_10646__ is not defined, then wchar_t need not store Unicode codepoint values. – hdante Dec 13 '14 at 21:46
This is pretty much what I guessed based on @Mgetz comment to my question. Thank you for confirming it though. With some more reading on Unicode encoding and how to manipulate it, I was able to implement what I needed. – kRYOoX Dec 17 '14 at 13:11
