wcstombs: character encoding?

Question

The wcstombs documentation says that it "converts the sequence of wide-character codes to multibyte string", but it never says what a "wide-character" is.

Is the encoding implicit (say, it converts UTF-16 to UTF-8), or is the conversion defined by some environment variable?

Also, what is the typical use case of wcstombs?

A "wide-character" is a `wchar_t`. – kennytm Feb 03 '10 at 07:08

Michael Burr · Accepted Answer · 2010-02-03T08:18:37.883

You use the setlocale() standard function with the LC_CTYPE (or LC_ALL) category to set the mapping the library uses between wchar_t characters and multibyte characters. The actual locale name passed to setlocale() is implementation defined, so you'll need to look it up in your compiler's docs.

For example, with MSVC you might use

setlocale( LC_ALL, ".1252" );

to set the C runtime to use codepage 1252 as the multibyte character set. Note that the MSVC docs explicitly indicate that the locale cannot be set to UTF-7 or UTF-8 for the multibyte character set:

The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL.

The "wide-character" wchar_t type is intended to be able to support any character set the system supports - the standard doesn't define the size of a wchar_t type (it could be as small as a char or any of the larger integer types). On Windows it's the system's 'internal' Unicode encoding, which is UTF-16 (UCS-2 before WinXP). Honestly, I can't find a direct quote on that in the MSVC docs, though. Strictly speaking, the implementation should call this out, but I can't find it.
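To make the relationship between the locale and wcstombs concrete, here is a minimal sketch. The helper name `wide_to_multibyte` is hypothetical (it is not part of any library), and the sketch assumes the caller has already selected a locale with setlocale(); without that call the program runs in the default "C" locale.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to a newly allocated
 * multibyte string, using the encoding of the current LC_CTYPE locale.
 * Returns NULL if allocation fails or if some wide character cannot be
 * represented in that encoding. The caller frees the result. */
char *wide_to_multibyte(const wchar_t *ws)
{
    /* Each wide character needs at most MB_CUR_MAX bytes in the
     * current locale; add one byte for the terminating '\0'. */
    size_t cap = wcslen(ws) * MB_CUR_MAX + 1;
    char *out = malloc(cap);
    if (out == NULL)
        return NULL;

    /* wcstombs returns (size_t)-1 when it meets a wide character
     * that has no multibyte representation in this locale. */
    if (wcstombs(out, ws, cap) == (size_t)-1) {
        free(out);
        return NULL;
    }
    return out;
}
```

A caller would typically run something like `setlocale(LC_CTYPE, "")` once at startup to pick up the user's environment, then call the helper. The same wide string can yield different bytes, or fail outright, depending on which locale is active.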

Warning: there is no standard for the locale string in setlocale, so it is not easy to do anything cross-platform. For instance, .1252 is valid on Windows but not on UNIX/Linux (there you will see names like en_US.UTF-8 or en_US.ISO-8859-1). — Mihai Nita, Feb 03 '10 at 08:50

score 3 · Answer 2 · answered Feb 03 '10 at 07:18

It converts whatever your platform uses for a "wide char" (which I'm led to believe is indeed UCS2 on Windows, but is usually UCS4 on UNIX) into your current locale's default multibyte character encoding. If your locale is a UTF-8 one, then that is the multibyte encoding that will be used - but note that there are other possibilities, like JIS.
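The dependence on the current locale can be observed directly. The sketch below uses a hypothetical helper, `bytes_in_locale`, that temporarily switches LC_CTYPE and asks wcsrtombs (the restartable sibling of wcstombs, which ISO C allows to be called with a null destination) how many bytes a wide string would need. Locale names are implementation defined, so any name other than "C" is an assumption about what is installed.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: number of bytes `ws` needs in the multibyte
 * encoding of locale `name`, excluding the terminating '\0'.
 * Returns (size_t)-1 if the locale cannot be selected or cannot
 * represent some character. The previous locale is restored. */
size_t bytes_in_locale(const char *name, const wchar_t *ws)
{
    char saved[128];
    const char *cur = setlocale(LC_CTYPE, NULL);  /* query only */
    if (cur == NULL || strlen(cur) >= sizeof saved)
        return (size_t)-1;
    strcpy(saved, cur);  /* later setlocale calls may overwrite cur */

    if (setlocale(LC_CTYPE, name) == NULL)
        return (size_t)-1;

    mbstate_t st;
    memset(&st, 0, sizeof st);  /* initial conversion state */
    const wchar_t *src = ws;
    /* With a null destination, wcsrtombs only counts bytes. */
    size_t n = wcsrtombs(NULL, &src, 0, &st);

    setlocale(LC_CTYPE, saved);
    return n;
}
```

In the "C" locale, a string containing the euro sign (U+20AC) is expected to be rejected, while in a UTF-8 locale such as en_US.UTF-8 (if one is installed) the same character takes three bytes.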

answered Feb 03 '10 at 07:18

caf

On Windows that is UTF-16, not UCS2. – Mihai Nita Feb 03 '10 at 08:48

Fair enough. (That seems somewhat broken - the whole point of widechars was supposed to be that one widechar is always exactly one character). – caf Feb 03 '10 at 22:10

That's never true. Even a 32-bit widechar on Linux might represent a non-printing element such as part of a decomposed accented character, or a RTL ordering directive, or all sorts of other things. So it's never safe to assume that one code point is one character, no matter the encoding. – Miral Dec 14 '17 at 00:49
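The two points made in these comments can be illustrated with a short sketch. The names `decomposed_e_acute` and `wchar_units_for` are hypothetical, and the second function assumes that a 16-bit wchar_t holds UTF-16 and a wider one holds UTF-32, which is common but not guaranteed by the C standard.

```c
#include <stdint.h>
#include <wchar.h>

/* "e with acute accent" written as a base letter plus a combining mark:
 * U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
 * It displays as one character but occupies two wchar_t elements. */
static const wchar_t decomposed_e_acute[] = L"e\u0301";

/* How many wchar_t elements one Unicode code point occupies, under the
 * assumption stated above (UTF-16 if wchar_t is 16 bits, else UTF-32). */
int wchar_units_for(unsigned long code_point)
{
#if WCHAR_MAX <= 0xFFFF
    /* Code points above U+FFFF need a surrogate pair in UTF-16. */
    return code_point > 0xFFFFUL ? 2 : 1;
#else
    (void)code_point;
    return 1;
#endif
}
```

Here `wcslen(decomposed_e_acute)` reports 2 even though a reader sees a single character, so neither the element count nor the code point count is a reliable count of user-perceived characters.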

Sam Post · Answer 3 · 2010-02-03T07:22:40.343

Wide character strings are composed of multi-byte characters, whereas the normal C string is a char* - a sequence of byte-wide characters. Wchars are not the same thing as Unicode on all platforms, though Unicode representations are typically based on wchar_t.

I've seen wchars used in embedded systems like phones, where you want filenames with special characters but don't necessarily want to support all the glory and complexity of Unicode.

Typical usage would be converting a 2-byte based string to a regular C string, and vice versa.
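As a rough illustration of converting in both directions, the sketch below takes a regular C string to a wide string with mbstowcs and back with wcstombs. The function name `round_trips` is hypothetical, and the result depends on the current LC_CTYPE locale.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical check: does `s` survive char* -> wchar_t* -> char*
 * under the current LC_CTYPE locale? Returns 1 if the bytes come
 * back unchanged, 0 on any conversion or allocation failure. */
int round_trips(const char *s)
{
    int ok = 0;
    size_t len = strlen(s);
    /* A multibyte string never yields more wide elements than it
     * has bytes, so len + 1 elements is always enough. */
    wchar_t *wide = malloc((len + 1) * sizeof *wide);
    char *back = NULL;

    if (wide != NULL && mbstowcs(wide, s, len + 1) != (size_t)-1) {
        /* Going back, each wide element needs at most MB_CUR_MAX bytes. */
        size_t cap = wcslen(wide) * MB_CUR_MAX + 1;
        back = malloc(cap);
        if (back != NULL && wcstombs(back, wide, cap) != (size_t)-1)
            ok = (strcmp(s, back) == 0);
    }

    free(back);
    free(wide);
    return ok;
}
```

Plain ASCII input should round-trip in any locale. Input containing bytes that are invalid in the current locale's encoding makes mbstowcs fail, and the function returns 0.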

edited Feb 03 '10 at 07:22

answered Feb 03 '10 at 07:05

This is perhaps a bit confusing - in this and similar usages, a "multi-byte string" is a string made of chars - a "standard ansi c-string", but where there may be more than one char (byte) per logical character, whereas a wide string typically allots more than 1 byte per element (sizeof(wchar_t)==2 is common), often initially in the mistaken belief that this would allow number of logical characters in a string to equal number of elements. – Ryan Pavlik Sep 10 '15 at 18:35

score 1 · Answer 4 · answered Feb 03 '10 at 07:18

According to the C standard, the wchar_t type is "capable of representing any character in the current locale". The standard doesn't say what the encoding for wchar_t is. In fact, the standard only requires the range from WCHAR_MIN to WCHAR_MAX to cover at least [0, 255] or [-127, 127], depending upon whether wchar_t is unsigned or signed.

A multibyte character can use more than one byte. A multibyte string is made of one or more multibyte characters. In a multibyte string, the characters need not all occupy the same number of bytes (UTF-8 is an example). An object of type wchar_t, by contrast, has a fixed size (in a given implementation, of course).
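The difference between bytes and characters in a multibyte string can be seen by walking the string with mbrlen. This is only a sketch; `count_mb_chars` is a hypothetical name, and the count it returns depends on the current LC_CTYPE locale.

```c
#include <string.h>
#include <wchar.h>

/* Count characters (not bytes) in a multibyte string under the
 * current LC_CTYPE locale. Returns (size_t)-1 if the string holds
 * an invalid or truncated multibyte sequence. */
size_t count_mb_chars(const char *s)
{
    mbstate_t st;
    size_t count = 0;
    size_t left = strlen(s);
    memset(&st, 0, sizeof st);  /* initial conversion state */

    while (left > 0) {
        /* mbrlen reports how many bytes the next character uses. */
        size_t n = mbrlen(s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;  /* invalid or incomplete sequence */
        if (n == 0)
            break;              /* embedded null character */
        s += n;
        left -= n;
        count++;
    }
    return count;
}
```

For pure ASCII text the result equals strlen(). In a UTF-8 locale, a string whose characters take two or three bytes each would return a count smaller than its byte length.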

As an aside, I can also find the following in my copy of the C99 draft:

__STDC_ISO_10646__ An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda, as of the specified year and month.

So, if I understood correctly, if __STDC_ISO_10646__ is defined, then wchar_t can store Unicode characters.
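One way to see what a particular implementation provides is to test the macro and print the limits. The function names below are hypothetical, and the printed values vary by platform, so no specific output is claimed.

```c
#include <stdint.h>
#include <stdio.h>
#include <wchar.h>

/* Returns 1 if the implementation promises that wchar_t values are
 * ISO/IEC 10646 (Unicode) code points, 0 if it makes no such promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Print the size and range of wchar_t on this implementation. */
void describe_wchar(void)
{
    printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    printf("WCHAR_MIN = %lld, WCHAR_MAX = %lld\n",
           (long long)WCHAR_MIN, (long long)WCHAR_MAX);
    printf("__STDC_ISO_10646__ defined: %s\n",
           wchar_is_iso10646() ? "yes" : "no");
}
```

Note that an implementation may use Unicode for wchar_t without defining the macro, so a result of 0 only means that no guarantee is given.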

The actual limit on `WCHAR_MAX` is not `255` (you are probably confusing it with the `char` type). According to C11 (C99 has the same wording), the value of `WCHAR_MAX` shall be no less than 255. The real value may be `2147483647`. Live example [here](http://melpon.org/wandbox/permlink/zQmKmfSJET4nHkcY). I have never seen it be `255`. — αλεχολυτ, Feb 20 '16 at 11:22
