What is the locale of UTF-8?

Question

I am getting an XML file from the Facebook API with the data:

<?xml version="1.0" encoding="UTF-8"?> 
<fql_query_response xmlns="api.facebook.com/1.0/" xmlns:xsi="w3.org/2001/XMLSchema-instance" list="true">
    <user> 
        <uid>100000022063315</uid> 
        <name>0xD7 0x99 0xD7 0x95 0xD7 0x97 0xD7 0x90 0xD7 0x99 0x20 0xD7 0x95 0xD7 0x9B 0xD7 0x98 0xD7 0xA8</name> 
    </user>
</fql_query_response>

I want to convert the UTF-8 data to wchar_t. I am trying to do so with mbstowcs, but apparently I need to know which locale to set. Is there a standard locale for Facebook, or for UTF-8?

UTF-8 doesn't have a "locale". It is just an [encoding](http://www.unicode.org/reports/tr17/) for [Unicode](http://en.wikipedia.org/wiki/Unicode) (maps Unicode codepoints onto one or more bytes, often for transmission) -- what "locale" is Unicode? — , Jun 24 '11 at 15:38
Any locale ending in ".utf8" will do, e.g. "en_US.utf8". Say `setlocale(LC_CTYPE, "en_US.utf8");` before you do `mbsrtowcs` and it should work. Alternatively use iconv going from UTF8 to WCHAR_T. — Kerrek SB, Jun 24 '11 at 15:42
I would set it so that wchar_t strings are UTF-16 (or UTF-32 depending) — Martin York, Jun 24 '11 at 16:46
@Kerrek SB: That's a Linux locale. On Windows, [If you provide a code page like UTF-7 or UTF-8, `setlocale` will fail, returning `NULL`](http://msdn.microsoft.com/en-us/library/x99tb11d.aspx) — MSalters, Jun 25 '11 at 14:06
@Martin: You cannot control the _result_ of `mbstowcs`. It's just some implementation-defined fixed-width string. @MSalters. Good point. Use iconv (from UTF8 to WCHAR_T), as I suggest below. — Kerrek SB, Jun 25 '11 at 14:11

score 5 · Accepted Answer · answered Jun 24 '11 at 16:36

To translate data that's not associated with the user's configured locale, but rather an explicitly specified encoding, you should use iconv, not mbsrtowcs. You don't need setlocale at all for this.
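A minimal sketch of that approach, assuming a glibc-style iconv that accepts the "WCHAR_T" and "UTF-8" encoding names (on systems where iconv lives in a separate library you may need to link with -liconv). The helper name `utf8_to_wchar` and its buffer-size convention are made up for illustration:

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not part of any library): converts a NUL-terminated
 * UTF-8 string into wchar_t using iconv. No setlocale call is involved.
 * out_cap is the capacity of out in wchar_t units, including the terminator.
 * Returns the number of wide characters written (terminator excluded),
 * or (size_t)-1 on failure. */
size_t utf8_to_wchar(const char *utf8, wchar_t *out, size_t out_cap)
{
    if (out_cap == 0)
        return (size_t)-1;

    /* Argument order is (to, from). */
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in_ptr = (char *)utf8;        /* iconv does not modify the input data */
    size_t in_left = strlen(utf8);
    char *out_ptr = (char *)out;
    size_t out_bytes = (out_cap - 1) * sizeof(wchar_t);  /* keep room for L'\0' */
    size_t out_left = out_bytes;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;              /* invalid input or output buffer too small */

    size_t n = (out_bytes - out_left) / sizeof(wchar_t);
    out[n] = L'\0';
    return n;
}
```

You would call it with the bytes of the `<name>` element and a buffer such as `wchar_t wide[64]`, then check the return value against `(size_t)-1` before using `wide`.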

answered Jun 24 '11 at 16:36

R.. GitHub STOP HELPING ICE

score 4 · Answer 2 · answered Jun 24 '11 at 15:45

As @pst notes, the terminology here is a bit off. "Locale" is sometimes used to refer to which ANSI code page represents international text when Unicode is not available.

Read Joel Spolsky's fantastic "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"

Now, to answer your question: if you need to convert UTF-8 encoded text to UTF-16 (or what on Windows is commonly called "wide char"), you can use a function such as MultiByteToWideChar with the parameter CP_UTF8.

answered Jun 24 '11 at 15:45

Assaf Lavie

Yes, I'm running on Ubuntu though, hence the need for mbstowcs – chacham15 Jun 24 '11 at 15:57
Should also note [this article](http://utf8everywhere.org) which is kind of an answer to Joel's article above. – Qix - MONICA WAS MISTREATED Jan 24 '13 at 19:17

score 0 · Answer 3 · edited May 23 '17 at 12:27

Here is a little discussion I started a while ago on this subject.

Basically, I would personally distinguish two separate paths on encoding handling:

One is an encoding-agnostic, "internally portable" path that uses mbstowcs to convert external multibyte data, such as char * argv[], into an internal, fixed-width wide string, all without ever talking about encodings.
The other is a fixed-encoding, serializable path that deals with data that ships in deterministic encodings. To translate among those, the POSIX iconv library does the trick.
You can bridge between the two paths by using iconv's special WCHAR_T encoding.

Since the situation that you describe requires you to read serialized, deterministic data, I would suggest using iconv to convert FROM UTF8 (which you know you have) and convert TO WCHAR_T, which you can then treat with your standard C wide string functions (but don't make assumptions about the actual encoding). If you need to print data to the console, you can always wcstombs from your internal wide strings to a multibyte representation (the details of which are again not of your concern) that the console told you it wants.
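For the output side mentioned above, here is a rough sketch of going from an internal wide string back to whatever multibyte encoding the environment uses. It assumes the program has called `setlocale(LC_ALL, "")` once at startup so that wcstombs targets the user's configured locale. The helper name `wide_to_locale` is hypothetical:

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts an internal wide string into the multibyte
 * encoding of the current locale, without naming that encoding anywhere.
 * out_cap is the size of out in bytes, including the terminator.
 * Returns the number of bytes written (terminator excluded), or (size_t)-1
 * if a character cannot be represented in the current locale. */
size_t wide_to_locale(const wchar_t *wide, char *out, size_t out_cap)
{
    if (out_cap == 0)
        return (size_t)-1;

    /* wcstombs writes at most out_cap - 1 bytes here and does not add a
     * terminator if it stops because the buffer is full. */
    size_t n = wcstombs(out, wide, out_cap - 1);
    if (n == (size_t)-1)
        return (size_t)-1;

    out[n] = '\0';
    return n;
}
```

Whether a given character survives the conversion depends on the active locale, so the error return should always be checked.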

@R.: True; `printf("%ls")` will take care of the conversion for you. Good point. — Kerrek SB, Jul 01 '11 at 17:22
