3

I am getting an XML file from the Facebook API with the data:

<?xml version="1.0" encoding="UTF-8"?> 
<fql_query_response xmlns="api.facebook.com/1.0/" xmlns:xsi="w3.org/2001/XMLSchema-instance" list="true">
    <user> 
        <uid>100000022063315</uid> 
        <name>0xD7 0x99 0xD7 0x95 0xD7 0x97 0xD7 0x90 0xD7 0x99 0x20 0xD7 0x95 0xD7 0x9B 0xD7 0x98 0xD7 0xA8</name> 
    </user>
</fql_query_response>

I want to convert the UTF-8 data to wchar_t. I am trying to do so with mbstowcs, but apparently I need to know which locale to set. Is there a standard locale for Facebook, or for UTF-8?

phuclv
chacham15
  • 4
    UTF-8 doesn't have a "locale". It is just an [encoding](http://www.unicode.org/reports/tr17/) for [Unicode](http://en.wikipedia.org/wiki/Unicode) (maps Unicode codepoints onto one or more bytes, often for transmission) -- what "locale" is Unicode? –  Jun 24 '11 at 15:38
  • 4
    Any locale ending in ".utf8" will do, e.g. "en_US.utf8". Say `setlocale(LC_CTYPE, "en_US.utf8");` before you do `mbsrtowcs` and it should work. Alternatively use iconv going from UTF8 to WCHAR_T. – Kerrek SB Jun 24 '11 at 15:42
  • I would set it so that wchar_t strings are UTF-16 (or UTF-32 depending) – Martin York Jun 24 '11 at 16:46
  • @Kerrek SB: That's a Linux locale. On Windows, [If you provide a code page like UTF-7 or UTF-8, `setlocale` will fail, returning `NULL`](http://msdn.microsoft.com/en-us/library/x99tb11d.aspx) – MSalters Jun 25 '11 at 14:06
  • @Martin: You cannot control the _result_ of `mbstowcs`. It's just some implementation-defined fixed-width string. @MSalters. Good point. Use iconv (from UTF8 to WCHAR_T), as I suggest below. – Kerrek SB Jun 25 '11 at 14:11

3 Answers

5

To translate data that's not associated with the user's configured locale, but rather an explicitly specified encoding, you should use iconv, not mbsrtowcs. You don't need setlocale at all for this.
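A minimal sketch of that approach, assuming a POSIX system whose iconv accepts the "WCHAR_T" encoding name (glibc and GNU libiconv do). The helper name `utf8_to_wcs` is made up for illustration, and the buffer sizing relies on the fact that one UTF-8 byte never produces more than one wchar_t unit.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string into a newly
 * allocated, NUL-terminated wide string. Returns NULL on any failure.
 * The caller frees the result. No setlocale() call is involved. */
wchar_t *utf8_to_wcs(const char *utf8)
{
    assert(utf8 != NULL);

    /* Target encoding first, source encoding second. */
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = strlen(utf8);

    /* in_left wide characters are always enough; one extra slot holds the
     * terminator. calloc zeroes the buffer, so the terminator is in place. */
    wchar_t *result = calloc(in_left + 1, sizeof *result);
    if (result == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in = (char *)utf8;          /* iconv takes char **, not const char ** */
    char *out = (char *)result;
    size_t out_left = in_left * sizeof *result;

    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1) {           /* invalid or truncated input */
        free(result);
        return NULL;
    }
    return result;
}
```

Calling `utf8_to_wcs` on the bytes of the `<name>` element gives a wide string you can pass to `wcslen`, `wcscmp` and friends.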

R.. GitHub STOP HELPING ICE
4

As @pst notes, the terminology here is a bit off. "Locale" is sometimes used to refer to which ANSI code page represents international text when Unicode is not available.

Read Joel Spolsky's fantastic "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"

Now, to answer your question: if you need to convert UTF-8 encoded text to UTF-16 (what Windows commonly calls "wide char"), you can use a function such as MultiByteToWideChar with the code page parameter CP_UTF8.
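A rough, Windows-only sketch of that call, using the usual two-pass pattern (ask for the size, then convert). The helper name `utf8_to_utf16` is hypothetical, and passing `MB_ERR_INVALID_CHARS` is my own choice so that malformed input fails instead of being silently replaced.

```c
#ifdef _WIN32   /* everything below compiles only on Windows */
#include <assert.h>
#include <stdlib.h>
#include <windows.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string into a newly
 * allocated UTF-16 string. Returns NULL on failure; the caller frees it. */
wchar_t *utf8_to_utf16(const char *utf8)
{
    assert(utf8 != NULL);

    /* Pass 1: an output size of 0 asks for the required length in wchar_t
     * units. With an input length of -1 the count includes the terminator. */
    int needed = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                     utf8, -1, NULL, 0);
    if (needed == 0)
        return NULL;

    wchar_t *wide = malloc((size_t)needed * sizeof *wide);
    if (wide == NULL)
        return NULL;

    /* Pass 2: the actual conversion into the buffer we just sized. */
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8, -1, wide, needed) == 0) {
        free(wide);
        return NULL;
    }
    return wide;
}
#endif
```

The result is UTF-16, so characters outside the Basic Multilingual Plane occupy two wchar_t units.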

Assaf Lavie
0

Here is a little discussion I started a while ago on this subject.

Basically, I would personally distinguish two separate paths on encoding handling:

  • One is an encoding-agnostic, "internally portable" path that uses mbstowcs to convert external multibyte data, such as char * argv[], into an internal, fixed-width wide string, all without ever talking about encodings.

  • The other is a fixed-encoding, serializable path that deals with data that ships in deterministic encodings. To translate among those, the POSIX iconv library does the trick.

  • You can bridge between the two paths by using iconv's special WCHAR_T encoding.

Since the situation you describe requires you to read serialized, deterministic data, I would suggest using iconv to convert FROM UTF8 (which you know you have) TO WCHAR_T, which you can then process with the standard C wide string functions (but don't make assumptions about the actual encoding). If you need to print data to the console, you can always wcstombs from your internal wide strings to a multibyte representation (the details of which are again not your concern) that the console told you it wants.
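For the last step, going from the internal wide string back to whatever multibyte form the environment wants, something along these lines should do. It is only a sketch: the helper name `wcs_to_locale_mbs` is made up, and I use `wcsrtombs` (the restartable sibling of `wcstombs`) because standard C defines its behaviour when asked only for the required length.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to a newly allocated multibyte
 * string in the encoding of the current LC_CTYPE locale. Returns NULL if a
 * character cannot be represented there. The caller frees the result. */
char *wcs_to_locale_mbs(const wchar_t *wide)
{
    assert(wide != NULL);

    mbstate_t state;
    const wchar_t *src = wide;

    /* Pass 1: a NULL destination returns the byte count, terminator excluded. */
    memset(&state, 0, sizeof state);
    size_t needed = wcsrtombs(NULL, &src, 0, &state);
    if (needed == (size_t)-1)
        return NULL;

    char *mbs = malloc(needed + 1);
    if (mbs == NULL)
        return NULL;

    /* Pass 2: convert for real, with room for the terminating NUL. */
    memset(&state, 0, sizeof state);
    src = wide;
    if (wcsrtombs(mbs, &src, needed + 1, &state) == (size_t)-1) {
        free(mbs);
        return NULL;
    }
    return mbs;
}
```

Call `setlocale(LC_ALL, "")` once at program start so the conversion follows the user's environment. Without it the program runs in the "C" locale, where non-ASCII characters will typically fail to convert.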

Community
Kerrek SB