
I know this question has been asked quite a few times here, and I have read some of the answers, but several different solutions are suggested and I'm trying to figure out which of them is best.

I'm writing a C99 app that basically receives XML text encoded in UTF-8.

Part of its job is to copy and manipulate that string (finding a substring, concatenating, etc.).

As I would rather not use an external, non-standard library right now, I'm trying to implement it using wchar_t.

Currently, I'm using mbstowcs to convert it to wchar_t for easy manipulation, and for the input I tried in several languages it worked fine.

However, I have read that some people had issues with UTF-8 and mbstowcs, so I would like to know whether this use is acceptable.

Another option I came across was using iconv with the WCHAR_T parameter. However, I'm working on a platform (not a PC) whose locale support is limited to the ANSI C locale. How about that?

I also came across a very popular C++ library, but I'm limited to a C99 implementation.

Also, I will be compiling this code on another platform, where sizeof(wchar_t) is different (2 bytes versus 4 bytes on my machine). How can I overcome that? By using fixed-size char containers? But then, which manipulation functions should I use instead?

Happy to hear some thoughts. Thanks.

Yarel

2 Answers


C does not define what encoding the char and wchar_t types are and the standard library only mandates some functions that translate between the two without saying how. If the implementation-dependent encoding of char is not UTF-8 then mbstowcs will result in data corruption.

As noted in the rationale for the C99 standard:

However, the five functions are often too restrictive and too primitive to develop portable international programs that manage characters.

...

C90 deliberately chose not to invent a more complete multibyte- and wide-character library, choosing instead to await their natural development as the C community acquired more experience with wide characters.

Sourced from here.

So, if you have UTF-8 data in your chars there isn't a standard API way to convert that to wchar_ts.
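To illustrate the locale dependence, here is a minimal sketch of calling mbstowcs with error checking. The helper names `to_wide` and `select_utf8_locale` are hypothetical, and the locale names tried are assumptions: they are platform-specific and may not exist on a target that only provides the "C" locale, in which case UTF-8 input cannot be converted reliably this way.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names (assumed, not
 * guaranteed to be installed). Returns 1 if one was selected, 0 otherwise. */
int select_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    size_t i;
    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Hypothetical helper: convert src to wide characters using the CURRENT
 * locale's multibyte encoding. Returns the number of wide characters
 * written (excluding the terminator), or (size_t)-1 if src is not valid
 * in that encoding or does not fit into dst_len elements. */
size_t to_wide(wchar_t *dst, size_t dst_len, const char *src)
{
    size_t n;
    if (dst_len == 0)
        return (size_t)-1;
    n = mbstowcs(dst, src, dst_len);
    if (n == (size_t)-1)   /* invalid sequence for this locale */
        return (size_t)-1;
    if (n == dst_len)      /* buffer filled, no terminator written */
        return (size_t)-1;
    return n;
}
```

If `select_utf8_locale()` returns 0, the result of `to_wide` on non-ASCII UTF-8 input should not be trusted, which is the situation described in the question for the restricted platform.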

In my opinion wchar_t should usually be avoided unless necessary - you might need it if you're using WIN32 APIs for example. I am not convinced it will simplify string manipulation. wchar_t is always UTF-16LE on Windows so you may still need to have more than one wchar_t to represent a single Unicode code point anyway.

I suggest you investigate the ICU project - at least from an educational standpoint.

McDowell
  • Thanks a lot! I looked for some info on ICU but couldn't find any useful examples. Should I use ICU just for converting the string, or does it have functions for string manipulation as well? – Yarel Jan 14 '14 at 21:37
  • I suggest you start with [the ICU API](http://icu-project.org/apiref/icu4c/) to see if it meets your needs. – McDowell Jan 14 '14 at 22:10
  • As I understand it, in order to use the string manipulation functions (as explained here: [link](http://www.icu-project.org/apiref/icu4c/ustring_8h.html#details)) on a UTF-8 string in ICU, I would have to convert my string to UTF-16. The question is, if my string includes letters that take 3-4 bytes in UTF-8, how are they "translated" to UTF-16, which uses 1-2 bytes? – Yarel Jan 15 '14 at 07:42
  • Looks like you'll have to convert your UTF-8 encoded data "manually" to UTF-16. You can do this, sure. You'll have to scan your UTF-8 byte stream for single bytes as well as 2-, 3- and 4-byte sequences. I hope you know how to decode the code points. Any code point that is a surrogate is illegal, so drop it. For all code points up to 0xFFFF you can just store the value in your wchar (which should be 16 bits wide). For code points higher than 0xFFFF you must create a surrogate pair. If your wchar is 32 bits wide, just transcode from UTF-8 to UTF-32. – brighty Jan 15 '14 at 12:22
  • By the way, UTF-16 doesn't use 1-2 bytes, it uses 16-bit words. A surrogate pair is really a dword inside the word stream; it encodes a code point higher than 0xFFFF. In a surrogate pair, the high surrogate must come first, then the low surrogate. The reverse order is illegal, and surrogates that don't appear as a pair are orphaned surrogates. – brighty Jan 15 '14 at 12:28
  • The numbers in the UTF schemes indicate the width of the _code units_ - 8/16/32. For example, MATHEMATICAL_FRAKTUR_CAPITAL_G U+1D50A would be represented as `F0 9D 94 8A` in UTF-8, `D835 DD0A` in UTF-16BE, and `35D8 0ADD` in UTF-16LE. How _code points_ are encoded in UTF schemes is described on the [Unicode FAQ page](http://www.unicode.org/faq/utf_bom.html). – McDowell Jan 15 '14 at 16:11
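
The manual conversion described in the comments above can be sketched roughly as follows. This is an illustration under stated assumptions, not ICU's implementation: the function names `utf8_decode` and `utf16_encode` are hypothetical, and malformed input is reported to the caller (return value 0) rather than silently dropped.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the UTF-8 bytes s[0..len).
 * Stores it in *cp and returns the number of bytes consumed, or 0 if the
 * input is malformed (truncated, overlong, a surrogate, or above U+10FFFF). */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs n bytes, indexed by n. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t c;

    if (len == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (len < n) return 0;          /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_cp[n]) return 0;               /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;  /* surrogate code point */
    if (c > 0x10FFFF) return 0;                /* outside Unicode range */
    *cp = c;
    return n;
}

/* Hypothetical helper: encode one valid code point as UTF-16 code units.
 * Writes 1 or 2 units to out and returns how many were written. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) { out[0] = (uint16_t)cp; return 1; }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate first */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* then low surrogate */
    return 2;
}
```

For the example above, decoding the bytes `F0 9D 94 8A` yields U+1D50A, and encoding that code point produces the code units `D835 DD0A`. On a platform with a 32-bit wchar_t, the decoded code point can be stored directly and the second step skipped.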

Also, I will be compiling this code on another platform, where sizeof(wchar_t) is different (2 bytes versus 4 bytes on my machine). How can I overcome that? By using fixed-size char containers?

You could do that with conditional typedefs like this:

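#include <stddef.h>   /* wchar_t */
#include <stdint.h>   /* uint16_t, uint32_t */

#if defined(__STDC_UTF_16__)
   #include <uchar.h>          /* C11: char16_t and char32_t */
   typedef char16_t  CHAR16;
#elif defined(_WIN32)
   typedef wchar_t   CHAR16;   /* wchar_t is 16 bits wide on Windows */
#else
   typedef uint16_t  CHAR16;
#endif

#if defined(__STDC_UTF_32__)
   #include <uchar.h>
   typedef char32_t  CHAR32;
#elif defined(__STDC_ISO_10646__)
   typedef wchar_t   CHAR32;   /* wchar_t values are ISO 10646 code points */
#else
   typedef uint32_t  CHAR32;
#endif

This defines the typedefs CHAR16 and CHAR32 to use the C11 character types char16_t and char32_t if available (they are declared in <uchar.h>; C++11 has built-in types of the same names), but otherwise falls back to wchar_t when possible and to fixed-width unsigned integers otherwise.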
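
As for which manipulation functions to use with fixed-size containers: the C99 standard library has none for uint16_t or uint32_t strings, so one option is to write small equivalents yourself. The following is a minimal sketch assuming 0-terminated UTF-32 strings; it uses uint32_t directly so that it stands alone, and the names `u32_len`, `u32_find` and `u32_cat` are hypothetical. It compares code points one by one, so it does not account for normalization or combining sequences.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Length of a 0-terminated UTF-32 string, in code points (like strlen). */
size_t u32_len(const uint32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* First occurrence of needle in hay, or NULL if absent (like strstr). */
const uint32_t *u32_find(const uint32_t *hay, const uint32_t *needle)
{
    size_t hn = u32_len(hay), nn = u32_len(needle), i;
    if (nn > hn)
        return NULL;
    for (i = 0; i + nn <= hn; i++)
        if (memcmp(hay + i, needle, nn * sizeof *needle) == 0)
            return hay + i;
    return NULL;
}

/* Append src to dst, where dst has room for cap elements including the
 * terminator. Returns 1 on success, 0 if it would not fit (dst unchanged). */
int u32_cat(uint32_t *dst, size_t cap, const uint32_t *src)
{
    size_t dn = u32_len(dst), sn = u32_len(src);
    if (dn + sn + 1 > cap)
        return 0;
    memcpy(dst + dn, src, (sn + 1) * sizeof *src);  /* copies the terminator too */
    return 1;
}
```

With UTF-32 every element is a whole code point, so indexing and searching never split a character. The same functions written for 16-bit units would work on UTF-16 code units, where a substring boundary could fall between the two halves of a surrogate pair.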

dan04