5

Does it make any sense to store UTF-16 encoded text using wchar_t* on Linux? The obvious problem is that wchar_t is four bytes on Linux, while UTF-16 usually takes two bytes per character (or sometimes two groups of two bytes).

I'm trying to use a third-party library that does exactly that, and it seems very confusing. It looks like things are messed up because wchar_t is two bytes on Windows, but I want to double-check, since it's a pretty expensive commercial library and maybe I just don't understand something.

dda
  • You **can** store a 2-byte long value in a 4-byte long variable... –  Oct 12 '12 at 19:11
  • I think you need to read and understand [this](http://www.joelonsoftware.com/articles/Unicode.html) – Ozair Kafray Oct 12 '12 at 19:13
  • Is there a reason not to use a `uint16_t` to represent a UTF-16 code unit? – Mike Samuel Oct 12 '12 at 19:15
  • Did you check out http://www.gnu.org/software/libiconv/ ? – Rudolf Mühlbauer Oct 12 '12 at 19:15
  • I understand that it's technically possible, but it seems ugly, and seeing it in a commercial library made me doubt my understanding of Unicode, so I wanted to ask somebody whether it makes sense. Also, I know about iconv(); I will probably just convert it to UTF-8. –  Oct 12 '12 at 19:23
  • The natural C type for UTF16 is `char16_t`. – Kerrek SB Oct 12 '12 at 19:34
  • wchar_t strings are supposed to use the implementation defined wide character encoding. If you have code that assumes that encoding is something it's not (e.g., UTF-16 on Linux) there can be problems when that code tries to interoperate with other code that treats wchar_t correctly. For example, iconv will not correctly convert between UTF-16-in-4-byte-wchar_t and UTF-8. – bames53 Oct 12 '12 at 20:42
  • I think you should have a good look at http://utf8everywhere.org, if you are into writing portable code. – Pavel Radzivilovsky Oct 13 '12 at 10:10

4 Answers

7

While it's possible to store UTF-16 in wchar_t, such wchar_t values (or arrays of them used as strings) are not suitable for use with any of the standard functions which take wchar_t or pointers to wchar_t strings. As such, to answer your initial question of "Does it make sense...?", I would reply with a definitive no. You could use uint16_t for this purpose of course, or the C11 char16_t if it's available, though I fail to see any reason why the latter would be preferable unless you're also going to use the C11 functions for processing it (and they don't seem to be implemented yet).
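To make the mismatch concrete, here is a minimal sketch, assuming well-formed UTF-16 input. The names `smiley_utf16`, `smiley_widened`, `utf16_units` and `utf16_code_points` are made up for illustration, and the one-unit-per-wchar_t layout is only a guess at what such a library does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* U+1F600 in UTF-16: one character, two code units (a surrogate pair). */
const uint16_t smiley_utf16[] = { 0xD83D, 0xDE00, 0 };

/* The same code units copied one per wchar_t (hypothetical layout). */
const wchar_t smiley_widened[] = { 0xD83D, 0xDE00, 0 };

/* Number of 16-bit code units before the terminating zero. */
size_t utf16_units(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Number of code points, assuming well-formed UTF-16: every unit that is
 * not a low surrogate (0xDC00..0xDFFF) starts a new code point. */
size_t utf16_code_points(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != 0; s++) {
        if (*s < 0xDC00 || *s > 0xDFFF)
            n++;
    }
    return n;
}
```

With these definitions, `utf16_code_points(smiley_utf16)` counts one character, while `wcslen(smiley_widened)` sees two wide characters, because the standard wide-character functions treat each wchar_t element as a complete character.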

R.. GitHub STOP HELPING ICE
3

http://userguide.icu-project.org/strings says

The Unicode standard defines a default encoding based on 16-bit code units. This is supported in ICU by the definition of the UChar to be an unsigned 16-bit integer type. This is the base type for character arrays for strings in ICU.

So if you use ICU, then you can use UChar*. If not, uint16_t will make the transition easier should you ever want to interoperate with UChar.

Mike Samuel
1

Well, the best solution is probably to use char16_t for UTF-16, since that's the standard 16-bit character type. This has been supported since gcc 4.4, so should be present on most Linux systems you'll see.
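As a rough sketch of what that looks like in C11, assuming a toolchain that ships `<uchar.h>` and encodes `u"..."` literals as UTF-16 (gcc and clang on Linux do); `utf16_strlen` and `sample_text` are illustrative names, not part of any standard API:

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>   /* C11: char16_t */

/* Length of a zero-terminated char16_t string, in UTF-16 code units. */
size_t utf16_strlen(const char16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* u"..." literals have element type char16_t. U+1F600 lies outside the
 * Basic Multilingual Plane, so it is stored as a surrogate pair. */
const char16_t sample_text[] = u"A\U0001F600";
```

Here `utf16_strlen(sample_text)` counts three code units: one for `A` and two for the surrogate pair.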

Chris Dodd
0

No. What does make sense is to decode the UTF-16 and store the resulting code points in an array of wchar_t. Not every Unicode code point is a single 16-bit word in UTF-16, but they all fit in a wchar_t.
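A minimal sketch of that decoding step, assuming a platform where wchar_t can hold any code point (true on Linux, not on Windows). The function name `utf16_to_wide` and its error convention are invented for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

#if WCHAR_MAX < 0x10FFFF
#error "this sketch assumes wchar_t can hold any Unicode code point"
#endif

/* Decode zero-terminated UTF-16 from src into code points in dst.
 * cap is the capacity of dst in elements, including the terminator.
 * Returns the number of code points written, or (size_t)-1 on an
 * unpaired surrogate or when dst is too small. */
size_t utf16_to_wide(const uint16_t *src, wchar_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return (size_t)-1;

    size_t n = 0;
    while (*src != 0) {
        uint32_t cp = *src++;
        if (cp >= 0xD800 && cp <= 0xDBFF) {        /* high surrogate */
            uint32_t lo = *src;
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;                 /* missing low half */
            src++;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;                     /* stray low surrogate */
        }
        if (n + 1 >= cap)
            return (size_t)-1;                     /* no room left */
        dst[n++] = (wchar_t)cp;
    }
    dst[n] = L'\0';
    return n;
}
```

The result holds one code point per wchar_t element, which is the layout the standard wide-character functions on Linux expect.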

In any case, UTF-16 is a worse compromise than anything else, and should never be used. Either use UTF-8 (which is more efficient in most cases, and more commonly used), or use wchar_t[].

MarkR
  • The OP says "on Windows wchar_t is two bytes", so a wchar_t there cannot hold a supplementary code point, which suggests that "they all fit in a wchar_t" is not the case. I agree that UTF-16 is a poor choice for internal representation: it has the downsides of both UTF-8 (more complicated iteration) and UTF-32 (size bloat). But it is the standard for things like ICU, so one can make a library-interop case for it. – Mike Samuel Oct 13 '12 at 17:29