
Apart from storage size differences, what are the differences between using wchar_t (2-byte or 4-byte) and using the UTF-8 encoding for text-processing programs aimed at non-Western languages?

When using wchar_t, one can use the wide versions of the string functions in the C or C++ libraries in the same way, and with the same ease, as the non-wide ones. Are there issues with UTF-8 that add extra processing for strings of non-Western text compared to using the wide versions of the standard string functions?
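
To make the premise concrete, a minimal sketch (the string contents are just illustrative; it assumes the narrow literal holds the UTF-8 bytes for 'ü'):

```cpp
#include <cstdio>
#include <cstring>
#include <cwchar>

int main() {
    // The same text ("für") stored two ways.
    const wchar_t *wide = L"f\u00FCr";   // 'ü' fits in a single wchar_t unit
    const char    *utf8 = "f\xC3\xBCr";  // 'ü' becomes two bytes in UTF-8

    std::printf("wcslen: %zu\n", std::wcslen(wide)); // 3 -- one unit per character here
    std::printf("strlen: %zu\n", std::strlen(utf8)); // 4 -- counts bytes, not characters
    return 0;
}
```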

Al Berger
  • Related: There's an incompatibility between the MS wide printf/scanf functions and the standard. Also, your question is quite tendentious. If anything, I would suggest you go for UTF-8 all the way, and convert only on calling MS APIs. – Deduplicator May 11 '14 at 13:11
  • Note: on Linux all the normal string functions work with UTF-8 by default without problems; there are only problems when you talk about Windows programs. – AliciaBytes May 11 '14 at 13:19
  • @Deduplicator if you're just passing strings through I would agree with you. However if you need to do actual string manipulation (particularly taking into account whole code points) then `char32_t` is a preferable serialization as it is guaranteed to always be one code point per unit. I would still output in UTF-8 – Mgetz May 11 '14 at 13:19
  • @Mgetz: "Serialization"? I think you meant internal processing format, and I tend to disagree as it is bigger, and you have to account for composition anyway. – Deduplicator May 11 '14 at 13:22
  • @Mgetz actually I'd prefer utf-8 especially for serialization because you don't have to care about endianness. A code point being the same as a code unit is a weak point in my opinion since that most of the time ain't enough either and you'd have to work on grapheme clusters. – AliciaBytes May 11 '14 at 13:23
  • @Deduplicator UTF-32 has no surrogate pairs, when doing text layout that is vastly preferable – Mgetz May 11 '14 at 13:23
  • @Mgetz: Neither has UTF-8, what's your point? – Deduplicator May 11 '14 at 13:24
  • @Mgetz - This is the point of my question: what inconvenience does the variable length of UTF-8 characters add, apart from the inability to index string characters with the [] operator? – Al Berger May 11 '14 at 13:24
  • @Mgetz we aren't talking about surrogate pairs but rather multiple code points making up one visible character, for extreme examples look up some zalgo text, etc. – AliciaBytes May 11 '14 at 13:25
  • @AlBerger 99% of the time nothing; there are very special cases (usually text layout) that UTF-32 makes a tiny bit easier. However, `strchr` doesn't work for East Asian text, for example (nor does it work with a `char16_t` version of `wchar_t`). – Mgetz May 11 '14 at 13:25
  • @AlBerger: The problem is one of terminology if you speak of `characters`, and if you define a character as a code point, that is vastly overrated. Going for `wchar_t` or always UTF-16 doesn't buy you anything in the indexing department anyway. – Deduplicator May 11 '14 at 13:26
  • @Mgetz: It makes it a tiny bit easier, by making it extremely easy to neglect the corner-cases and … – Deduplicator May 11 '14 at 13:28
  • @Mgetz I usually think about it differently. I agree that it mostly (99%) doesn't buy you anything, but when it matters (text layout) UTF-32 isn't enough either and you'd need to work with grapheme clusters. – AliciaBytes May 11 '14 at 13:28
  • @Mgetz for my point, look here: http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work utf-32 won't be enough and you have to work on grapheme clusters to really make it work correctly – AliciaBytes May 11 '14 at 13:30
  • @RaphaelMiedl Agreed, however unicode normalization can help there a bit. However you still can't use `strchr` to find 'PILE OF POO' (U+1F4A9). – Mgetz May 11 '14 at 13:32
  • @Raphael - "all normal string functions work with utf-8 by default without problems" - what was the reason for wchar_t? Only that characters in some eastern languages take in UTF-8 3 bytes instead of 2? – Al Berger May 11 '14 at 13:32
  • So, the best guidance is this: go for [UTF-8 everywhere](http://www.utf8everywhere.org/) and convert only on calls to MS APIs, though you might take UTF-16 as glue between them if you do negligible intermediate processing. – Deduplicator May 11 '14 at 13:32
  • @AlBerger when wchar_t was introduced, UTF-16 could basically still hold all Unicode characters, and UTF-32/UTF-8 weren't invented or common yet afaik – AliciaBytes May 11 '14 at 13:34
  • @AlBerger: When people thought 64K was quite enough, thank you very much, and Unicode would always be UCS-2 (UTF-8 was invented shortly after), it made some kind of sense to design for 16-bit Unicode. The rationale broke down, though, when Unicode outgrew 16 bits, and going for full-codepoint units by now is just wasteful. – Deduplicator May 11 '14 at 13:35
  • UTF-8 is an encoding, `wchar_t` is a data type. They're not really comparable. On Windows, `wchar_t` is typically used to represent UTF-16 or UCS-2 data, and on *nixes, it is typically used for UTF-32 data. – jalf May 11 '14 at 13:35
  • Aside: Using UTF-16 for `wchar_t` breaks the standard, because it cannot represent all unicode codepoints in one unit ;-) – Deduplicator May 11 '14 at 13:36
  • @AlBerger Afaik Windows also uses UTF-16, since it could hold all characters back then and was introduced a year before UTF-8 was invented; got that from the comments of a blog post by Eric Lippert about C#/.NET encoding, won't look it up at the moment though since I'm writing from my phone. – AliciaBytes May 11 '14 at 13:38
  • Raphael: For this Q&A, explicitly naming the concept you mean by "character" would be vastly preferable. – Deduplicator May 11 '14 at 13:52
  • UTF-8 is a good choice for *external* text representation. For internal representation in Windows it's ungood. First, console input of UTF-8 is not supported at the API level (although output works). Secondly, the Visual C++ runtime library doesn't support UTF-8. Third, Visual C++ does not support UTF-8 literals. Add to that the slight inefficiency and not-so-slight inconvenience and verbosity of interfacing with UTF-16 based libraries including the API, plus the high chance of some code confusing UTF-8 and Windows ANSI, and a choice of UTF-8 for internal representation is heading for problems – Cheers and hth. - Alf May 11 '14 at 14:11
  • The `wchar.h` APIs were codified at a time when we didn't really know how to do i18n, and it shows. – zwol May 11 '14 at 14:46
  • @Cheersandhth.-Alf: Cannot concur for internal representation; the only points against are the conversion when interfacing with Windows APIs and that you have to avoid broken MS libraries (wide `printf`/`scanf` is problematic there too), which is far outweighed by being able to use external data (in files/over sockets/from other programs) directly. All but UTF-8 really fail when going for editing, because while UTF-8 can represent any broken UTF-16, the other way is just a no-go. – Deduplicator May 11 '14 at 15:13
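
To illustrate the `strchr`/code-unit point from the comments above, a minimal sketch (U+1F4A9 is the character mentioned; the hex escapes are its UTF-8 bytes):

```cpp
#include <iostream>
#include <string>

int main() {
    // U+1F4A9: one char32_t unit, four bytes in UTF-8,
    // and a surrogate pair (two units) in UTF-16.
    std::u32string wide = U"abc\U0001F4A9def";
    std::string    utf8 = "abc\xF0\x9F\x92\xA9" "def"; // same text, UTF-8 encoded

    // With UTF-32, one code point is one array element, so a plain
    // character search works:
    std::cout << wide.find(U'\U0001F4A9') << '\n';      // 3

    // strchr/find-by-char cannot locate it in UTF-8 (no single byte *is*
    // the character), but a substring search for its byte sequence works,
    // because UTF-8 is self-synchronizing:
    std::cout << utf8.find("\xF0\x9F\x92\xA9") << '\n'; // 3 (a byte offset, not a character index)
    return 0;
}
```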

1 Answer


Assuming the library functions work for UTF-8 (which is generally not true on Windows), there's no real problem as long as you actually USE the library functions. However, if you write code that manually interprets the individual elements of a string array, you need to take into account that a code-point can be more than a single byte in UTF-8 - particularly when dealing with non-English characters (including, for example, German/Scandinavian characters such as 'ä', 'ö', 'ü'). And even with 16 bits per entry, you can find situations where one code-point takes up two 16-bit entries (a surrogate pair).
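
For example, a minimal sketch of the kind of element-by-element handling this implies (the helper name is just for illustration, and it assumes the input is valid UTF-8):

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Count code-points in a UTF-8 string by skipping continuation bytes
// (bytes of the form 10xxxxxx). Assumes the input is valid UTF-8.
static std::size_t utf8_codepoints(const char *s) {
    std::size_t count = 0;
    for (; *s; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) // this byte starts a code-point
            ++count;
    }
    return count;
}

int main() {
    const char *text = "\xE6\x97\xA5\xE6\x9C\xAC"; // "日本": 2 code-points, 3 bytes each
    std::printf("bytes: %zu, code-points: %zu\n",
                std::strlen(text), utf8_codepoints(text));
    // prints: bytes: 6, code-points: 2
    return 0;
}
```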

If you don't take this into account, the separate parts can "confuse" the processing - e.g. a byte in the middle of a code-point can be recognised as having a meaning of its own, rather than as the middle of something.

The variable length of a code-point leads to all sorts of interesting effects on, for example, string lengths and substrings - where the length is in number of elements of the array holding the string, which can be quite different from the number of code-points.
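
A minimal sketch of that effect (the hex escapes are the UTF-8 bytes for 'ü' and 'ß'; this is an illustration of the pitfall, not a recommended way to take substrings):

```cpp
#include <iostream>
#include <string>

int main() {
    // "Grüße" is 5 code-points but 7 bytes in UTF-8
    // ('ü' and 'ß' take two bytes each).
    std::string s = "Gr\xC3\xBC\xC3\x9F" "e";

    std::cout << s.size() << '\n';        // 7 -- elements (bytes), not code-points

    // Naively taking the "first three characters" by element count cuts
    // the 'ü' sequence in half and leaves invalid UTF-8 behind:
    std::string broken = s.substr(0, 3);  // "Gr" plus a lone lead byte 0xC3
    std::cout << broken.size() << '\n';   // 3, but the text is now malformed

    // A correct substring has to advance one whole code-point at a time.
    return 0;
}
```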

Whichever encoding is used, there are further complications with, for example, Arabic script, where individual characters need to be joined together when rendered (shaping). This is of course only important when actually drawing characters, but is worth at least bearing in mind.

Terminology (for my writings!):

Character = A letter/symbol that can be displayed on screen.

Code-point = the representation of a character in a string; may be one or more elements in a string array.

String array = the storage for a string; consists of elements of a fixed size (e.g. 8 bits, 16 bits, 32 bits, 64 bits).

String Element = One unit of a string array.

Mats Petersson
  • Please remember that "character" has many meanings, e.g: byte, codeunit, codepoint, grapheme, grapheme-cluster. For this Q&A, they really should be differentiated explicitly. – Deduplicator May 11 '14 at 13:41
  • Yes, good point. I will try to clarify (as best as a I can ;) ) – Mats Petersson May 11 '14 at 13:44
  • I hate to nitpick (actually I don't), but the symbol to be displayed on screen is a glyph, not a character. ;) – jalf May 11 '14 at 13:59