
First, I am developing a platform-independent library in ANSI C (not C++, and without any non-standard libraries like the MS CRT or glibc).

After some research, I found that one of the best ways to handle internationalization in ANSI C is to use the UTF-8 encoding.

In UTF-8:

  • strlen(s) always counts the number of bytes.
  • mbstowcs(NULL, s, 0) can be used to count the number of characters (see the example below).
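For example, a minimal sketch (note that mbstowcs() is locale-dependent, so this assumes a UTF-8 locale is available and selected):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
  /* "naïve": 5 characters encoded in 6 bytes of UTF-8 */
  const char *s = "na\xC3\xAFve";

  /* assumption: a UTF-8 locale exists on this system */
  setlocale(LC_ALL, "en_US.UTF-8");

  printf("bytes:      %lu\n", (unsigned long)strlen(s));            /* 6 */
  printf("characters: %lu\n", (unsigned long)mbstowcs(NULL, s, 0)); /* 5 */
  return 0;
}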

But I run into problems when I want random access to the elements (characters) of a UTF-8 string.

In ASCII encoding:

char get_char(char* ascii_str, int n)
{
  // It is very FAST.
  return ascii_str[n];
}

In UTF-16/32 encoding:

wchar_t get_char(wchar_t* wstr, int n)
{
  // It is very FAST.
  return wstr[n];
}

And here is my problem with the UTF-8 encoding:

// What should the return type be?
// A single UTF-8 character can occupy 8, 16, 24, or 32 bits.
/*?*/ get_char(char* utf8str, int n)
{
  // I can find the Nth character of the string with a loop,
  // but that is too slow.
  // What is the best way?
}

Thanks.

Amir Saniyan
  • Do you have an example of a case where there's any use for "Nth character"? – R.. GitHub STOP HELPING ICE Jun 29 '11 at 00:09
  • `mbstowcs` does not guarantee to do what you claim. It depends on your locale settings and is generally encoding-agnostic. Use `iconv` or something like that if you handle definite encodings. – Kerrek SB Jun 29 '11 at 00:18
  • @R: `replace(char* str){for(...){...get_char(i)...}` – Amir Saniyan Jun 29 '11 at 00:21
  • @kerrek: I should use ANSI C. I don't want to use any non-standard headers. – Amir Saniyan Jun 29 '11 at 00:23
  • @Amir: Can you loop on a pointer instead of the character index? For example, you can do: for(p=str; *p!=NULL; move_one_char_forward(&p)) {...} – Todd Li Jun 29 '11 at 00:23
  • @Amir: ANSI C is not encoding-aware. Your question explicitly demands Unicode, so the only two answers are a) write your own complete Unicode library in ANSI C, or b) take an existing, extremely widespread and popular POSIX-conforming library. – Kerrek SB Jun 29 '11 at 00:24

4 Answers


Perhaps you're thinking about this a bit wrongly. UTF-8 is an encoding which is useful for serializing data, e.g. writing it to a file or the network. It is a very non-trivial encoding, though, and a raw string of Unicode codepoints can end up in any number of encoded bytes.

What you should probably do, if you want to handle text (given your description), is to store raw, fixed-width strings internally. If you're going for Unicode (which you should), then you need 21 bits per codepoint, so the nearest integral type is uint32_t. In short, store all your strings internally as arrays of integers. Then you can random-access each codepoint.

Only encode to UTF-8 when you are writing to a file or console, and decode from UTF-8 when reading.
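As an illustration, here is a minimal decoding sketch (my own helper, not a standard function; it assumes well-formed input, and a real implementation must also reject overlong sequences, surrogates, and codepoints above U+10FFFF). Note that uint32_t comes from the C99 header <stdint.h>; under strict C89 you would use unsigned long instead:

#include <stddef.h>
#include <stdint.h>

/* Decode a NUL-terminated UTF-8 string into 'out' (room for 'cap'
   codepoints).  Returns the number of codepoints written.
   Assumes well-formed UTF-8: no validation is performed. */
size_t utf8_decode(const char *s, uint32_t *out, size_t cap)
{
  const unsigned char *p = (const unsigned char *)s;
  size_t n = 0;

  while (*p && n < cap) {
    uint32_t cp;
    if (*p < 0x80) {                  /* 0xxxxxxx: 1 byte  */
      cp = *p++;
    } else if ((*p & 0xE0) == 0xC0) { /* 110xxxxx: 2 bytes */
      cp  = (uint32_t)(*p++ & 0x1F) << 6;
      cp |= (uint32_t)(*p++ & 0x3F);
    } else if ((*p & 0xF0) == 0xE0) { /* 1110xxxx: 3 bytes */
      cp  = (uint32_t)(*p++ & 0x0F) << 12;
      cp |= (uint32_t)(*p++ & 0x3F) << 6;
      cp |= (uint32_t)(*p++ & 0x3F);
    } else {                          /* 11110xxx: 4 bytes */
      cp  = (uint32_t)(*p++ & 0x07) << 18;
      cp |= (uint32_t)(*p++ & 0x3F) << 12;
      cp |= (uint32_t)(*p++ & 0x3F) << 6;
      cp |= (uint32_t)(*p++ & 0x3F);
    }
    out[n++] = cp;
  }
  return n;
}

Once decoded, out[n] is the nth codepoint in constant time.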

By the way, a Unicode codepoint is still a long way from a character. The concept of a character is just far too high-level to have a simple general mechanic. (E.g. "a" + "accent grave" -- two codepoints, how many characters?)

Kerrek SB
  • Yes, you are right; it is better to use a fixed-size character type instead of the UTF-8 encoding. Now I want to know which type is better for Unicode strings: wchar_t or uint32_t? My answer is wchar_t, but is that a correct or a wrong choice? – Amir Saniyan Jun 29 '11 at 00:40
  • Wrong. Use `uint32_t`. Your `wchar_t` doesn't come with any size guarantees. Check out [my recent rant](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability) if you're curious about this subject in general. – Kerrek SB Jun 29 '11 at 00:41
  • @Kerrek SB: in the standard, `wchar_t` comes with a very clear size guarantee: to be able to store any character of the execution character set. That some C libraries are broken and don't provide this guarantee is a different story. – ninjalj Jun 29 '11 at 00:48
  • @Amir: In short, because it's broken. Long answer is that MS uses UTF-16 internally. This is somewhat silly because UTF16 is also a multibyte encoding, just like UTF8, so it's questionable how that helps. The historic reason is that when MS fixed their standards, Unicode had fewer than 65000 registered codepoints so everyone thought that "16 bits are enough". I think that was in 1999 ;-) – Kerrek SB Jun 29 '11 at 00:51
  • @ninjalj: OK, I stand corrected, the standard doesn't come with an _absolute_ size guarantee. Nitpicker! :-) If my execution character set consists of only 300 characters, I could have a conforming implementation with a 9-bit wchar. – Kerrek SB Jun 29 '11 at 00:52
  • @Kerrek: "à" = 1 character, 1 grapheme cluster. "a"+"`" = 2 characters, 1 grapheme cluster – ninjalj Jun 29 '11 at 00:52
  • @Kerrek SB: or more usefully, if you have an embedded device and you only want to support English text, you can make `char` = `wchar_t`. See, C tries to be useful in every circumstance. – ninjalj Jun 29 '11 at 00:54
  • @ninjalj: Thanks, my point exactly! Now mix in normalization and tell me what the answer to "how many characters" should be in any meaningful textual data processing model. It's really a pretty high-level question. Is a zero-width joiner a character? – Kerrek SB Jun 29 '11 at 00:55
  • @ninjalj: Very, and we love C! I for one am very happy that the standard does *not* say anything about encodings but still acknowledges that "char" was a misnomer that should have been called "byte", and makes up for it with a genuine character type. My main beef is with the silly way Windows handles things, but that's not C's fault. – Kerrek SB Jun 29 '11 at 00:56
  • @Kerrek: ZWNJ is quite explicitly a non-character. – ninjalj Jun 29 '11 at 01:08
  • @ninjalj: not according to Unicode terminology: ZWNJ is a character, just a non-printing one; only codepoints with special meaning (surrogates, explicit non-characters) are not considered characters – Christoph Jun 29 '11 at 05:48

You simply can't. If you do need a lot of such queries, you can build an index for the UTF-8 string, or convert it to UTF-32 up front. UTF-32 is the better in-memory representation, while UTF-8 is good on disk.
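If sequential access is enough, you don't even need the conversion: the lead byte tells you the length of each character, so stepping forward one character at a time is cheap even though indexing is not. A minimal sketch (a hypothetical helper, assuming valid UTF-8):

/* Advance past one UTF-8 character: skip the lead byte, then
   any continuation bytes of the form 10xxxxxx. */
const char *utf8_next(const char *p)
{
  ++p;
  while (((unsigned char)*p & 0xC0) == 0x80)
    ++p;
  return p;
}

With this you can write for (p = str; *p != '\0'; p = utf8_next(p)) { ... } instead of indexing by character number.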

By the way, the code you listed for UTF-16 is not correct either: you may need to take care of surrogate pairs.

Todd Li
  • UTF-32 is useful when you need to deal with individual characters. Most often, you don't care, and just want to move strings back and forth, which is why UTF-8 is so widespread. – ninjalj Jun 29 '11 at 00:50

What do you want to count? As Kerrek SB has noted, you can have decomposed glyphs, i.e. "é" can be represented as a single character (LATIN SMALL LETTER E WITH ACUTE, U+00E9), or as two characters (LATIN SMALL LETTER E, U+0065, followed by COMBINING ACUTE ACCENT, U+0301). Unicode has composed and decomposed normalization forms.

What you are probably interested in counting is not characters, but grapheme clusters. You need some higher-level library to deal with this, and to deal with normalization forms, proper (locale-dependent) collation, proper line-breaking, proper case-folding (e.g. German ß -> SS), proper bidi support, etc. Real I18N is complex.
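To make the composed/decomposed distinction concrete, these are the two UTF-8 byte sequences (the byte values are simply the standard encodings of the codepoints named above):

/* both render as "é", but differ at the byte and codepoint level */
const char *composed   = "\xC3\xA9";  /* U+00E9:        1 codepoint,  2 bytes */
const char *decomposed = "e\xCC\x81"; /* U+0065 U+0301: 2 codepoints, 3 bytes */

strlen() returns 2 for the first and 3 for the second, a codepoint count gives 1 and 2, and only a grapheme-cluster count gives 1 for both.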

ninjalj
  • And decent discussions of Unicode use "code point" where you might traditionally use "character" for exactly this reason: Historical baggage means that "character" is too ambiguous when you want to distinguish between graphemes/glyphs/grapheme clusters/ligatures/... – tc. Jun 29 '11 at 01:50

Contrary to what others have said, I don't really see a benefit in using UTF-32 instead of UTF-8: when processing text, grapheme clusters (or 'user-perceived characters') are far more useful than Unicode characters (i.e. raw codepoints), so even UTF-32 has to be treated as a variable-length encoding.

If you do not want to use a dedicated library, I suggest using UTF-8 as the on-disk, endian-agnostic representation, and modified UTF-8 (which differs from UTF-8 by encoding the zero character as a two-byte sequence) as an in-memory representation compatible with ASCIIZ.
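The only place modified UTF-8 differs from standard UTF-8 is in the encoding of U+0000. A sketch (a hypothetical helper, covering just that case and plain ASCII):

#include <stddef.h>

/* Write one codepoint in modified UTF-8; returns bytes written. */
size_t mutf8_put(unsigned char *dst, unsigned long cp)
{
  if (cp == 0) {        /* the one difference: overlong NUL, C0 80 */
    dst[0] = 0xC0;
    dst[1] = 0x80;
    return 2;
  }
  if (cp < 0x80) {      /* plain ASCII passes through unchanged */
    dst[0] = (unsigned char)cp;
    return 1;
  }
  /* ... the remaining cases are encoded exactly as in UTF-8 ... */
  return 0;
}

Because neither byte of the two-byte sequence is zero, embedded NULs survive inside ordinary NUL-terminated strings.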

The necessary information for splitting strings into grapheme clusters can be found in Unicode Standard Annex #29 and the Unicode Character Database.

Christoph