
I have a basic understanding of UTF-8: code points are encoded with a variable number of bytes, so a "character" can take 8 bits, 16 bits, or even more.

What I'm wondering is whether there is some sample code, library, etc. in C that does for a UTF-8 string the kinds of things the standard C library does for ordinary strings, e.g. tell the length of the string.

Thanks,

tchrist
lang2
  • For length, see e.g. http://stackoverflow.com/q/5117393/440558 – Some programmer dude Jun 08 '12 at 11:49
  • Keep in mind that e.g. strlen() works perfectly well on UTF-8 encoded data; it gives you the length of the UTF-8 string. It does not give you the number of Unicode characters in that string, though. – nos Jun 08 '12 at 11:52
  • some more links from stackoverflow http://stackoverflow.com/questions/1031645/how-to-detect-utf-8-in-plain-c , http://stackoverflow.com/questions/4607413/c-library-to-convert-unicode-code-points-to-utf8/4609989 and to a code snippet I wrote a few weeks ago https://bitbucket.org/cggaertner/libtcu/raw/5ea138154ba5/utf8z.h – Christoph Jun 08 '12 at 12:14
  • @nos This is wrong, in several ways. Certainly `strlen` doesn’t work at all if there are U+0000 code points in the string, which are completely legal. It is disingenuous to say that it tells you the “length” of the string. It doesn’t. It tells you the number of bytes only, and not the number of code points, which is what you would want. – tchrist Jun 10 '12 at 02:23
  • @tchrist Remember that we are talking about UTF-8 encoded strings here. In C code, a UTF-8 string ends when you hit a null byte. The length of the UTF-8 string might or might not be what you want, e.g. you do need the number of bytes if you're copying the string into a new buffer, or if you need to determine whether the string fits in a limited-length database field. – nos Jun 10 '12 at 08:55
  • @tchrist `strlen` doesn't work for ASCII strings that contain the ASCII code NUL either. But we don't go around saying it doesn't work for ASCII strings, do we? – bames53 Apr 04 '15 at 19:15
  • @tchrist You are also incorrect in assuming that the only length anyone cares about is the number of code points. In fact in my experience code that correctly handles Unicode data more often needs to know the amount of storage used (in either bytes or code units), or the number of characters (e.g., number of grapheme clusters) than it needs to know the number of code points. – bames53 Apr 04 '15 at 19:19
  • @bames53 In what way are you saying that the number of grapheme clusters is an “example” of the number of characters? That makes no sense whatsoever. – tchrist Apr 04 '15 at 19:20
  • @tchrist What counts as a character can be application specific. The Unicode grapheme cluster algorithms provide a couple reasonable defaults, but applications may prefer to do text segmentation differently. – bames53 Apr 04 '15 at 19:28
  • @bames53 Did you mean to say “i.e. = that is” instead of “e.g. = for example”? I just don’t understand the example part. – tchrist Apr 04 '15 at 19:29
  • @tchrist No, I meant "For example". The Unicode grapheme cluster algorithms are examples of a couple ways an application may care to do text segmentation. – bames53 Apr 04 '15 at 19:30
  • @nos how's that perfectly well? – lang2 Mar 11 '21 at 05:27
  • @lang2 It works perfectly well if you need the number of bytes, which is a very common use case. If you need the number of characters as they would be displayed, it does not "work". However, a function that gives you the number of characters/code points would not work at all if you need the number of bytes. So it is crucial that you know the difference between these two ways of measuring a string when you work with C code. – nos Mar 16 '21 at 13:32 (a short illustration follows this comment thread)

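To make the bytes-versus-code-points distinction from the comments above concrete, here is a small illustration (my own sketch, not from any of the commenters; the é is written as the explicit escape `\xC3\xA9` so it does not depend on the source file's encoding):

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "héllo": five code points but six bytes, because é (U+00E9)
     * is encoded in UTF-8 as the two bytes 0xC3 0xA9. */
    const char *s = "h\xC3\xA9llo";
    printf("strlen reports %zu bytes\n", strlen(s));  /* prints 6 */
    return 0;
}
```
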
3 Answers


GNU does have a Unicode string library, called libunistring, but it doesn’t handle anything nearly as well as ICU’s does.

For example, the GNU library doesn’t even give you access to collation, which is the basis for all string comparison. In contrast, ICU does. Another thing ICU has that the GNU library doesn’t appear to have is Unicode regexes. For that, you might like to use Phil Hazel’s excellent PCRE library for C, which can be compiled with UTF-8 support.
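To give a feel for what ICU collation looks like from C, here is a minimal, hedged sketch using `ucol_open`, `u_strFromUTF8`, and `ucol_strcoll` from ICU's C API (assumes ICU is installed and that you build with something like `-licuuc -licui18n`; the locale and buffer sizes are only illustrative):

```c
#include <stdio.h>
#include <unicode/ucol.h>
#include <unicode/ustring.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;

    /* Open a locale-aware collator; the locale choice is illustrative. */
    UCollator *coll = ucol_open("en_US", &status);
    if (U_FAILURE(status)) return 1;

    /* ICU collates UTF-16 strings, so convert the UTF-8 inputs first.
     * "c\xC3\xB4te" is "côte" spelled with an explicit escape. */
    UChar a[32], b[32];
    u_strFromUTF8(a, 32, NULL, "cote", -1, &status);
    u_strFromUTF8(b, 32, NULL, "c\xC3\xB4te", -1, &status);
    if (U_FAILURE(status)) { ucol_close(coll); return 1; }

    UCollationResult r = ucol_strcoll(coll, a, -1, b, -1);
    printf("collation result: %s\n",
           r == UCOL_LESS ? "less" : r == UCOL_GREATER ? "greater" : "equal");

    ucol_close(coll);
    return 0;
}
```

The point of the sketch is that the collator applies the locale's ordering rules; a byte-wise `strcmp` can only compare code unit values.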

However, it might be that the GNU library is enough for what you need. I don’t like its API much. Very messy. If you like C programming, you might try the Go programming language, which has excellent Unicode support. It’s a new language, but small and clean and fun to use.
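If the GNU library does turn out to be enough, the length question from the original post is short with it. A minimal sketch, assuming libunistring is installed and that `u8_strlen` and `u8_mbsnlen` from `<unistr.h>` behave as its manual describes (the former counts units, i.e. bytes; the latter counts Unicode characters); link with `-lunistring`:

```c
#include <stdio.h>
#include <stdint.h>
#include <unistr.h>   /* libunistring */

int main(void)
{
    /* "héllo" with an explicit escape: 6 bytes encoding 5 code points. */
    const uint8_t *s = (const uint8_t *)"h\xC3\xA9llo";

    size_t units = u8_strlen(s);           /* number of bytes (units)      */
    size_t chars = u8_mbsnlen(s, units);   /* number of Unicode characters */

    printf("%zu bytes, %zu code points\n", units, chars);
    return 0;
}
```
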

On the other hand, the major interpreted languages — Perl, Python, and Ruby — all have varying support for Unicode that is better than you’ll ever get in C. Of those, Perl’s Unicode support is the most developed and robust.

Remember: it isn’t enough to support more characters. Without the rules that go with them, you don’t have Unicode. At most, you might have ISO 10646: a large character repertoire but no rules. My mantra is “Unicode isn’t just more characters; it’s more characters plus a whole bunch of rules for handling them.”

tchrist

The foremost library for handling Unicode is IBM's ICU.

But if all you need to do is determine the number of code points in a UTF-8 encoded string, count the number of chars with values between \x01 and \x7F or between \xC2 and \xFF.
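A rough sketch of that counting approach (my own illustration, with a made-up function name), assuming valid, NUL-terminated UTF-8 input; the upper bound 0xF4 follows the correction given in the comments below:

```c
#include <stddef.h>

/* Count code points by counting ASCII bytes (0x01..0x7F) and multi-byte
 * lead bytes (0xC2..0xF4); continuation bytes (0x80..0xBF) are skipped.
 * Assumes the input is valid UTF-8. */
static size_t utf8_codepoint_count(const char *s)
{
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p >= 0x01 && *p <= 0x7F) || (*p >= 0xC2 && *p <= 0xF4))
            n++;
    }
    return n;
}
```

For the 6-byte UTF-8 encoding of "héllo", for instance, this returns 5 where strlen would return 6.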

Mr Lister
  • `\xC2` to `\xF4`, actually - Unicode stops at `U+10FFFF`. It's probably easier just to discount continuation bytes, and you can do that with a single operation: `(c & 0xC0) != 0x80`. – ecatmur Jun 08 '12 at 12:17
  • Sure, or, assuming that chars are signed, `C >= '\xC2'` – Mr Lister Jun 09 '12 at 12:45
  • Also, Unicode is more than a character set. You must also account for things like _canonical equivalence_ (where you should treat a string containing, for example, `U+0178` as identical to one containing `U+0059` `U+0308` even though the first one is 2 bytes long in UTF-8 and the second one 3 bytes). But that might be outside the scope of this question. – Mr Lister Jun 09 '12 at 12:53 (see the normalization sketch after this comment thread)
  • Code Units* a codepoint is basically a character or glyph (with asterisks, but that's the general idea) – MarcusJ Sep 27 '16 at 08:57
  • @Marcus Nope. In UTF-8, a code unit is an 8-bit byte. That was the whole problem! We needed to count code points rather than code units! I'm not sure what you mean by asterisks though. – Mr Lister Sep 27 '16 at 09:56
  • I guess you're right, I must've misread the Unicode standard. – MarcusJ Sep 27 '16 at 20:53
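To illustrate the canonical-equivalence point from the comment above, here is a hedged sketch using ICU's C normalization API (`unorm2_getNFCInstance` and `unorm2_normalize` from `<unicode/unorm2.h>`): it normalizes both `U+0178` and the pair `U+0059` `U+0308` to NFC and then compares them, at which point they compare equal.

```c
#include <stdio.h>
#include <unicode/unorm2.h>
#include <unicode/ustring.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);
    if (U_FAILURE(status)) return 1;

    /* U+0178 (Ÿ) versus U+0059 U+0308 (Y followed by combining diaeresis). */
    const UChar a[] = { 0x0178, 0 };
    const UChar b[] = { 0x0059, 0x0308, 0 };

    UChar na[8], nb[8];
    unorm2_normalize(nfc, a, -1, na, 8, &status);
    unorm2_normalize(nfc, b, -1, nb, 8, &status);
    if (U_FAILURE(status)) return 1;

    printf("canonically %s\n", u_strcmp(na, nb) == 0 ? "equivalent" : "different");
    return 0;
}
```
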

If you are interested in a library that doesn't allocate memory and uses the stack, you could try utf8rewind.

Grzegorz Adam Hankiewicz