I am not sure whether my assumptions are correct, but it seems to me that all four kinds of length of a multibyte sequence can differ. To illustrate:
Say the multibyte encoding is UTF-8, and we have the string "\xc3\xb8 \xe2\x86\x82 e\xcc\x88", the UTF-8 encoding of "\u00f8 \u2182 e\u0308", i.e. "ø ↂ ë".
This string has a length of:
1. 10 bytes
2. 6 Unicode code points
3. 5 characters
4. 6 screen positions with a monospaced font (ↂ takes 2 positions)
1.) is returned by strlen, and 2.) can be determined with the <wchar.h> functions.
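For reference, here is a minimal sketch of how lengths 1.) and 2.) come out for the example string. It assumes the input is valid UTF-8 and counts code points by hand rather than going through the locale-dependent conversion functions. The names utf8_codepoints and sample are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the code points in a NUL-terminated string.
   Assumption: the input is valid UTF-8; no validation is done here. */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        /* Continuation bytes have the form 10xxxxxx;
           every other byte starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* The example string: U+00F8, space, U+2182, space, 'e', U+0308. */
const char sample[] = "\xc3\xb8 \xe2\x86\x82 e\xcc\x88";
```

With these definitions, strlen(sample) gives 10 and utf8_codepoints(sample) gives 6, matching the first two lengths in the list.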
But is there a portable way of determining 3.) and 4.)? I am not sure whether ↂ taking two cursor positions is defined font-independently for that code point or is a property of the font in use; “monospaced font” and “some characters take more than one position” seem somewhat contradictory to me. At least in Monospace, this character does cover two cursor positions. The Unicode chart U2150 doesn't say anything about cursor positions.
Lastly, can the number of positions be negative for any character (I mean a character that moves the cursor to the left in a left-to-right script, or vice versa)?