4

I want to know whether there is a simple way to determine the number of characters in a UTF-8 string. For example, on Windows it can be done by:

  1. converting the UTF-8 string to a wchar_t string
  2. calling wcslen on the result

But I need a simpler, cross-platform solution.

Thanks in advance.

Taryn East
akmal
  • Use a cross-platform library. Like ICU. Beware of the difference between characters (returned by wcslen) and codepoints. – Hans Passant Aug 18 '11 at 14:07
  • I need a function that gives me the length of a UTF-8 string, so I don't think it is a good idea to add a whole library just to use one function. – akmal Aug 18 '11 at 14:15
  • Well, copy and paste the code then. It is open source, you can do with it what you want. I doubt you'll enjoy the typical macro soup needed to do things cross-platform. – Hans Passant Aug 18 '11 at 14:27
  • Have you considered using a language and/or library that provides robust Unicode support? Rolling your own seems like a recipe for disaster. – tchrist Aug 19 '11 at 02:45
  • @tchrist, no. I assume that all incoming paths are in UTF-8, and within the library I'm using the string functions (string.h in C) to manipulate these paths. One thing I want to check is the path's length. That's why I asked this question. Is my approach right? If there is a better solution, please let me know. – akmal Aug 19 '11 at 04:27

3 Answers

5

UTF-8 characters are encoded either as a single byte whose left-most bit is 0, or as multiple bytes: a first byte of the form 1..10... (two or more 1 bits on the left, then a 0) followed by successive bytes of the form 10... (i.e. a single 1 on the left). Assuming that your string is well-formed, you can loop over all the bytes and increment your "character count" every time you see a byte that is not of the form 10..., i.e. counting only the first byte of each UTF-8 character.
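The loop described above can be sketched in C as follows. This is a minimal sketch, not a definitive implementation: the function name `utf8_count_codepoints` is made up for illustration, the input is assumed to be NUL-terminated and well-formed UTF-8, and what it counts is code points (one per non-continuation byte), which is not necessarily what a user perceives as characters.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated string that is assumed to be
 * well-formed UTF-8. Every byte that is NOT a continuation byte
 * (bit pattern 10xxxxxx) starts a new code point.
 * The function name is hypothetical. */
size_t utf8_count_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        /* Cast to unsigned char: plain char may be signed. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Masking with 0xC0 keeps the top two bits, so the test compares them against the 10 prefix that marks a continuation byte. The cast means the function can take an ordinary `char *` regardless of whether `char` is signed on the platform.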

borrible
  • Great idea! Thanks;). But is it the only way? – akmal Aug 18 '11 at 13:52
  • @akmal - I am sure that there are alternative solutions, but this is pretty straightforward and cross-platform, as requested. You tagged the question C, and the original C standards pre-date the Unicode standards by several years. As such, there are no built-in C functions for Unicode across all versions of C. – borrible Aug 18 '11 at 13:55
  • Your solution is a bit confusing for me.. are you talking about the BOM? If not, you can't deduce whether a character takes 1, 2 or more bytes... and.. your bytes and left-most-bytes stuff is even more confusing...1 byte = 8 bits, right? – duedl0r Aug 18 '11 at 15:01
  • 1
    @duedl0r - the first byte (8 bits) of a UTF-8 sequence tells you the number of bytes in the entire sequence. As such, by only counting those bytes you can tell the number of characters (characters not bytes) in the stream. – borrible Aug 19 '11 at 10:03
  • @borrible: this sounds strange to me, the number of bytes of a sequence in the first byte??? do you have a link which supports this claim? – duedl0r Aug 19 '11 at 11:53
  • @duedl0r - The definition of UTF-8 is very clear on this point. Take a look, for example, at the "design" section of the wikipedia page http://en.wikipedia.org/wiki/UTF-8 or just read the unicode standard. – borrible Aug 19 '11 at 11:55
  • @borrible: I see what you mean, then again: 1 byte = 8 bits. You talk about left-most-byte but mean left-most-bit. It's a bit confusing.. – duedl0r Aug 19 '11 at 12:01
  • @duedl0r - Ah, I see where the confusion has arisen. I've edited the answer to clarify. – borrible Aug 19 '11 at 12:03
4

The entire concept of a "number of characters" does not really apply to Unicode, as codes do not map 1:1 to glyphs. The method proposed by @borrible is fine if you want to establish storage requirements in uncompressed form, but that is all that it can tell you.

For example, there are code points like the "zero width space", which do not take up space on the screen when rendered, but occupy a code point, or modifiers for diacritics or vowels. So any statistic would have to be specific to the concrete application.
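To make the mismatch concrete, here is a rough C sketch that counts code points while skipping two kinds of code point that do not occupy a cell of their own: the combining diacritical marks in U+0300..U+036F and the zero width space U+200B. The function names are hypothetical, the input is assumed to be well-formed UTF-8, and the skip list is deliberately tiny. It is an illustration of why the counts diverge, not grapheme segmentation as defined by Unicode.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point and advance *p past it. Assumes well-formed
 * UTF-8; no validation beyond stopping at a non-continuation byte. */
static uint32_t utf8_next(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t cp;
    int extra;

    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }
    s++;
    while (extra-- > 0 && (*s & 0xC0) == 0x80)
        cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);
    *p = s;
    return cp;
}

/* Hypothetical helper: count code points, ignoring combining marks in
 * U+0300..U+036F and the zero width space U+200B. Illustrative only. */
size_t utf8_count_rough_visible(const char *str)
{
    const unsigned char *p = (const unsigned char *)str;
    size_t visible = 0;

    while (*p != '\0') {
        uint32_t cp = utf8_next(&p);
        if ((cp >= 0x0300 && cp <= 0x036F) || cp == 0x200B)
            continue;   /* takes a code point but no cell of its own */
        visible++;
    }
    return visible;
}
```

With this sketch, "e" followed by U+0301 (bytes 65 CC 81) is two code points but yields 1, and "a", U+200B, "b" is three code points but yields 2. A real application would need the full Unicode character properties, which is where a library such as ICU comes in.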

A proper Unicode renderer will have a function that can tell you how many pixels will be used for rendering a string if that information is what you're after.

Simon Richter
  • I will use this function to determine whether a path's length is valid. I assume that the maximum path length is MAX_PATH characters (not bytes) on Windows and PATH_MAX characters on Linux. Is that correct? – akmal Aug 18 '11 at 14:09
  • 1
    @akmal: that comes under the heading of "storage requirements in uncompressed form", since a Windows filename doesn't care about combining characters or even surrogate pairs, it's really just a series of 16-bit values. But beware that if there are characters outside the BMP in your UTF-8 input, then `MAX_PATH` Unicode code points would convert to more than `MAX_PATH` TCHAR "characters". Posix `PATH_MAX` is the number of chars (bytes), not the number of any kind of Unicode character. – Steve Jessop Aug 18 '11 at 14:55
  • If you are doing this to check against path lengths, I'm not sure it's a good idea. The only purpose of `PATH_MAX` is to ensure the kernel does not have to deal with unboundedly large path searches, which could be a DoS attack. Rather than trying to enforce `PATH_MAX` in your application, why not just let the kernel reject paths it doesn't like?? – R.. GitHub STOP HELPING ICE Aug 18 '11 at 14:56
  • Wrt the PATH_MAX issue, see also http://stackoverflow.com/questions/7106911/problem-with-handling-path-length (by the same user, it appears). – janneb Aug 18 '11 at 15:06
  • @R.. Well, but will it work correctly if, for example, the path length is about 500 characters while MAX_PATH (on Windows) is 260 characters? – akmal Aug 19 '11 at 04:22
  • @akmal: the Windows documentation is infuriating, it says something like "the file name is limited to MAX_PATH characters", when it should say one of (1) "if the file name is longer than MAX_PATH, then behavior is undefined", (2) "if longer than MAX_PATH, then an error is returned", (3) "if longer than MAX_PATH, then an error may be returned otherwise the operation proceeds as normal". I believe the actual behaviour of the Windows file functions is (3), in that normally you get an error but you can use longer unicode file names with the magic \\?\ prefix. – Steve Jessop Aug 19 '11 at 09:18
  • “Number of characters” is meaningful in Unicode; it just doesn't mean what most people might naïvely expect it to mean, and is less useful than they might think. – Stuart Cook Aug 21 '11 at 11:13
  • @StuartCook No, “number of characters” is not meaningful in Unicode. You can programmatically count code points, several of which may be needed to compose an extended grapheme cluster; and you can programmatically count extended grapheme clusters, which comprise one or more code points. But that’s it; those are the two sorts of units your program can chop things up into. There is no “character” that is usefully and rigorously defined in any portion of the Unicode Standard that I am aware of. Please cite a reference if you think otherwise. – tchrist Nov 16 '11 at 03:14
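Steve Jessop's point in the comments above, that code points outside the BMP turn into two 16-bit values on Windows, can be sketched in C without converting the string. The function name `utf8_utf16_units` is hypothetical and the input is assumed to be well-formed, NUL-terminated UTF-8. A 4-byte UTF-8 sequence (lead byte 0xF0 or higher) encodes a code point above U+FFFF, which UTF-16 stores as a surrogate pair, so it is counted twice.

```c
#include <stddef.h>

/* Number of UTF-16 code units (16-bit values) that a well-formed,
 * NUL-terminated UTF-8 string would occupy after conversion.
 * The function name is hypothetical. */
size_t utf8_utf16_units(const char *s)
{
    size_t units = 0;
    for (; *s != '\0'; s++) {
        unsigned char b = (unsigned char)*s;
        if ((b & 0xC0) == 0x80)
            continue;                 /* continuation byte: already counted */
        units += (b >= 0xF0) ? 2 : 1; /* 4-byte sequence -> surrogate pair */
    }
    return units;
}
```

This is the quantity to compare against a limit expressed in 16-bit units; for a limit expressed in bytes, such as POSIX `PATH_MAX`, plain `strlen` already gives the relevant number.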
2

If the string is known to be valid UTF-8, simply take the length of the string in bytes, excluding bytes whose values are in the range 0x80-0xbf:

size_t i, cnt;
/* Count every byte except continuation bytes (0x80-0xbf). */
for (cnt = i = 0; s[i]; i++)
    if (s[i] < 0x80 || s[i] > 0xbf) cnt++;

Note that s must point to an array of unsigned char in order for the comparisons to work. With plain char, which may be signed, bytes at or above 0x80 would compare as negative values.
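When the input is not known to be valid, one option is to check the byte structure while counting. The following is a sketch under stated assumptions rather than a complete validator: the name `utf8_count_checked` and the `(size_t)-1` error convention are made up for illustration, and it only verifies that each lead byte is followed by the right number of continuation bytes. It does not reject every overlong form, encoded surrogates, or values above U+10FFFF.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated string, returning (size_t)-1
 * if the lead/continuation byte structure is broken. Structural check
 * only; name and error convention are hypothetical. */
size_t utf8_count_checked(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t count = 0;

    while (*s != '\0') {
        int extra;
        if (*s < 0x80)                     extra = 0;
        else if (*s >= 0xC2 && *s <= 0xDF) extra = 1;
        else if (*s >= 0xE0 && *s <= 0xEF) extra = 2;
        else if (*s >= 0xF0 && *s <= 0xF4) extra = 3;
        else return (size_t)-1;   /* stray continuation or invalid lead byte */
        s++;
        while (extra-- > 0) {
            if ((*s & 0xC0) != 0x80)
                return (size_t)-1; /* missing continuation (or end of string) */
            s++;
        }
        count++;
    }
    return count;
}
```

Because the terminating NUL is not a continuation byte, a sequence truncated at the end of the string is reported as an error instead of being read past.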

R.. GitHub STOP HELPING ICE