5

How do you count Unicode characters in a UTF-8 file in C++? Perhaps someone would be so kind as to show me a "stand alone" method or, alternatively, a short example using http://icu-project.org/index.html.

EDIT: An important caveat is that I need to build counts of each character, so it's not like I'm counting the total number of characters, but the number of occurrences of a set of characters.

Dervin Thunk
  • 3
    Do you want to count characters or codepoints? Based on your edit, it also sounds like you're going to care about normalization as well. All the answers (as of this writing) are with regards to counting codepoints. – Logan Capaldo Aug 27 '10 at 18:16
  • @Logan: What do you mean by "normalization"? – Dervin Thunk Aug 27 '10 at 18:18
  • 2
    Logan is right. Link: http://unicode.org/reports/tr15/ – Hans Passant Aug 27 '10 at 18:35
  • 3
    @Dervin Thunk It's possible to "spell" the same logical character multiple ways. For example, there may be a single codepoint for a "latin lowercase a with accent", or you can use multiple codepoints using the idea of combining characters: "latin lowercase a", "with an accent". Normalization is the idea that you pick one of these two (or possibly more) representations of the character as the canonical one, and before you start counting you go through and make sure all accented a's in your string use that single representation. – Logan Capaldo Aug 27 '10 at 18:48
  • related: http://stackoverflow.com/questions/1206690/how-to-print-the-unicode-characters-in-hexadecimal-codes-in-c and http://stackoverflow.com/questions/3579557/counting-characters-again and http://stackoverflow.com/questions/55641/unicode-in-c and http://stackoverflow.com/questions/114611/what-is-the-best-unicode-library-for-c and http://stackoverflow.com/questions/2327953/unicode-generally-working-with-it-in-c –  Aug 27 '10 at 18:56
  • Thanks, Logan. You're absolutely right. I think my best bet is to go with the ICU library. – Dervin Thunk Aug 27 '10 at 19:11

5 Answers

11

In UTF-8, a non-leading byte always has the top two bits set to 10, so just ignore all such bytes. If you don't mind extra complexity, you can do more than that (to skip ahead across non-leading bytes based on the bit pattern of a leading byte) but in reality, it's unlikely to make much difference except for short strings (because you'll typically be close to the memory bandwidth anyway).
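The skip-ahead variant mentioned above could look roughly like the following. This is only a sketch: the function names are made up for illustration, and it assumes the input is well-formed UTF-8 (a stray continuation byte is simply counted as one unit).

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by lead byte `b`.
// Assumes well-formed input; anything unexpected is treated as length 1.
inline std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                          // continuation or invalid byte
}

// Counts code points by jumping from one lead byte to the next,
// instead of inspecting every byte.
inline std::size_t count_code_points_skipping(const std::string& s) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        ++count;
        i += utf8_sequence_length(static_cast<unsigned char>(s[i]));
    }
    return count;
}
```

For text that is mostly ASCII this does about the same work as the byte-by-byte test, which is why it rarely pays off.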

Edit: I originally misread your question as simply asking how to count the length of a string of characters encoded in UTF-8. If you want to count character frequencies, you probably want to convert the text to UTF-32/UCS-4 first; then you'll need some sort of sparse array to count the frequencies.
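As a rough sketch of that idea, the following decodes each UTF-8 sequence to a 32-bit code point and uses a std::map as the sparse array. The function name is hypothetical, and the decoder assumes valid UTF-8 (it does no validation and performs no normalization).

```cpp
#include <cstddef>
#include <map>
#include <string>

// Decodes UTF-8 to code points and tallies how often each one occurs.
// Assumption: the input is well-formed UTF-8; nothing is validated.
inline std::map<char32_t, std::size_t>
count_code_point_frequencies(const std::string& utf8) {
    std::map<char32_t, std::size_t> freq;  // sparse: only code points seen
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80)      { cp = lead;        len = 1; }  // 0xxxxxxx
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }  // 110xxxxx
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }  // 1110xxxx
        else                  { cp = lead & 0x07; len = 4; }  // 11110xxx
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        ++freq[cp];
        i += len;
    }
    return freq;
}
```

A std::unordered_map would work just as well if you don't need the results ordered by code point.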

The hard part of this deals with counting code points vs. characters. For example, consider the character "À" -- the "Latin capital letter A with grave". There are at least two different ways to produce this character. You can use codepoint U+00C0, which encodes the whole thing in a single code point, or you can use codepoint U+0041 (Latin capital letter A) followed by codepoint U+0300 (Combining grave accent).
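To make the difference concrete, here is a small sketch with both spellings of "À" written out as explicit UTF-8 bytes. The helper and variable names are mine, not from any library; the counter just skips continuation bytes.

```cpp
#include <cstddef>
#include <string>

// Counts code points by skipping continuation bytes (10xxxxxx).
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        n += (c & 0xC0) != 0x80;
    return n;
}

// Two spellings of the same visible character "À".
const std::string precomposed = "\xC3\x80";   // U+00C0
const std::string decomposed  = "A\xCC\x80";  // U+0041 followed by U+0300
```

Here count_code_points(precomposed) gives 1 and count_code_points(decomposed) gives 2, and the two strings do not compare equal byte for byte, so a naive frequency table would file them under different keys.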

Normalizing (with respect to Unicode) means turning all such characters into the same form. You can either combine them all into a single code point, or separate them all into separate code points. For your purposes, it's probably easier to combine them into a single code point whenever possible. Writing this on your own probably isn't very practical -- I'd use the normalizer API from the ICU project.

Community
Jerry Coffin
8

If you know the UTF-8 sequence is well formed, it's quite easy. Count each byte that starts with a zero bit or with two one bits. The first condition will catch every code point that is represented by a single byte; the second will catch the first byte of each multi-byte sequence.

while (*p != 0)
{
    if ((*p & 0x80) == 0 || (*p & 0xc0) == 0xc0)
        ++count;
    ++p;
}

Or alternatively as remarked in the comments, you can simply skip every byte that's a continuation:

while (*p != 0)
{
    if ((*p & 0xc0) != 0x80)
        ++count;
    ++p;
}

Or if you want to be super clever and make it a 2-liner:

for (; *p != 0; ++p)
    count += ((*p & 0xc0) != 0x80);

The Wikipedia page for UTF-8 clearly shows the patterns.

Mark Ransom
3

A discussion with a full routine written in C++ is at http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html

bright
3

I know it's late for this thread, but this could help.

With ICU, I did it like this:

string theString = "blabla";
UnicodeString uStr = UnicodeString::fromUTF8(theString.c_str());
// length() counts UTF-16 code units; countChar32() counts code points
cout << "length = " << uStr.countChar32() << endl;
Overnuts
0

I wouldn't consider this a language-centric question. The UTF-8 format is fairly simple; decoding it from a file should be only a few lines of code in any language.

open file
until eof
    if (file.readchar & 0xC0) != 0x80
        increment count
close file
P Daddy