Depending on your notion of "character", this question can get more or less involved.
First off, you should transform your byte string into a string of Unicode codepoints. You can do this with `iconv()` or with ICU, though if this is all you need to do, `iconv()` is a lot easier, and it's part of POSIX.
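A minimal sketch of that first step, assuming UTF-8 input and a little-endian machine (the fixed buffer size and error handling are simplified for illustration):

```c
#include <iconv.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *utf8 = "caf\xC3\xA9";   /* "café" in UTF-8 */
    uint32_t codepoints[64] = {0};      /* zero-filled, so a terminator survives */

    /* "UTF-32LE" avoids a BOM; use "UTF-32BE" on a big-endian machine. */
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    char *in = (char *)utf8;
    size_t inleft = strlen(utf8);
    char *out = (char *)codepoints;
    size_t outleft = sizeof codepoints - sizeof codepoints[0]; /* reserve the terminator */

    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
        perror("iconv");
        iconv_close(cd);
        return 1;
    }
    iconv_close(cd);

    /* Bytes written divided by 4 gives the codepoint count. */
    size_t n = (sizeof codepoints - sizeof codepoints[0] - outleft) / sizeof codepoints[0];
    printf("%zu codepoints\n", n);      /* prints "4 codepoints" */
    return 0;
}
```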
Your string of Unicode codepoints could be something like a null-terminated `uint32_t[]` or, if you have C11, an array of `char32_t`. The length of that array (i.e. its number of elements, not its size in bytes) is the number of codepoints (plus the terminator), and that should give you a very good start.
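Counting is then a `strlen()`-style scan; here as a hypothetical `u32len()` helper (the name is mine, not a standard function):

```c
#include <stddef.h>
#include <uchar.h>   /* char32_t (C11) */

/* Counts the codepoints in a null-terminated UTF-32 string,
 * exactly the way strlen() counts the bytes in a char string. */
size_t u32len(const char32_t *s) {
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```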
However, the notion of a "printable character" is fairly complex, and you may prefer to count graphemes rather than codepoints. For instance, an `a` with a circumflex accent can be expressed as two Unicode codepoints (`a` followed by a combining `^`) or as the single precomposed codepoint `â`; both are valid, and the Unicode standard requires them to be treated as equivalent. There is a process called "normalization" which turns your string into a canonical form, but there are many graphemes that are not expressible as a single codepoint, so in general there is no way around a proper library that understands this and counts graphemes for you.
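ICU's break iterators do exactly that. A rough sketch with ICU's C API (ICU works on UTF-16 internally, hence the conversion; the fixed buffer and error handling are simplified):

```c
#include <stdio.h>
#include <unicode/ubrk.h>
#include <unicode/ustring.h>

/* Counts grapheme clusters ("user-perceived characters") in a UTF-8 string. */
static int32_t count_graphemes(const char *utf8) {
    UErrorCode status = U_ZERO_ERROR;
    UChar buf[256];
    int32_t len16 = 0;

    u_strFromUTF8(buf, 256, &len16, utf8, -1, &status);
    if (U_FAILURE(status)) return -1;

    UBreakIterator *bi = ubrk_open(UBRK_CHARACTER, NULL, buf, len16, &status);
    if (U_FAILURE(status)) return -1;

    int32_t count = 0;
    while (ubrk_next(bi) != UBRK_DONE)   /* one boundary per grapheme */
        count++;
    ubrk_close(bi);
    return count;
}

int main(void) {
    /* "a" followed by a combining circumflex: two codepoints, one grapheme. */
    printf("%d\n", count_graphemes("a\xCC\x82"));   /* prints 1 */
    return 0;
}
```

Compile with something like `cc graphemes.c $(pkg-config --cflags --libs icu-uc)`; the exact package name varies by platform.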
That said, it's up to you to decide how complex your scripts are and how thoroughly you want to treat them. Transforming into Unicode codepoints is a must; everything beyond that is at your discretion.
Don't hesitate to ask questions about ICU if you decide that you need it, but feel free to explore the vastly simpler `iconv()` first.