59

Let's say I have a string:

char theString[] = "你们好āa";

Given that my encoding is UTF-8, this string is 12 bytes long (the three hanzi characters are three bytes each, the Latin character with the macron is two bytes, and the 'a' is one byte):

strlen(theString) == 12

How can I count the number of characters? How can I do the equivalent of subscripting so that:

theString[3] == "好"

How can I slice and cat such strings?

jsj

10 Answers

31

You only count the bytes whose top two bits are not set to 10 (i.e., everything less than 0x80 or greater than 0xbf).

That's because all the bytes with the top two bits set to 10 are UTF-8 continuation bytes.

See here for a description of the encoding and how strlen can work on a UTF-8 string.

For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0 bit or a 11 sequence is the start of a UTF-8 code point; all others are continuation bytes.

Your best bet, if you don't want to use a third-party library, is to simply provide functions along the lines of:

utf8left (char *destbuff, char *srcbuff, size_t sz);
utf8mid  (char *destbuff, char *srcbuff, size_t pos, size_t sz);
utf8rest (char *destbuff, char *srcbuff, size_t pos);

to get, respectively:

  • the leftmost sz UTF-8 characters of a string.
  • the sz UTF-8 characters of a string, starting at pos.
  • the rest of the UTF-8 characters of a string, starting at pos.

This will be a decent building block to be able to manipulate the strings sufficiently for your purposes.
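
For instance, utf8left might be sketched as below (a minimal illustration, assuming destbuff is large enough and srcbuff holds valid UTF-8):

#include <stddef.h>

void utf8left(char *destbuff, char *srcbuff, size_t sz)
{
    size_t i = 0, chars = 0;

    while (srcbuff[i]) {
        /* a byte that is not a 10xxxxxx continuation starts a new character */
        if ((srcbuff[i] & 0xC0) != 0x80) {
            if (chars == sz)
                break;          /* we already copied sz characters */
            chars++;
        }
        destbuff[i] = srcbuff[i];
        i++;
    }
    destbuff[i] = '\0';
}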

paxdiablo
  • Yes it seems I have to implement a lot of this myself.. I have managed to implement a u_strlen and u_charAt in the last hour. Should be able to cut slices based on that. – jsj Sep 04 '11 at 09:47
  • Accepted because I did end up writing my own functions. – jsj Sep 04 '11 at 15:57
  • Note: this ignores grapheme clusters described in [UAX#29](http://www.unicode.org/reports/tr29/), i.e. "नि" is supposed to be seen as a single unit of text, but will give a length of 2 with the method in this answer. – AliciaBytes Nov 02 '16 at 20:42
20

Try this for size:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// returns the number of utf8 code points in the buffer at s
size_t utf8len(char *s)
{
    size_t len = 0;
    for (; *s; ++s) if ((*s & 0xC0) != 0x80) ++len;
    return len;
}

// returns a pointer to the beginning of the pos'th utf8 codepoint
// in the buffer at s
char *utf8index(char *s, size_t pos)
{    
    ++pos;
    for (; *s; ++s) {
        if ((*s & 0xC0) != 0x80) --pos;
        if (pos == 0) return s;
    }
    return NULL;
}

// converts codepoint indexes start and end to byte offsets in the buffer at s
void utf8slice(char *s, ssize_t *start, ssize_t *end)
{
    char *p = utf8index(s, *start);
    *start = p ? p - s : -1;
    p = utf8index(s, *end);
    *end = p ? p - s : -1;
}

// appends the utf8 string at src to dest
char *utf8cat(char *dest, char *src)
{
    return strcat(dest, src);
}

// test program
int main(int argc, char **argv)
{
    // slurp all of stdin to p, with length len
    char *p = malloc(0);
    size_t len = 0;
    while (true) {
        p = realloc(p, len + 0x10000);
        ssize_t cnt = read(STDIN_FILENO, p + len, 0x10000);
        if (cnt == -1) {
            perror("read");
            abort();
        } else if (cnt == 0) {
            break;
        } else {
            len += cnt;
        }
    }
    p[len] = '\0'; // NUL-terminate the buffer so the utf8 functions can find the end

    // do some demo operations
    printf("utf8len=%zu\n", utf8len(p));
    ssize_t start = 2, end = 3;
    utf8slice(p, &start, &end);
    printf("utf8slice[2:3]=%.*s\n", end - start, p + start);
    start = 3; end = 4;
    utf8slice(p, &start, &end);
    printf("utf8slice[3:4]=%.*s\n", end - start, p + start);
    return 0;
}

Sample run:

matt@stanley:~/Desktop$ echo -n 你们好āa | ./utf8ops 
utf8len=5
utf8slice[2:3]=好
utf8slice[3:4]=ā

Note that your example has an off-by-one error: theString[2] == "好"

Matt Joiner
  • By any chance do you know of any implementation of strlen() for combining characters? Like 'a' with an accent, for example, it should return 1, not 2 – Nulik Sep 26 '16 at 16:38
  • @Nulik: That sounds like utf8len; utf8len("ā") should return 1. – Matt Joiner Sep 27 '16 at 03:44
  • Are you sure the example in the question has an off by one error? 好 is two bytes long, but defining a string like that always adds a null character at the end, so 3 is correct, I believe. – iFreilicht Aug 21 '20 at 13:06
  • Does this code cover all valid UTF-8 or just a subset? –  Feb 09 '21 at 19:50
  • @RichardMcFriendOluwamuyiwa I believe it should work on all UTF-8 – Matt Joiner Feb 11 '21 at 10:59
17

The easiest way is to use a library like ICU.
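
For instance (a minimal sketch using ICU4C's C API; the buffer size is illustrative), counting code points could look like:

#include <stdio.h>
#include <unicode/ustring.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    UChar buf[64];
    int32_t len;

    /* convert UTF-8 to ICU's internal UTF-16, then count code points */
    u_strFromUTF8(buf, 64, &len, "你们好āa", -1, &status);
    if (U_FAILURE(status)) return 1;

    printf("%d code points\n", (int)u_countChar32(buf, len)); /* prints 5 */
    return 0;
}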

mmmmmm
  • @Mark.. I asked a couple of questions about ICU. People mostly replied that it was unnecessary for simple operations. http://stackoverflow.com/questions/7294447/how-to-get-started-with-icu-in-c – jsj Sep 04 '11 at 08:29
  • @trideceth12: in many cases, you actually want to access grapheme clusters, not characters; and implementing that from scratch is far more involved than just decoding UTF-8, so using a library might be a good idea – Christoph Sep 04 '11 at 09:05
  • @Christoph: Indeed so! And the ICU regex library support full Unicode extended grapheme clusters via the `\X`, making these things easy. That said, there are chunks of C code that do it all for themselves, like `vim` — however, that seems to use something more like `\PM\pM*`, and also is stuck working only on the BMP. Sigh. – tchrist Sep 06 '11 at 18:24
9

Depending on your notion of "character", this question can get more or less involved.

First off, you should transform your byte string into a string of unicode codepoints. You can do this with iconv() or ICU, though if this is the only thing you need, iconv() is a lot easier, and it's part of POSIX.

Your string of unicode codepoints could be something like a null-terminated uint32_t[], or if you have C1x, an array of char32_t. The size of that array (i.e. its number of elements, not its size in bytes) is the number of codepoints (plus the terminator), and that should give you a very good start.

However, the notion of a "printable character" is fairly complex, and you may prefer to count graphemes rather than codepoints. For instance, an a with an accent ^ can be expressed as two unicode codepoints, or as a single legacy codepoint â; both are valid, and both are required by the unicode standard to be treated identically. There is a process called "normalization" which turns your string into a canonical form, but there are many graphemes which are not expressible as a single codepoint, so in general there is no way around a proper library that understands this and counts graphemes for you.

That said, it's up to you to decide how complex your scripts are and how thoroughly you want to treat them. Transforming into unicode codepoints is a must, everything beyond that is at your discretion.

Don't hesitate to ask questions about ICU if you decide that you need it, but feel free to explore the vastly simpler iconv() first.
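
Here is a minimal sketch of the iconv() route (the "UCS-4LE" encoding name and the buffer size are illustrative and may vary between iconv implementations):

#include <iconv.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char theString[] = "你们好āa";
    uint32_t codepoints[64];

    /* convert UTF-8 to fixed-width 32-bit code points */
    iconv_t cd = iconv_open("UCS-4LE", "UTF-8");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    char *in = theString;
    size_t inleft = strlen(theString);
    char *out = (char *)codepoints;
    size_t outleft = sizeof codepoints;

    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
        perror("iconv");
        return 1;
    }
    iconv_close(cd);

    /* each array element now holds exactly one code point */
    size_t n = (sizeof codepoints - outleft) / sizeof codepoints[0];
    printf("%zu code points\n", n); /* prints 5 */
    return 0;
}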

Kerrek SB
3

In the real world, theString[3]=foo; is not a meaningful operation. Why would you ever want to replace a character at a particular position in the string with a different character? There's certainly no natural-language-text processing task for which this operation is meaningful.

Counting characters is also unlikely to be meaningful. How many characters (for your idea of "character") are there in "á"? How about "á"? Now how about "གི"? If you need this information for implementing some sort of text editing, you're going to have to deal with these hard questions, or just use an existing library/gui toolkit. I would recommend the latter unless you're an expert on world scripts and languages and think you can do better.

For all other purposes, strlen tells you exactly the piece of information that's actually useful: how much storage space a string takes. This is what's needed for combining and separating strings. If all you want to do is combine strings or separate them at a particular delimiter, snprintf (or strcat if you insist...) and strstr are all you need.
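
For example (a small sketch; the strings and the delimiter are just illustrative), byte-oriented functions handle UTF-8 text without any special treatment:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[64];

    /* combining: snprintf concatenates UTF-8 byte-wise */
    snprintf(buf, sizeof buf, "%s%s", "你们", "好āa");
    printf("combined: %s\n", buf);

    /* separating: strstr finds a delimiter byte-wise */
    char text[] = "你们好āa,hello";
    char *comma = strstr(text, ",");
    if (comma) {
        *comma = '\0';
        printf("left: %s, right: %s\n", text, comma + 1);
    }
    return 0;
}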

If you want to perform higher-level natural-language-text operations, like capitalization, line breaking, etc. or even higher-level operations like pluralization, tense changes, etc. then you'll need either a library like ICU or respectively something much higher-level and linguistically-capable (and specific to the language(s) you're working with).

Again, most programs do not have any use for this sort of thing and just need to assemble and parse text without any considerations to natural language.

R.. GitHub STOP HELPING ICE
  • @R The use is converting pinyin in numeral form (ni2hao3ma5) into pinyin with accents. I have written my own functions now, based on the inherent meaning in the first byte of a unicode codepoint. It's a bit clunky but it does the job without the need to include a heavy library. – jsj Sep 04 '11 at 15:56
  • @trideceth12: I did that same thing myself once. It was just a couple of lines of Perl. Really. – tchrist Sep 06 '11 at 18:26
  • I would argue that you almost never want to know how much "storage" there is, and what you really want when you're talking length is "characters", not bytes. Look at string processing: your code would be broken on UTF-8/UTF-16 if you cannot answer queries like length in terms of graphemes. If you do not care about Unicode, and encode things in ASCII or UTF-32, then yes, maybe it's irrelevant for you. –  May 24 '14 at 01:33
  • Graphemes or characters are only relevant to visual display (and sometimes, editing). That's 1% of what you do with strings, and usually isolated to GUI toolkit libraries. Everything else done with strings is completely agnostic and only cares (in C, where storage is explicit) about the storage requirements for the string. In other languages where storage is not explicit, you shouldn't even care about that. – R.. GitHub STOP HELPING ICE May 24 '14 at 14:37
1

size_t utf8_strlen(const char *s)
{
    size_t i = 0, j = 0;
    while (s[i]) {
        if ((s[i] & 0xC0) != 0x80)
            j++;
        i++;
    }
    return j;
}

This will count the characters in a UTF-8 string (found in this article: Even faster UTF-8 character counting).

However, I'm still stumped on slicing and concatenating?!?

jsj
  • You really, really do want to use a wide string type. This is simply not an application where you can put a premium on conserving memory. We're talking about bytes on systems that have gigabytes to go around, anyway. You don't have random-access to characters in a UTF-8 encoding. UTF-8 is better suited as a storage/serialization format. But just FWIW, concatenation works "directly", as long as you don't have to worry about BOMs; treat the bytes as bytes. "slicing" needs to be better defined. – Karl Knechtel Sep 04 '11 at 08:49
  • Slicing and concatenating would then be just a search operation, surely? Linear search in the most obvious implementation. I'm with those that don't see any real benefit in avoiding wchar_t though, to be honest. – Tommy Sep 04 '11 at 08:50
  • @Karl: taking grapheme clusters into account, even UTF-32 often has to be treated as a variable-length coding... – Christoph Sep 04 '11 at 09:10
1

In general, we should use a different data type for unicode characters.

For example, you can use the wide char data type

wchar_t theString[] = L"你们好āa";

Note the L prefix, which indicates that the string is composed of wide chars.

The length of that string can be calculated using the wcslen function, which behaves like strlen.
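
For example (a minimal sketch; the count of 5 assumes a platform such as Linux where wchar_t is 32 bits and holds one code point per element):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t theString[] = L"你们好āa";

    /* wcslen counts wchar_t elements, not bytes */
    printf("%zu\n", wcslen(theString)); /* prints 5 */
    return 0;
}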

abahgat
  • Except that wide chars are all 4 bytes each.. so "hello world" is 44 bytes instead of 11 bytes, and "大家,你们好" is 24 bytes instead of 18 bytes. – jsj Sep 04 '11 at 08:40
  • Well, that is generally left to the implementation (in some cases they can be 2 bytes long), but I can see your point here. – abahgat Sep 04 '11 at 08:45
  • @abahgat: that `wchar_t` doesn't necessarily use UTF-32 (ie the 2-byte case) makes this solution unportable... – Christoph Sep 04 '11 at 09:07
  • Summary: wchar_t is **NOT** Unicode, because sizeof(wchar_t) is compiler-dependent – user411313 Sep 04 '11 at 11:03
  • @user411312, it can be used for storing unicode characters, but the encoding is an implementation detail, note that the unicode character set is not fixed to any encoding – Sebastian Sep 04 '11 at 11:28
  • @user411312 wchar_t is UTF-32 for GCC (at least on unixoid systems) and UTF-16 on windows/msvc - so for the most popular systems wchar_t _is_ (some) Unicode – mbx Sep 04 '11 at 11:34
1

One thing that's not clear from the above answers is why it's not simple. Each character is encoded in one way or another - it doesn't have to be UTF-8, for example - and each character may have multiple encodings, with varying ways to handle combining of accents, etc. The rules are really complicated, and vary by encoding (e.g., UTF-8 vs. UTF-16).

This question has enormous security concerns, so it is imperative that this be done correctly. Use an OS-supplied library or a well-known third-party library to manipulate unicode strings; don't roll your own.

Steve Dispensa
0

I did a similar implementation years back, but I do not have the code with me.

For each unicode character, the first byte describes the number of bytes that follow it to make up the character, so based on the first byte you can determine the length of each unicode character.

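As a sketch of that rule (the helper name here is hypothetical, not from any particular library), the sequence length can be read off the leading byte like this:

static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1; /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return -1;                           /* 10xxxxxx continuation or invalid byte */
}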

Senthil
-1

A sequence of code points can constitute a single syllable/letter/character in many non-Western-European languages (e.g., all Indic languages).

So, when you are counting the length or finding a substring (there are definitely use cases for finding substrings - say, playing a hangman game), you need to advance syllable by syllable, not code point by code point.

So the definition of the character/syllable and where you actually break the string into "chunks of syllables" depend upon the nature of the language you are dealing with. For example, the pattern of the syllables in many Indic languages (Hindi, Telugu, Kannada, Malayalam, Nepali, Tamil, Punjabi, etc.) can be any of the following:

V  (Vowel in their primary form appearing at the beginning of the word)
C (consonant)
C + V (consonant + vowel in their secondary form)
C + C + V
C + C + C + V

You need to parse the string and look for the above patterns to break the string and to find the substrings.

I do not think it is possible to have a general-purpose method which can magically break strings in the above fashion for any unicode string (or sequence of code points), as the pattern that works for one language may not be applicable for another.

I guess there may be some methods/libraries that can take some definition/configuration parameters as input to break unicode strings into such syllable chunks. Not sure, though! I would appreciate it if someone could share how they solved this problem using any commercially available or open-source methods.
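
One open-source option is ICU's character break iterator, which walks extended grapheme clusters - these approximate user-perceived characters but are not exactly the syllable patterns above, so treat this as a hedged sketch rather than a drop-in solution:

#include <stdio.h>
#include <unicode/ubrk.h>
#include <unicode/ustring.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    UChar text[64];
    int32_t len;

    /* convert the UTF-8 input to ICU's UTF-16 representation */
    u_strFromUTF8(text, 64, &len, "你们好āa", -1, &status);

    /* walk grapheme-cluster (user-perceived character) boundaries */
    UBreakIterator *it = ubrk_open(UBRK_CHARACTER, "", text, len, &status);
    if (U_FAILURE(status)) return 1;

    int count = 0;
    ubrk_first(it);
    while (ubrk_next(it) != UBRK_DONE)
        count++;

    printf("%d grapheme clusters\n", count); /* prints 5 */
    ubrk_close(it);
    return 0;
}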

SRKJ