Iterate backwards through a utf8 multibyte string

Question

I use a slightly modified version of this function is_utf8 https://stackoverflow.com/a/1031773/275677 to extract UTF8 sequences from a character array, returning the sequence and how many bytes in it so that I can iterate over a string in this way.

However I would now like to iterate backwards over a string (char *). What is the best way to do this?

My guess is to try to classify the last four, three, two and one bytes of the string as utf8 (four times) and pick the longest.

However, is it ever the case that utf8 is ambigious? For example can aaaabb parsed as aaaa.bb also be parsed (backwards) as aa.aabb where aa, aaaa, bb and aabb are valid utf8 sequences?

For Windows I would use `MultiByteToWideChar()` to covert the string into Unicode and then do an easy iteration. — honk, Mar 07 '14 at 18:14

abligh · Accepted Answer · 2016-09-20T08:47:05.387

14

A string consists of a series of UTF-8 sequences. All UTF-8 sequences:

EITHER consist of exactly one octet (byte to you and me) with the top bit clear
OR consist of one octet with the two topmost bits set, followed by one or more octets with bit 7 set and bit 6 clear.

See http://en.wikipedia.org/wiki/Utf8#Description for details.

So what you need to do is to check whether the character concerned has bit 7 set and bit 6 clear, and if so step back one, taking care not to go beyond the start of the string (note that if the string is well formed, this won't happen).

Untested C-ish pseudocode:

char *
findPrevious (const char *ptr, const char *start)
{
    do
    {
        if (ptr <= start)
            return NULL; /* we're already at the start of the string */
        ptr--;
    } while ((*ptr & 0xC0) == 0x80);
    return ptr;
}

edited Sep 20 '16 at 08:47

answered Mar 07 '14 at 18:25

abligh

24,573
4
47
84

You got the second condition wrong. You mean bit 7 set and **bit 6** clear (assuming you count from 0). – R.. GitHub STOP HELPING ICE Mar 07 '14 at 19:24
@R.. doh - thanks - fixed. The pseudocode got that right I think. – abligh Mar 07 '14 at 19:28
+1 Ideas: 1) If the string is assumed to be well formed, the `(ptr <= start)` need not be in the loop, but before the loop. 2:) maybe `const char *`? – chux - Reinstate Monica Mar 07 '14 at 22:31

score 1 · Answer 2 · edited May 23 '17 at 12:10

This looks helpful. From the depths of memory I recalled that you can tell just by peaking at the first one or two bits of each byte - this makes it trivial to see how long the string should be, then use the original function just to validate it.

UTF-8 characters are either single bytes where the left-most-bit is a 0 or multiple bytes where the first byte has left-most-bit 1..10... (with the number of 1s on the left 2 or more) followed by successive bytes of the form 10... (i.e. a single 1 on the left). Assuming that your string is well-formed you can loop over all the bytes and increment your "character count" every time you see a byte that is not of the form 10... - i.e. counting only the first bytes in all UTF-8 characters.

https://stackoverflow.com/a/7108478/275677

also the table at http://en.wikipedia.org/wiki/UTF-8#Description

Iterate backwards through a utf8 multibyte string

2 Answers2