I use a slightly modified version of the function `is_utf8` from https://stackoverflow.com/a/1031773/275677 to extract UTF-8 sequences from a character array. It returns the sequence and the number of bytes in it, so that I can iterate forwards over a string one sequence at a time.
However, I would now like to iterate backwards over a string (`char *`). What is the best way to do this?
My guess is to try to classify the last four, three, two and one bytes of the string as UTF-8 (four attempts in total) and pick the longest one that is valid.
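To make the guess concrete, here is a minimal sketch of it. Both function names are hypothetical: `utf8_seq_len` is a simplified stand-in for the modified `is_utf8` (it only checks the lead byte and continuation bytes, and does not reject overlong encodings or surrogates), and `utf8_last_len` is the "try four, three, two, one" step.

```c
#include <stddef.h>

/* Hypothetical stand-in for the modified is_utf8: returns n if the n bytes
   at s form exactly one structurally valid UTF-8 sequence, otherwise 0.
   Simplified: overlong forms and surrogates are NOT rejected. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t need, i;

    if (n == 0) return 0;
    if (s[0] < 0x80)                need = 1;  /* 0xxxxxxx */
    else if ((s[0] & 0xE0) == 0xC0) need = 2;  /* 110xxxxx */
    else if ((s[0] & 0xF0) == 0xE0) need = 3;  /* 1110xxxx */
    else if ((s[0] & 0xF8) == 0xF0) need = 4;  /* 11110xxx */
    else return 0;                             /* continuation or invalid lead */

    if (need != n) return 0;
    for (i = 1; i < n; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;   /* must be 10xxxxxx */
    return n;
}

/* Length in bytes of the last sequence in s[0..len), trying the longest
   candidate first; returns 0 if no candidate is valid. */
static size_t utf8_last_len(const char *s, size_t len)
{
    size_t k;

    for (k = 4; k >= 1; k--) {
        if (k <= len &&
            utf8_seq_len((const unsigned char *)s + len - k, k) != 0)
            return k;
    }
    return 0;
}
```

Iterating backwards would then mean calling `utf8_last_len(s, len)` repeatedly and subtracting the result from `len` until it reaches zero, stopping early if 0 is returned.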
However, is UTF-8 ever ambiguous? For example, can `aaaabb`, parsed forwards as `aaaa.bb`, also be parsed (backwards) as `aa.aabb`, where `aa`, `aaaa`, `bb` and `aabb` are all valid UTF-8 sequences?