I'm writing a parser which parses UTF-8 strings. Characters outside of the ASCII range can only occur inside of string literals, which begin and end with `'` or `"`. The rest of the language may only contain ASCII characters, so I can simply return an error if I find a byte outside the ASCII range.
The problem I can't seem to figure out is: when I encounter a non-ASCII character inside of a string literal, how can I detect how many bytes to skip for that character? My concern is that if a multi-byte character contains a `'` or `"` as one of its bytes, my parser would end the string literal early.
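Here is a simplified sketch of the literal-scanning loop I'm picturing, with the spot I'm stuck on marked (again, placeholder names, not my actual code):

```c
#include <stddef.h>

/* Simplified sketch: scan a literal opened with `quote` (' or ").
 * Returns the literal's length including both quotes, or 0 if unterminated. */
static size_t scan_string_literal(const unsigned char *p, size_t len,
                                  unsigned char quote)
{
    size_t i = 1;             /* p[0] is the opening quote */
    while (i < len) {
        if (p[i] == quote)
            return i + 1;     /* closing quote found */
        if (p[i] >= 0x80) {
            /* ??? This is the part I can't figure out: how many bytes does
             * this character occupy, and could the quote check above fire
             * on one of its trailing bytes? */
            i += 1;           /* almost certainly wrong */
        } else {
            i += 1;           /* plain ASCII byte */
        }
    }
    return 0;                 /* ran out of input: unterminated literal */
}
```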
Perhaps a shorter way to ask this is: if I encounter a byte in the `0x80`-`0xFF` range, how can I detect how many bytes are in that character in a UTF-8 encoded string?
I'm writing this parser in C, but I suspect that doesn't matter.