
I'm writing a parser which parses UTF-8 strings. Characters outside of the ASCII range can only occur inside of string literals, which begin and end with ' or ". The rest of the language may only contain ASCII characters, so I can simply return an error if I find a byte outside the ASCII range.

The problem I can't seem to figure out is, when I encounter a non-ASCII character inside of a string literal, how can I detect how many bytes to skip for that character? My concern is that if a multi-byte character contains a ' or " as one of the bytes, my parser would end the string literal early.

Perhaps a shorter way to ask this is, if I encounter a byte in the 0x80-0xFF range, how can I detect how many bytes are in that character in a UTF-8 encoded string?

I'm writing this parser in C but I suspect that doesn't matter.

skomisa
kerkeslager
  • Why do you want to _skip_ the bytes? You'd need to actually decode them, won't you? And you'd also want to verify that the sequence is valid. Assuming that a supposed UTF-8 encoded untrusted string is actually valid UTF-8 is a common cause for vulnerabilities. – user17732522 May 07 '23 at 15:02
  • Does this answer your question? [How to correctly skip unicode (UTF-8) characters?](https://stackoverflow.com/questions/58046533/how-to-correctly-skip-unicode-utf-8-characters) – user17732522 May 07 '23 at 15:04
  • I understand that your question is independent of any specific programming language, but since you are using C I added a `[c]` tag to increase the audience for your question. Feel free to remove it. – skomisa May 07 '23 at 20:05
  • The first table in Wikipedia shows it: https://en.wikipedia.org/wiki/UTF-8#Encoding – Codo May 07 '23 at 21:18

1 Answer


> My concern is that if a multi-byte character contains a ' or " as one of the bytes, my parser would end the string literal early.

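Ah, this is your misunderstanding. The brilliance of UTF-8 is that this cannot happen. In UTF-8, the byte 0x27 can only mean APOSTROPHE, and the byte 0x22 can only mean QUOTATION MARK. Neither can ever be part of a multi-byte sequence. This is because every byte of a multi-byte sequence, lead byte and continuation bytes alike, has its high bit set to 1 and so falls in the range 0x80-0xFF, while ASCII characters always have the high bit clear.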

A major design goal of UTF-8 is that existing and naïve ASCII implementations will work identically when parsing UTF-8 streams, even if the stream includes non-ASCII bytes. You can safely scan for the opening " and continue to accumulate bytes until you reach the closing " (using \ to escape an internal "), and never have to worry about whether any multi-byte characters are involved. ASCII parsers do not need to understand UTF-8 or perform any UTF-8 decoding in order to work correctly.
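As a rough illustration of that byte-wise approach, here is a minimal sketch in C. The function name `find_closing_quote` and its signature are made up for this example, and it assumes a backslash escapes the byte that follows it. It does not validate that the bytes inside the literal are well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from any library): buf[start] must be the
 * opening quote (' or "). Returns the index of the matching closing
 * quote, or -1 if the literal is unterminated. */
long find_closing_quote(const unsigned char *buf, size_t len, size_t start)
{
    assert(buf != NULL && start < len);

    unsigned char quote = buf[start];

    for (size_t i = start + 1; i < len; i++) {
        if (buf[i] == '\\') {
            i++;                /* assumed escape rule: skip the next byte */
        } else if (buf[i] == quote) {
            return (long)i;     /* a real quote: 0x22/0x27 never occur
                                   inside a multi-byte sequence */
        }
        /* Bytes >= 0x80 need no special handling here. */
    }
    return -1;
}
```

Note that the loop never inspects the high bit at all: bytes in 0x80-0xFF simply fail both comparisons and are passed over.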

Beyond that, if you decide you really do want to know the answer to your question: the number of leading 1 bits in the first byte tells you the length of the sequence, with the exceptions that zero leading 1s means a 1-byte (ASCII) character and a single leading 1 marks a continuation byte.

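0x00 - 0x7F -> 1 byte
0x80 - 0xBF -> (continuation)
0xC0 - 0xDF -> 2 bytes
0xE0 - 0xEF -> 3 bytes
0xF0 - 0xF7 -> 4 bytes
0xF8 - 0xFF -> (never valid)

These ranges follow the bit pattern only. In well-formed UTF-8, the bytes 0xC0, 0xC1, and 0xF5-0xF7 never appear either, so a validating decoder must reject them too.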
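The table translates fairly directly into a lookup on the first byte. The following is a sketch only: the name `utf8_sequence_length` and its return conventions (0 for a continuation byte, -1 for a byte that cannot start a sequence) are assumptions made for this example. It classifies by bit pattern, as the table does, and does not check that the bytes that follow are valid continuation bytes.

```c
/* Hypothetical helper: classify a byte by its leading 1 bits.
 * Returns the sequence length (1-4) for a lead byte, 0 for a
 * continuation byte, and -1 for 0xF8-0xFF. */
int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;   /* 0xxxxxxx: ASCII */
    if (lead < 0xC0) return 0;   /* 10xxxxxx: continuation byte */
    if (lead < 0xE0) return 2;   /* 110xxxxx */
    if (lead < 0xF0) return 3;   /* 1110xxxx */
    if (lead < 0xF8) return 4;   /* 11110xxx */
    return -1;                   /* 11111xxx: not a UTF-8 lead byte */
}
```

A parser handling untrusted input would still need to verify that the expected number of continuation bytes actually follows, and reject overlong or out-of-range sequences.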

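If all you need is to get past a run of non-ASCII text, you can also just keep scanning along until you find something in the range 0x00-0x7F. If you need the start of the next character instead, scan until you reach a byte that is not a continuation byte (that is, not in 0x80-0xBF), since the next character may itself be multi-byte.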
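The scanning approach can be sketched in C as follows. The name `utf8_next_boundary` is hypothetical. It finds the next character boundary by skipping continuation bytes (those matching the bit pattern 10xxxxxx) and, like the rest of this answer, assumes rather than verifies that the input is well-formed.

```c
#include <stddef.h>

/* Hypothetical helper: given index i of the current byte, return the
 * index of the first byte of the next character, or len at the end of
 * the buffer. */
size_t utf8_next_boundary(const unsigned char *buf, size_t len, size_t i)
{
    if (i < len)
        i++;                                    /* step off the current byte */
    while (i < len && (buf[i] & 0xC0) == 0x80)
        i++;                                    /* skip 10xxxxxx bytes */
    return i;
}
```

The mask `0xC0` keeps the top two bits, so the comparison against `0x80` is true exactly for bytes in 0x80-0xBF.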

Remy Lebeau
Rob Napier