
I am trying to write a program that takes a file as input, iterates over its contents, and checks whether the file contains UTF-8 encoded characters.

However, I am unsure how to approach the problem of UTF-8 encoding. I understand the basic concept: a character is stored in 1-4 bytes, where the 1-byte form is just the ASCII representation (0-127).

1 byte: 0xxxxxxx

For the remainder I believe the pattern to be as such:

2 bytes: 110xxxxx 10xxxxxx

3 bytes: 1110xxxx 10xxxxxx 10xxxxxx

4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

However, I struggle to see how to implement this in C. I know how I would iterate over the file and do something if the UTF-8 predicate holds:

while ((check = fgetc(fp)) != EOF) {
    if (/* UTF-8 predicate goes here */) {
        // do something to the code
    }
}

However, I am unsure how to actually implement the UTF-8 check in C (or in any language that lacks a built-in facility for this, such as C#'s UTF8Encoding).

As a simple example, similar logic for ASCII would just have me iterate over each character (held in the check variable) and verify that it is within the ASCII range:

if (check >= 0 && check <= 127) {
    // do something to the code
}

Can anyone explain how I would apply similar logic to determine whether the check variable holds part of a UTF-8 encoded character instead?

Jonathan Leffler
NewDev90
  • 2
    `if (ch&0xe0==0xc0){...one byte will follow...}` et cetera... – wildplasser Sep 12 '19 at 18:55
  • Just decode the codepoints. And then check they had the shortest possible encoding. – Deduplicator Sep 12 '19 at 18:58
  • @wildplasser Can you elaborate on the interior of $(ch&xe0==0xc0)$ I'm not entirely sure how to understand this? – NewDev90 Sep 12 '19 at 19:11
  • Literally it means `AND` the contents of `ch` with `0b11100000` (`0xE0`) and test for equality with `0b11000000` (`0xC0`). – ryyker Sep 12 '19 at 19:19
  • If the writer chose UTF-8 then the file contains UTF-8 encoded text. Read it with UTF-8. Simple as that. – Tom Blodget Sep 12 '19 at 23:20
  • Look at [Really Good, Bad UTF-8 Example Test Data](https://stackoverflow.com/questions/1319022/really-good-bad-utf-8-example-test-data) — there is information there about what makes code invalid as UTF-8, and if the data you're analyzing violates the rules for UTF-8 (e.g. it contains a byte 0xC0, 0xC1, 0xF5..0xFF), then it is definitively not UTF-8. There are also sequencing rules — lots of sequences of bytes are invalid as UTF-8. – Jonathan Leffler Sep 13 '19 at 00:19
  • 2
    @wildplasser — may I presume you meant `if ((ch & 0xE0) == 0xC0)`, where the issue is the extra parentheses rather than the capitalization or spacing. As it stands, the code in `{…}` will not be executed because `0xE0` does not equal `0xC0`, so the RHS of the `&` is 0, so the result of `ch & 0` is 0. – Jonathan Leffler Sep 13 '19 at 00:26
  • @JonathanLeffler you are right, of course. I just stepped into the #dmr trap. – wildplasser Sep 13 '19 at 21:21

2 Answers

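if ( (ch & 0x80) == 0x00 ) {
  // 0xxxxxxx: ASCII byte
}
else if ( (ch & 0xe0) == 0xc0 ) {
  // 110xxxxx: lead byte of a 2-byte sequence
}
else if ( (ch & 0xf0) == 0xe0 ) {
  // 1110xxxx: lead byte of a 3-byte sequence
}
else if ( (ch & 0xf8) == 0xf0 ) {
  // 11110xxx: lead byte of a 4-byte sequence
}
else {
  // 10xxxxxx continuation byte, or a byte that cannot start a sequence
}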

For each case, use bitwise & to mask off the top x bits of the byte, then check that the first x-1 of those bits are 1 and the last one is 0. It helps to write out the numbers in binary and follow along.
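As a rough sketch of how these masks could be combined into a structural check, the snippet below classifies each lead byte and then confirms that the right number of `10xxxxxx` continuation bytes follow. The names `utf8_lead_length` and `utf8_structure_ok` are hypothetical, and the sketch works on a buffer rather than a file for brevity. It only checks the bit patterns, so it still accepts overlong forms and other sequences that strict UTF-8 forbids.

```c
#include <stddef.h>

/* Hypothetical helper: sequence length implied by a lead byte.
 * Returns 1-4, or 0 for a continuation byte (10xxxxxx) or a byte
 * matching none of the lead patterns (11111xxx). */
int utf8_lead_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Hypothetical helper: 1 if every lead byte is followed by the expected
 * number of continuation bytes, 0 otherwise. Bit patterns only; this is
 * not a full validity check. */
int utf8_structure_ok(const unsigned char *buf, size_t n)
{
    size_t i = 0;
    while (i < n) {
        int len = utf8_lead_length(buf[i]);
        if (len == 0 || (size_t)len > n - i)
            return 0;                  /* stray tail byte or truncated input */
        for (int k = 1; k < len; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;              /* expected 10xxxxxx */
        i += (size_t)len;
    }
    return 1;
}
```

The same logic can be driven from a `fgetc` loop by calling `utf8_lead_length` on each lead byte and then reading that many minus one further bytes, treating `EOF` in the middle of a sequence as a failure.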

jpsalm
  • Thanks for the solution; after working through some examples on paper I clearly understand how this works. However, is there an intuitive reason why you would AND with exactly the values presented? For instance, why is ((ch & 0xf8) == 0xf0) the correct way to check for 4 bytes, etc.? – NewDev90 Sep 13 '19 at 09:32
  • @NewProgrammer 4-byte case: We are trying to identify a byte with a 11110xxx pattern, i.e. four 1s and one 0 for a total of 5 bits. We & with 0xf8 (0b11111000) to select the upper 5 bits of our char, the bits we are interested in. Now we have to check that, of those five bits, the first four are 1 and the last one is 0, so we test for equality against 0xf0 (0b11110000). The other cases are similar but with fewer bits checked. – jpsalm Sep 13 '19 at 14:21

UTF-8 is not hard, but it is stricter than what you realize and what jpsalm's answer suggests. If you want to test that it's valid UTF-8, you need to determine that it conforms to the definition, expressed in ABNF in RFC 3629:

UTF8-octets = *( UTF8-char )
UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1      = %x00-7F
UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail   = %x80-BF

Alternatively, you can do a bunch of math checking for "non shortest form" and other stuff (surrogate ranges), but that's a huge pain, and highly error-prone. Almost every single implementation I've ever seen done this way, even in major widely used software, has been outright wrong on at least one thing. A state machine that accepts UTF-8 is easy to write, and it is easy to verify that it matches the formal definition. One nice, clean, readable one is described in detail at https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
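To make the state-machine idea concrete, here is a minimal sketch that follows the ABNF above one byte at a time. It is not the table-driven DFA from the linked page; it is a simpler illustration, and the names `struct utf8_state`, `utf8_feed`, and `utf8_valid` are hypothetical. The state is just the number of tail bytes still expected plus the allowed range for the next byte, which is narrowed after `E0`, `ED`, `F0`, and `F4` exactly as the grammar requires.

```c
#include <stddef.h>

/* Hypothetical validator state for the RFC 3629 grammar. */
struct utf8_state {
    int need;          /* tail bytes still expected in the current character */
    unsigned char lo;  /* lowest value allowed for the next byte */
    unsigned char hi;  /* highest value allowed for the next byte */
};

/* Feed one byte. Returns 1 if the input is still valid, 0 on a violation. */
int utf8_feed(struct utf8_state *s, unsigned char b)
{
    if (s->need > 0) {                 /* inside a multi-byte character */
        if (b < s->lo || b > s->hi)
            return 0;
        s->need--;
        s->lo = 0x80;                  /* later tails use the plain range */
        s->hi = 0xBF;
        return 1;
    }
    s->lo = 0x80;
    s->hi = 0xBF;
    if (b <= 0x7F)                     /* UTF8-1 */
        return 1;
    if (b >= 0xC2 && b <= 0xDF) {      /* UTF8-2 (C0 and C1 are overlong) */
        s->need = 1;
        return 1;
    }
    if (b >= 0xE0 && b <= 0xEF) {      /* UTF8-3 */
        s->need = 2;
        if (b == 0xE0) s->lo = 0xA0;   /* reject overlong forms */
        if (b == 0xED) s->hi = 0x9F;   /* reject surrogates */
        return 1;
    }
    if (b >= 0xF0 && b <= 0xF4) {      /* UTF8-4 */
        s->need = 3;
        if (b == 0xF0) s->lo = 0x90;   /* reject overlong forms */
        if (b == 0xF4) s->hi = 0x8F;   /* reject values above U+10FFFF */
        return 1;
    }
    return 0;                          /* 80-C1 or F5-FF cannot start a character */
}

/* Validate a whole buffer; the input must not end mid-character. */
int utf8_valid(const unsigned char *buf, size_t n)
{
    struct utf8_state s = {0, 0x80, 0xBF};
    for (size_t i = 0; i < n; i++)
        if (!utf8_feed(&s, buf[i]))
            return 0;
    return s.need == 0;
}
```

For a file, the same `utf8_feed` can be called on each value returned by `fgetc` (cast to `unsigned char`); the file is valid if no call returns 0 and `need` is 0 when `EOF` is reached.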

Community
R.. GitHub STOP HELPING ICE
  • You've successfully demonstrated that there's more to UTF-8 than many people assume, but this doesn't seem to address the question. – Adrian McCarthy Sep 12 '19 at 22:20
  • 1
    @AdrianMcCarthy: "I am trying to write a program which takes a file as input, iterates the file and then check if the file contains UTF-8 encoded characters." <-- you write a state machine that transitions on each byte read by `fgetc`. – R.. GitHub STOP HELPING ICE Sep 12 '19 at 22:55
  • I believe the rest of the question suggests that writing the state machine is the part the OP is having difficulty with. – Adrian McCarthy Sep 13 '19 at 21:23