How can I determine if a file contains UTF-8 like characters

Question

I am trying to write a program that takes a file as input, iterates over it, and checks whether the file contains UTF-8 encoded characters.

However, I am unsure how to approach the problem of UTF-8 encoding. I understand the basic concept: a character is stored in 1 to 4 bytes, where a 1-byte character is just the ASCII representation (0-127).

1 byte: 0xxxxxxx

For the remainder I believe the pattern to be as such:

2 bytes: 110xxxxx 10xxxxxx

3 bytes: 1110xxxx 10xxxxxx 10xxxxxx

4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

However, I am struggling to see how to implement this in C. I know how I would iterate over the file and do something if the UTF-8 predicate holds:

while ((check = fgetc(fp)) != EOF) {
    if (*) {
        // do something to the code
    }
}

However, I am unsure how to actually implement the UTF-8 encoding rules in C (or in any language that does not have a built-in facility for this, such as C#'s UTF8Encoding).

As a simple example, the analogous logic for ASCII would have me iterate over each character (held in the check variable) and verify whether it is within the ASCII range:

if (check >= 0 && check <= 127) {
    // do something to the code
}

Can anyone explain how I would apply similar logic to determine whether the check variable holds a UTF-8 encoded character instead?

`if (ch&0xe0==0xc0){...one byte will follow...}` et cetera... — wildplasser, Sep 12 '19 at 18:55
Just decode the codepoints. And then check they had the shortest possible encoding. — Deduplicator, Sep 12 '19 at 18:58
@wildplasser Can you elaborate on the expression `(ch&0xe0==0xc0)`? I'm not entirely sure how to understand it. – NewDev90, Sep 12 '19 at 19:11
Literally, it means `AND` the contents of `ch` with binary `11100000` (`0xE0`) and test for equality with binary `11000000` (`0xC0`). – ryyker, Sep 12 '19 at 19:19
If the writer chose UTF-8 then the file contains UTF-8 encoded text. Read it with UTF-8. Simple as that. — Tom Blodget, Sep 12 '19 at 23:20
Look at [Really Good, Bad UTF-8 Example Test Data](https://stackoverflow.com/questions/1319022/really-good-bad-utf-8-example-test-data) — there is information there about what makes code invalid as UTF-8, and if the data you're analyzing violates the rules for UTF-8 (e.g. it contains a byte 0xC0, 0xC1, 0xF5..0xFF), then it is definitively not UTF-8. There are also sequencing rules — lots of sequences of bytes are invalid as UTF-8. — Jonathan Leffler, Sep 13 '19 at 00:19
@wildplasser — may I presume you meant `if ((ch & 0xE0) == 0xC0)`, where the issue is the extra parentheses rather than the capitalization or spacing. As it stands, the code in `{…}` will not be executed because `0xE0` does not equal `0xC0`, so the RHS of the `&` is 0, so the result of `ch & 0` is 0. — Jonathan Leffler, Sep 13 '19 at 00:26
@JonathanLeffler you are right, of course. I just stepped into the #dmr trap. – wildplasser, Sep 13 '19 at 21:21
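
To make the precedence point from the comments concrete, here is a small sketch. The function names are made up for illustration. The second function spells out how a C compiler groups the unparenthesized expression, since `==` binds more tightly than `&`.

```c
#include <stdbool.h>

/* Intended test: is `ch` the lead byte of a 2-byte sequence (110xxxxx)? */
static bool is_two_byte_lead(unsigned char ch)
{
    return (ch & 0xE0) == 0xC0;
}

/* How `ch & 0xE0 == 0xC0` is actually grouped: `==` is evaluated first,
   so the expression is `ch & (0xE0 == 0xC0)`, i.e. `ch & 0`, always 0. */
static bool unparenthesized_test(unsigned char ch)
{
    return ch & (0xE0 == 0xC0);
}
```

With these definitions, `is_two_byte_lead(0xC3)` is true, while `unparenthesized_test` is false for every input byte.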

jpsalm · Accepted Answer · 2019-09-13T13:43:41.310

1

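if ( (ch & 0x80) == 0x00 ) {
    // 0xxxxxxx: ASCII, a complete 1-byte character
}
else if ( (ch & 0xe0) == 0xc0 ) {
    // 110xxxxx: lead byte of a 2-byte sequence
}
else if ( (ch & 0xf0) == 0xe0 ) {
    // 1110xxxx: lead byte of a 3-byte sequence
}
else if ( (ch & 0xf8) == 0xf0 ) {
    // 11110xxx: lead byte of a 4-byte sequence
}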

You want to bitwise AND (&) with a mask that selects the first x bits, then check that the first x-1 of those bits are 1 and the last one is 0. It helps to write out the numbers in binary and follow along.
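
The same checks can be wrapped in a helper, shown here as a minimal sketch. The names `utf8_sequence_length` and `is_continuation` are hypothetical, not from any library. It only classifies a byte by its leading bit pattern; it does not by itself prove that the surrounding bytes form valid UTF-8.

```c
/* Returns the total length (1-4) of the sequence that starts with `lead`,
   or 0 if `lead` is a continuation byte (10xxxxxx) or matches no pattern. */
static int utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Every byte after the lead byte must look like 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}
```

When reading with `fgetc`, compare the returned `int` against `EOF` first and only then convert it to `unsigned char` before passing it to these helpers.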

edited Sep 13 '19 at 13:43

answered Sep 12 '19 at 19:33


Thanks for the solution; after working through some examples on paper I clearly understand how this works. However, is there an intuitive reason why you would AND with exactly the values presented? For instance, why is ((ch & 0xf8) == 0xf0) the correct way to check for 4 bytes? – NewDev90 Sep 13 '19 at 09:32
@NewProgrammer 4-byte case: we are trying to identify a byte with a 11110xxx pattern, i.e. four 1s and one 0 for a total of 5 bits. We & with 0xf8 (0b11111000) to select the upper 5 bits of our char, the bits we are interested in. Now we have to check that, of those five bits, the first four are 1 and the last one is 0, so we test for equality against 0xf0 (0b11110000). The other cases are similar but with fewer bits checked. – jpsalm Sep 13 '19 at 14:21
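
One way to see where the constants come from is to derive them from the sequence length n (2 to 4), as in the following sketch. The helper names are invented for illustration. The mask keeps the top n+1 bits, and the expected value has the top n bits set with the next bit clear.

```c
/* Mask selecting the top n+1 bits of a byte, for n in 2..4:
   n=2 -> 0xE0, n=3 -> 0xF0, n=4 -> 0xF8. */
static unsigned char lead_mask(int n)
{
    return (unsigned char)(0xFFu << (7 - n));
}

/* Expected value under that mask: n one-bits followed by a zero bit:
   n=2 -> 0xC0, n=3 -> 0xE0, n=4 -> 0xF0. */
static unsigned char lead_value(int n)
{
    return (unsigned char)(0xFFu << (8 - n));
}

/* True if `ch` has the lead-byte pattern of an n-byte sequence. */
static int is_lead_of(unsigned char ch, int n)
{
    return (ch & lead_mask(n)) == lead_value(n);
}
```

In practice the literal constants are clearer to read; the computed form only shows that they follow one rule.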

score 0 · Answer 2 · edited Oct 07 '21 at 13:31

UTF-8 is not hard, but it is stricter than you may realize and than jpsalm's answer suggests. If you want to test that the input is valid UTF-8, you need to determine that it conforms to the definition, expressed in ABNF in RFC 3629:

UTF8-octets = *( UTF8-char )
UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1      = %x00-7F
UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail   = %x80-BF

Alternatively, you can do a bunch of arithmetic to check for "non-shortest form" and other problems (surrogate ranges), but that's a huge pain and highly error-prone. Almost every implementation I've seen done this way, even in major, widely used software, has been outright wrong on at least one point. A state machine that accepts UTF-8 is easy to write, and it is easy to verify that it matches the formal definition. One nice, clean, readable one is described in detail at https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
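
As a rough sketch of the state-machine idea, the code below tracks how many continuation bytes are still expected and which range the next byte must fall in, following the RFC 3629 ABNF rule by rule. It is not the table-driven DFA from the linked page, and the names `utf8_state`, `utf8_init`, `utf8_feed`, `utf8_done`, and `utf8_valid` are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int remaining;         /* continuation bytes still expected */
    unsigned char lo, hi;  /* allowed range for the next continuation byte */
} utf8_state;

static void utf8_init(utf8_state *s)
{
    s->remaining = 0;
    s->lo = 0x80;
    s->hi = 0xBF;
}

/* Feed one byte; returns false as soon as the input violates the grammar. */
static bool utf8_feed(utf8_state *s, unsigned char b)
{
    if (s->remaining > 0) {
        if (b < s->lo || b > s->hi) return false;
        s->remaining--;
        s->lo = 0x80;  /* any further bytes are plain UTF8-tail */
        s->hi = 0xBF;
        return true;
    }
    s->lo = 0x80;
    s->hi = 0xBF;
    if (b <= 0x7F) return true;                    /* UTF8-1 */
    if (b >= 0xC2 && b <= 0xDF) {                  /* UTF8-2 */
        s->remaining = 1;
        return true;
    }
    if (b >= 0xE0 && b <= 0xEF) {                  /* UTF8-3 */
        s->remaining = 2;
        if (b == 0xE0) s->lo = 0xA0;  /* rejects non-shortest forms */
        if (b == 0xED) s->hi = 0x9F;  /* rejects surrogates */
        return true;
    }
    if (b >= 0xF0 && b <= 0xF4) {                  /* UTF8-4 */
        s->remaining = 3;
        if (b == 0xF0) s->lo = 0x90;  /* rejects non-shortest forms */
        if (b == 0xF4) s->hi = 0x8F;  /* rejects values above U+10FFFF */
        return true;
    }
    return false;  /* 0x80-0xC1 and 0xF5-0xFF never start a character */
}

/* True if the input did not end in the middle of a character. */
static bool utf8_done(const utf8_state *s)
{
    return s->remaining == 0;
}

/* Convenience wrapper for an in-memory buffer. */
static bool utf8_valid(const unsigned char *p, size_t n)
{
    utf8_state s;
    utf8_init(&s);
    for (size_t i = 0; i < n; i++)
        if (!utf8_feed(&s, p[i])) return false;
    return utf8_done(&s);
}
```

For a file, the same functions can be driven from an `fgetc` loop: call `utf8_feed` with each byte (converted to `unsigned char` after the `EOF` check), stop at the first `false`, and call `utf8_done` once `EOF` is reached.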

You've successfully demonstrated that there's more to UTF-8 than many people assume, but this doesn't seem to address the question. — Adrian McCarthy, Sep 12 '19 at 22:20
@AdrianMcCarthy: "I am trying to write a program which takes a file as input, iterates the file and then check if the file contains UTF-8 encoded characters." <-- you write a state machine that transitions on each byte read by `fgetc`. — R.. GitHub STOP HELPING ICE, Sep 12 '19 at 22:55
I believe the rest of the question suggests that writing the state machine is the part the OP is having difficulty with. — Adrian McCarthy, Sep 13 '19 at 21:23
