I am trying to write a program that takes a file as input, iterates over its contents, and checks whether the file contains UTF-8 encoded characters.
However, I am unsure how to approach the problem of UTF-8 encoding. I understand the basic concept: each character is stored in 1 to 4 bytes, and a 1-byte character is just the ASCII representation (0-127).
1 byte: 0xxxxxxx
For the longer sequences, I believe the patterns are as follows:
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
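As far as I can tell, each of these patterns amounts to a mask-and-compare test on the first byte of a sequence. A rough sketch of that idea follows. The helper name `utf8_seq_len` is made up for illustration, and it looks only at the lead byte; it does not check the continuation bytes, overlong forms, or surrogates.

```c
/* Hypothetical helper: returns the sequence length implied by a lead byte,
   or 0 if the byte is a continuation byte (10xxxxxx) or not a valid lead. */
static int utf8_seq_len(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx: plain ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                          /* 10xxxxxx or 11111xxx */
}
```

Each mask keeps the prefix bits of one pattern and the comparison checks that they match, so `utf8_seq_len(0xC3)` gives 2 while a bare continuation byte such as `0xA9` gives 0.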
However, I am struggling to work out how to turn this into a working check in C. I know how I would iterate over the file and do something whenever the UTF-8 predicate holds:
while ((check = fgetc(fp)) != EOF) {
    if (*) {  // (*) stands for the UTF-8 test I don't know how to write
        // do something to the code
    }
}
However, I am unsure how to actually implement the UTF-8 check in C (or in any language without a built-in facility for this, unlike, e.g., C# with its UTF8Encoding class).
As a simple example, the equivalent logic for ASCII would just have me iterate over each character (held in the check variable) and verify that it is within the ASCII range:
if (check >= 0 && check <= 127) {
    // do something to the code
}
Can anyone explain how I would apply similar logic to determine whether the check variable holds the start of a UTF-8 encoded character instead?
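For reference, here is one way the ASCII range test could be extended to whole multi-byte sequences. It is a minimal sketch, assuming a purely structural check is enough. The function name `stream_looks_like_utf8` is hypothetical, and the code does not reject overlong encodings, surrogate code points, or values above U+10FFFF.

```c
#include <stdio.h>

/* Sketch: returns 1 if every byte in the stream fits the UTF-8 bit
   patterns, 0 otherwise. Structural check only (see the caveats above). */
static int stream_looks_like_utf8(FILE *fp)
{
    int check;  /* int, not char, so EOF can be told apart from byte 0xFF */

    while ((check = fgetc(fp)) != EOF) {
        int extra;  /* number of continuation bytes expected after the lead */

        if ((check & 0x80) == 0x00)      extra = 0;  /* 0xxxxxxx (ASCII) */
        else if ((check & 0xE0) == 0xC0) extra = 1;  /* 110xxxxx */
        else if ((check & 0xF0) == 0xE0) extra = 2;  /* 1110xxxx */
        else if ((check & 0xF8) == 0xF0) extra = 3;  /* 11110xxx */
        else return 0;  /* stray continuation byte or invalid lead byte */

        while (extra-- > 0) {
            check = fgetc(fp);
            /* Each following byte must exist and match 10xxxxxx. */
            if (check == EOF || (check & 0xC0) != 0x80)
                return 0;
        }
    }
    return 1;
}
```

The file would presumably be opened in binary mode (`fopen(path, "rb")`) so that no byte is translated on the way in. Note that a pure ASCII file also passes, since ASCII is a subset of UTF-8.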