I am trying to determine the width, in bytes, of a UTF-8 character based on its binary representation, and with that count the number of characters in a UTF-8 string. Below is my code.
#include <stdlib.h>
#include <stdio.h>

static const char* test1 = "发f";
static const char* test2 = "ด้ดีด้ดี";

unsigned utf8_char_size(unsigned char val) {
    if (val < 128) {
        return 1;
    } else if (val < 224) {
        return 2;
    } else if (val < 240) {
        return 3;
    } else {
        return 4;
    }
}

unsigned utf8_count_chars(const unsigned char* data)
{
    unsigned total = 0;

    while (*data != 0) {
        unsigned char_width = utf8_char_size(*data);
        total++;
        data += char_width;
    }
    return total;
}

int main(void) {
    fprintf(stdout, "The count is %u\n", utf8_count_chars((unsigned char*)test1));
    fprintf(stdout, "The count is %u\n", utf8_count_chars((unsigned char*)test2));
    return 0;
}
The problem is that I get "The count is 2" for the first test above, which makes sense. For the second one, test2, with 4 Thai letters, it prints 8, which does not look correct to me.
I would like to know what my code is doing wrong. Furthermore, given an array of unsigned char in C, how does one iterate through the bytes as UTF-8 characters?
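To make the second part concrete, here is a minimal sketch of the kind of iteration I mean, decoding one code point at a time. The names utf8_next and utf8_count_code_points are hypothetical (not from any library), and the sketch assumes NUL-terminated input. It replaces a stray or truncated sequence with U+FFFD and does not reject overlong encodings or surrogates, so it is not a full validator.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the code point starting at s into *cp.
 * Returns the number of bytes consumed, or 0 at the terminating NUL.
 * A malformed or truncated sequence yields U+FFFD and consumes 1 byte. */
size_t utf8_next(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;
    uint32_t c = s[0];

    if (c == 0) {
        return 0;                       /* end of string */
    } else if (c < 0x80) {
        *cp = c;                        /* 0xxxxxxx: ASCII */
        return 1;
    } else if ((c & 0xE0) == 0xC0) {
        len = 2; c &= 0x1F;             /* 110xxxxx */
    } else if ((c & 0xF0) == 0xE0) {
        len = 3; c &= 0x0F;             /* 1110xxxx */
    } else if ((c & 0xF8) == 0xF0) {
        len = 4; c &= 0x07;             /* 11110xxx */
    } else {
        *cp = 0xFFFD;                   /* stray continuation or invalid lead byte */
        return 1;
    }

    for (i = 1; i < len; i++) {
        /* Every following byte must look like 10xxxxxx. This check also
         * stops at a NUL, so the loop never reads past the terminator. */
        if ((s[i] & 0xC0) != 0x80) {
            *cp = 0xFFFD;
            return 1;
        }
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Count code points (not user-perceived characters) in a NUL-terminated string. */
unsigned utf8_count_code_points(const unsigned char *s)
{
    unsigned total = 0;
    uint32_t cp;
    size_t n;

    while ((n = utf8_next(s, &cp)) != 0) {
        total++;
        s += n;
    }
    return total;
}
```

Calling utf8_count_code_points on the two test strings gives 2 and 8. The Thai string is encoded as eight code points (each of the four consonants U+0E14 is followed by a combining mark, U+0E49 or U+0E35), so a count of 4 would require grouping code points into grapheme clusters, which this sketch does not attempt.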