I am trying to determine the width, in bytes, of a UTF-8 character from the value of its first byte, and to use that to count the number of characters in a UTF-8 string. My code is below.

#include <stdlib.h>
#include <stdio.h>

static const char* test1 = "发f";
static const char* test2 = "ด้ดีด้ดี";

unsigned utf8_char_size(unsigned char val) {
    if (val < 128) {
        return 1;
    } else if (val < 224) {
        return 2;
    } else if (val < 240) {
        return 3;
    } else {
        return 4;
    }
}

unsigned utf8_count_chars(const unsigned char* data)
{
  unsigned total = 0;
  while(*data != 0) {
    unsigned char_width = utf8_char_size(*data);
    total++;
    data += char_width;
  }
  return total;
}

int main(void) {
  fprintf(stdout, "The count is %u\n", utf8_count_chars((unsigned char*)test1));
  fprintf(stdout, "The count is %u\n", utf8_count_chars((unsigned char*)test2));
  return 0;
}

The problem is that I get `The count is 2` for the first test, which makes sense, but for the second one, test2, which has 4 Thai letters, it prints 8, which is not what I expected.

I would like to know what my code is doing wrong. Furthermore, given an array of unsigned char in C, how does one iterate through the bytes as UTF-8 characters?

Josh Weinstein
  • The code is correct. One possibility is that the compiler is not encoding your string constants as utf-8. I'd start with a piece of test code that just prints the decimal value of each element of string `test2`. The utf-8 encoding that I see is `224 184 148 224 185 137 224 184 148 224 184 181 224 184 148 224 185 137 224 184 148 224 184 181` which indicates that the marks above the characters are encoded separately. – user3386109 Jun 23 '19 at 06:57
  • Please do not tag your question as both C and C++. They are different languages and feature different idioms. – L. F. Jun 23 '19 at 07:04
  • @user3386109 You are correct, it seems it prints 8, one for each character and one for each marking above – Josh Weinstein Jun 23 '19 at 07:06
  • @JoshWeinstein The question now contradicts itself. It says *"the count is 2 for both"* and then says *"the second one [...] prints 8"*. In any case, what changes did you make to get a count of 8? – user3386109 Jun 23 '19 at 07:38
  • Please fix the question, it does print 8. – Antti Haapala -- Слава Україні Jun 23 '19 at 07:47
  • Use `static const char* test1 = u8"ด้ดีด้ดี";` to ensure UTF-8 encoding. – chux - Reinstate Monica Jun 23 '19 at 12:00
  • An informative step would be `printf("%zu\n", strlen(test2));`, simply to see the length in bytes of the UTF-8 string. – chux - Reinstate Monica Jun 23 '19 at 12:07

1 Answer

The code measures neither characters nor glyphs but code points. A single user-perceived character can be composed of multiple Unicode code points. In this case the Thai text has 8 code points: four base consonants, each followed by a combining vowel sign or tone mark.

Unicode strings are easier to inspect in Python than in C, so here's a small Python 3.6 demonstration using the built-in Unicode database:

>>> import unicodedata
>>> for i in 'ด้ดีด้ดี':
...     print(f'{ord(i):04X} {unicodedata.name(i)}')
... 
0E14 THAI CHARACTER DO DEK
0E49 THAI CHARACTER MAI THO
0E14 THAI CHARACTER DO DEK
0E35 THAI CHARACTER SARA II
0E14 THAI CHARACTER DO DEK
0E49 THAI CHARACTER MAI THO
0E14 THAI CHARACTER DO DEK
0E35 THAI CHARACTER SARA II
  • Yes. @Josh, are you sure you want to count the number of code points and not grapheme clusters (glyphs)? What of significance does the number of code points correlate to in your use case? – Tom Blodget Jun 23 '19 at 20:15