How do I iterate a std::string to get a set of Freetype FT_ULong?

Question

std::string is commonly interpreted as UTF8, hence has a variable length encoding. In my font renderer I've hit a problem in that I'm not sure how to get a "character" from a std::string and convert it into a Freetype FT_ULong in order to get a glyph with FT_Get_Char_Index. That is to say, I am not sure that what I'm doing is "correct" as I'm just iterating through std::string and casting the resulting chars over (surely this is incorrect, although it works with my OS defaults).

So is there a "correct" way of doing this and more importantly has someone written a library that implements this "correct" way that I can use off the shelf?

You do know how [UTF8](http://en.wikipedia.org/wiki/UTF8) is encoded? Just read the next "character" from the string, and you will know how many bytes is needed for the current code-point and how to parse it into any other encoding. — Some programmer dude, Jul 25 '14 at 10:09
Nothing in the standard, but there are ICU and glib; see here: http://stackoverflow.com/questions/4579215/cross-platform-iteration-of-unicode-string-counting-graphemes-using-icu (ICU would be the most powerful choice, but is a pretty big beast). — peterchen, Jul 25 '14 at 10:28
I wonder if there's a simple function I need that just converts UTF8 to 32 bits (unsigned long). Though I'm reading UTF8 can have up to 6 bytes, which seems a bit excessive but for all practical purposes.... I suppose I also need a function that determines whether the string is well formed or not too. — Robinson, Jul 25 '14 at 11:15
@Robinson UTF8 might be up to six bytes for a single code-point, but it's still only 31 bits of data. You *do* know how it's encoded, yes? — Some programmer dude, Jul 25 '14 at 11:19
I've got it now actually. I found some code where the conversion is done, with a table and looking through it I kind-of "got" it. Still surprising to me that there's no std:: function to do this. Seems a kind-of basic thing these days. http://sydney.edu.au/engineering/it/~graphapp/package/src/utility/utf8.c — Robinson, Jul 25 '14 at 13:25

score 1 · Accepted Answer · answered Jul 25 '14 at 10:21

You should first check how UTF8 is encoded. The leading bits of the first byte of each sequence tell you how many bytes that sequence occupies.

See http://en.wikipedia.org/wiki/UTF8

And then you can write code like this:

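  if ((byte & 0x80) == 0x00) {
    // 1-byte UTF8 char (lead byte 0xxxxxxx)
  }
  else if ((byte & 0xE0) == 0xC0) {
    // 2-byte UTF8 char (lead byte 110xxxxx)
  }
  else if ((byte & 0xF0) == 0xE0) {
    // 3-byte UTF8 char (lead byte 1110xxxx)
  }
  else if ((byte & 0xF8) == 0xF0) {
    // 4-byte UTF8 char (lead byte 11110xxx)
  }
  // anything else is a continuation byte (10xxxxxx) or an invalid lead byte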

Then you can iterate over each UTF8 character in the std::string, advancing by the correct number of bytes each time.
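
The lead-byte checks above can be extended into a small decoder. What follows is only a minimal sketch, not a vetted library: the names `CodePoint`, `kReplacement` and `decode_utf8` are made up for illustration, `CodePoint` is an `unsigned long` standing in for `FT_ULong` so that the code compiles without the FreeType headers, and emitting U+FFFD for malformed input is one possible policy among several. It assumes the modern definition of UTF8, in which a sequence is at most 4 bytes long and encodes a value no greater than U+10FFFF.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for FT_ULong, so the sketch needs no FreeType headers.
using CodePoint = unsigned long;

// Emitted in place of malformed input (a policy chosen for this sketch).
constexpr CodePoint kReplacement = 0xFFFD;

// Decode a UTF8 string into a sequence of Unicode code points.
std::vector<CodePoint> decode_utf8(const std::string& s) {
    std::vector<CodePoint> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        CodePoint cp = 0;
        std::size_t len = 0;
        CodePoint min = 0;  // smallest value that legitimately needs `len` bytes

        if ((lead & 0x80) == 0x00)      { cp = lead;        len = 1; min = 0x0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; min = 0x10000; }
        else {
            // Stray continuation byte or invalid lead byte: skip one byte.
            out.push_back(kReplacement);
            ++i;
            continue;
        }

        // The sequence must fit in the string, and every following byte
        // must be a continuation byte of the form 10xxxxxx.
        bool ok = (i + len <= s.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, UTF-16 surrogates and values above U+10FFFF.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = false;
        }

        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(kReplacement); ++i; }
    }
    return out;
}
```

Each value in the returned vector could then be passed as the character code to FT_Get_Char_Index, provided the face has a Unicode charmap selected. Note that this yields code points, not user-perceived characters: combining sequences and complex scripts need a text shaping step on top of it.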
