
std::string is commonly interpreted as UTF-8, which is a variable-length encoding. In my font renderer I've hit a problem: I'm not sure how to get a "character" out of a std::string and convert it into a FreeType FT_ULong so that I can look up a glyph with FT_Get_Char_Index. At the moment I'm just iterating through the std::string and casting each char to FT_ULong. This is surely incorrect for anything outside ASCII, although it happens to work with my OS defaults.

So is there a "correct" way of doing this and more importantly has someone written a library that implements this "correct" way that I can use off the shelf?

leemes
Robinson
  • You do know how [UTF8](http://en.wikipedia.org/wiki/UTF8) is encoded? Just read the next "character" from the string, and you will know how many bytes are needed for the current code-point and how to parse it into any other encoding. – Some programmer dude Jul 25 '14 at 10:09
  • Or you could convert it to UTF-16 (or even UTF-32). – leemes Jul 25 '14 at 10:10
  • Nothing in the standard, but ICU, glib, see here: http://stackoverflow.com/questions/4579215/cross-platform-iteration-of-unicode-string-counting-graphemes-using-icu (ICU would be the most powerful choice, but is a pretty big beast) – peterchen Jul 25 '14 at 10:28
  • @leemes: UTF-16 is a variable length encoding as well – peterchen Jul 25 '14 at 10:29
  • I wonder if there's a simple function I need that just converts UTF8 to 32 bits (unsigned long). Though I'm reading UTF8 can have up to 6 bytes, which seems a bit excessive but for all practical purposes.... I suppose I also need a function that determines whether the string is well formed or not too. – Robinson Jul 25 '14 at 11:15
  • @Robinson UTF8 might be up to six bytes for a single code-point, but it's still only 31 bits of data. You *do* know how it's encoded, yes? – Some programmer dude Jul 25 '14 at 11:19
  • I've got it now actually. I found some code where the conversion is done, with a table and looking through it I kind-of "got" it. Still surprising to me that there's no std:: function to do this. Seems a kind-of basic thing these days. http://sydney.edu.au/engineering/it/~graphapp/package/src/utility/utf8.c – Robinson Jul 25 '14 at 13:25

1 Answer


You should first check how UTF-8 is encoded: the leading bits of the first byte of each sequence tell you how many bytes that code point occupies.

See http://en.wikipedia.org/wiki/UTF8

And then you can write code like this:

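  // byte is the first byte of a sequence, read as unsigned char
  if ((byte & 0x80) == 0x00) {
    // 1-byte UTF-8 char (ASCII): 0xxxxxxx
  }
  else if ((byte & 0xE0) == 0xC0) {
    // 2-byte UTF-8 char: 110xxxxx
  }
  else if ((byte & 0xF0) == 0xE0) {
    // 3-byte UTF-8 char: 1110xxxx
  }
  else if ((byte & 0xF8) == 0xF0) {
    // 4-byte UTF-8 char: 11110xxx
  }
  // anything else is a continuation byte (10xxxxxx) or an invalid lead byte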

You can then iterate over the UTF-8 characters in the std::string, advancing by the correct number of bytes for each one.
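As a rough sketch of how those lead-byte checks can be turned into a full decoder, the following converts a std::string into a sequence of 32-bit code points. The function names `decode_utf8_at` and `utf8_to_code_points` are hypothetical helpers made up for this illustration, not part of any library. The sketch assumes the modern 4-byte limit on UTF-8 (code points up to U+10FFFF) and substitutes U+FFFD for malformed input, which is one common policy but not the only reasonable one.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes the code point starting at s[i] and advances i
// past the bytes it consumed. Returns U+FFFD for malformed input.
// Precondition: i < s.size().
inline char32_t decode_utf8_at(const std::string& s, std::size_t& i) {
    const char32_t kReplacement = 0xFFFD;
    const unsigned char lead = static_cast<unsigned char>(s[i++]);

    std::size_t extra = 0;  // number of continuation bytes expected
    char32_t cp = 0;        // code point accumulated so far
    char32_t min_cp = 0;    // smallest value allowed for this length
    if ((lead & 0x80) == 0x00) {
        return lead;                                   // 0xxxxxxx: ASCII
    } else if ((lead & 0xE0) == 0xC0) {
        extra = 1; cp = lead & 0x1F; min_cp = 0x80;    // 110xxxxx
    } else if ((lead & 0xF0) == 0xE0) {
        extra = 2; cp = lead & 0x0F; min_cp = 0x800;   // 1110xxxx
    } else if ((lead & 0xF8) == 0xF0) {
        extra = 3; cp = lead & 0x07; min_cp = 0x10000; // 11110xxx
    } else {
        return kReplacement;  // stray continuation byte or invalid lead byte
    }

    for (std::size_t k = 0; k < extra; ++k) {
        if (i >= s.size()) return kReplacement;        // truncated sequence
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) return kReplacement;   // not 10xxxxxx; leave it unconsumed
        cp = (cp << 6) | (c & 0x3F);
        ++i;
    }

    // Reject overlong encodings, UTF-16 surrogates, and out-of-range values.
    if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return kReplacement;
    }
    return cp;
}

// Hypothetical helper: decodes a whole string into code points.
inline std::vector<char32_t> utf8_to_code_points(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        out.push_back(decode_utf8_at(s, i));
    }
    return out;
}
```

Each resulting value can be cast to FT_ULong and passed to FT_Get_Char_Index, assuming the face uses a Unicode charmap. Note that a code point is not always a user-perceived character; combining marks and similar cases need a library such as ICU, as mentioned in the comments.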

Mine