How to get the UNICODE code from each character of a UTF-8 string?

Question

With C++11, given a UTF-8 encoded std::string, how can I get the Unicode value (code point) of each character of the text as a uint32_t?

Something like:

void f(const std::string &utf8_str)
{
    for(???) {
       uint32_t code = ???;

       /* Do my stuff with the code... */
    }
}

Does it help to assume that the host system locale is UTF-8? What standard library tools does C++11 offer for this task?

score 5 · Accepted Answer · answered Feb 11 '14 at 20:01


You can simply convert the string into a UTF-32 encoded one, using the provided conversion facet and std::wstring_convert from <locale>:

#include <codecvt>
#include <locale>
#include <string>

void foo(std::string const & utf8str)
{
     std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
     std::u32string utf32str = conv.from_bytes(utf8str);

     for (char32_t u : utf32str)  { /* ... */ }
}
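
If you would rather not depend on <codecvt> (std::wstring_convert and std::codecvt_utf8 were deprecated in C++17), the decoding can also be done by hand. What follows is only a minimal sketch under stated assumptions: decode_utf8 is a hypothetical helper name, not a standard function, and throwing std::invalid_argument on malformed input is just one possible error policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of the standard library): decodes a UTF-8
// string into Unicode code points, throwing on malformed input.
std::vector<std::uint32_t> decode_utf8(const std::string& s)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;

        // The lead byte gives the sequence length and the top payload bits.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (len > s.size() - i)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Inside f() from the question this could be used as for (uint32_t code : decode_utf8(utf8_str)) { /* ... */ }.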


Kerrek SB


Do you know how to get, instead of UTF-8, the codecvt for the system native encoding? – lvella Feb 11 '14 at 20:30
@lvella: You can convert the system's narrow encoding to UTF32 with [`mbrtoc32`](http://en.cppreference.com/w/cpp/string/multibyte/mbrtoc32). The table at the bottom of the linked page shows all the available combinations. ([I'm not sure](http://stackoverflow.com/questions/7562609/what-does-cuchar-provide-and-where-is-it-documented) if `<cuchar>` is widely implemented yet, though.) – Kerrek SB Feb 11 '14 at 20:35

@lvella the system's native encoding instead of UTF-8? If you mean something like GB18030 (another 8-bit Unicode format), then you can use codecvt_byname or pull it out of the locale with use_facet. [This example](http://en.cppreference.com/w/cpp/locale/wstring_convert/wstring_convert) shows how to build a wstring_convert with it. – Cubbi Feb 11 '14 at 21:53
...and you probably need to call `std::setlocale(LC_CTYPE, "")` or something appropriate for the stream in question to obtain the actual system locale... – Kerrek SB Feb 11 '14 at 22:10
That `mbrtoc32` is a very weird thing... it documents negative return values, but returns a `size_t`, that is unsigned. – lvella Feb 12 '14 at 02:17
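
On the "negative" return values: they are special size_t values, so they have to be compared against static_cast<std::size_t>(-1), (-2) and (-3) rather than tested with < 0. The following is a rough sketch, assuming <cuchar> is available on your implementation; narrow_to_utf32 is a hypothetical helper name, and returning false on any error is only one possible policy. It converts from the current C locale's narrow encoding, so the caller is expected to have called std::setlocale(LC_CTYPE, "") first if the system locale is wanted.

```cpp
#include <cstddef>
#include <cuchar>
#include <string>

// Hypothetical helper: converts a string in the current C locale's narrow
// encoding to UTF-32. Returns false on invalid or truncated input.
bool narrow_to_utf32(const std::string& in, std::u32string& out)
{
    std::mbstate_t state{};               // initial conversion state
    const char* p = in.data();
    const char* const end = p + in.size();
    out.clear();

    while (p < end) {
        char32_t c32;
        const std::size_t rc =
            std::mbrtoc32(&c32, p, static_cast<std::size_t>(end - p), &state);

        if (rc == static_cast<std::size_t>(-1)) return false; // encoding error
        if (rc == static_cast<std::size_t>(-2)) return false; // incomplete sequence
        if (rc == static_cast<std::size_t>(-3)) {
            // A further char32_t from an earlier call; no bytes were consumed.
            out.push_back(c32);
            continue;
        }

        out.push_back(c32);
        // rc == 0 means a null character was converted; this sketch assumes
        // it occupies one byte in the narrow encoding.
        p += (rc == 0 ? 1 : rc);
    }
    return true;
}
```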

score 1 · Answer 2 · answered Feb 11 '14 at 19:57

Using <utf8.h> from http://utfcpp.sourceforge.net/ you could code:

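 #include <stddef.h>     // size_t
 #include <stdint.h>     // uint32_t
 #include <string.h>     // strlen
 #include <functional>   // std::function
 #include <iterator>     // std::back_inserter
 #include <string>
 #include <utf8.h>       // UTF8-CPP

 // replace every invalid UTF-8 sequence in str with the replacement character
 static inline void fix_utf8_string(std::string& str)
 {
   std::string temp;
   utf8::replace_invalid(str.begin(), str.end(), std::back_inserter(temp));
   str = temp;
 }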

 static inline bool valid_utf8_cstr(const char*s)
 {
   if (!s) return false;
   const char* e = s+strlen(s);
   return utf8::is_valid(s,e);
 }

 static inline size_t
 utf8_length(const char*s)
 {
   if (!s) return 0;
   const char* e = s+strlen(s);
   return utf8::distance(s,e);
 }


 // apply a function to every code point, stopping as soon as that function
 // gives true, and return the number of code points processed before the stop
 static inline size_t
 utf8_foreach_if(const char*s, 
                 std::function<bool(uint32_t,size_t)>f)
 {
   if (!s) return 0;
   size_t ix=0;
   const char*pc = s;
   const char*epc = s+strlen(s);   // end of the NUL-terminated input
   while(pc < epc)
     {
       uint32_t c = utf8::next(pc,epc);
       if (f(c,ix)) break;
       ix++;
     };
   return ix;
 }

 static inline size_t
 utf8_foreach_if(const std::string& s, 
                 std::function<bool(uint32_t,size_t)>f)
 {
   if (s.empty()) return 0;
   size_t ix=0;
   const char*pc = s.c_str();
   const char*epc = pc + s.size();
   while(pc < epc)   // compare against the end so an embedded '\0' does not stop the loop
     {
       uint32_t c = utf8::next(pc,epc);
       if (f(c,ix)) break;
       ix++;
     };
   return ix;
 }

This is extracted from some code licensed under GPLv3 that I will release in a few weeks or months.
