Given a Unicode string, I want to retrieve the list of code points that make it up. To do so, I copied the following code from Boost's character iteration example:
#include <boost/locale.hpp>
#include <iostream>
#include <string>
using namespace boost::locale::boundary;
int main()
{
    boost::locale::generator gen;
    std::string text = "To be or not to be";
    // Create a character-boundary mapping of the text using a generated en_US.UTF-8 locale.
    ssegment_index map(character, text.begin(), text.end(), gen("en_US.UTF-8"));
    // Print every character segment, quoted and comma-separated
    for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it) {
        std::cout << "\"" << *it << "\", ";
    }
    std::cout << std::endl;
    return 0;
}
It gives me characters (which, according to Boost's documentation, are different from code points), like this:
"T", "o", " ", "b", "e", " ", "o", "r", " ", "n", "o", "t", " ", "t", "o", " ", "b", "e",
I read that the to_unicode function in the boost::locale::util::base_converter class can retrieve the code points of a given string, but I am not sure how to use it. I tried the following code, without success:
for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it) {
    std::cout << "\"" << *it << "\", ";
    boost::locale::util::base_converter encoder_decoder;
    virtual uint32_t test1 = encoder_decoder.to_unicode(it->begin(), it->end() );
}
It fails with a 'Type mismatch' error. I think the parameters of the to_unicode() function must be something different.
I would like to use only Boost to retrieve code points, rather than existing solutions like here or here, because Boost provides lots of useful functions for identifying line breaks, word breaks, etc. in all sorts of languages.