Retrieving code points using Boost.Locale library

Question

From a given Unicode string, I want to retrieve the list of code points that make up the string. To do so, I copied the following code from Boost's character iteration example:

#include <boost/locale.hpp>
#include <iostream>
#include <string>

using namespace boost::locale::boundary;

int main()
{
    boost::locale::generator gen;
    std::string text = "To be or not to be";

    // Create a character-boundary mapping of the text using a generated en_US.UTF-8 locale.
    ssegment_index map(character, text.begin(), text.end(), gen("en_US.UTF-8"));

    // Print every "character" -- each chunk between character boundaries
    for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it) {
        std::cout << "\"" << *it << "\", ";
    }
    std::cout << std::endl;

    return 0;
}

It returns characters (which, per Boost's documentation, are different from code points) like this:

"T", "o", " ", "b", "e", " ", "o", "r", " ", "n", "o", "t", " ", "t", "o", " ", "b", "e",

I read that the to_unicode function in the boost::locale::util::base_converter class can retrieve the code points of a given string, but I am not sure how to use it. I tried the following code, but it did not help:

for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it) {
    std::cout << "\"" << * it << "\", ";
    boost::locale::util::base_converter encoder_decoder;
    virtual uint32_t test1 = encoder_decoder.to_unicode(it->begin(), it->end() );
}

It fails with a 'Type mismatch' error. I think the parameters of the to_unicode() function must be something different.

I would prefer to use only Boost to retrieve code points, rather than existing solutions like here or here, because Boost provides many useful functions for identifying line breaks, word breaks, etc. in all sorts of languages.

score 1 · Answer 1 · answered May 23 '16 at 19:22

To get the code points, you can use boost::u8_to_u32_iterator. This works because a UTF-32 code unit is equal to the code point it encodes.

#include <boost/regex/pending/unicode_iterator.hpp>
#include <string>
#include <iostream>

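void printCodepoints(const std::string& input) {
    // Dereferencing the iterator yields a 32-bit integer, so each code point prints as a decimal number.
    for(boost::u8_to_u32_iterator<std::string::const_iterator> it(input.begin()), end(input.end()); it!=end; ++it)
        std::cout << "\"" << *it << "\", ";
}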

int main() {
    printCodepoints("Hello World!");
    return 0;
}
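
To make the "UTF-32 value equals code point" idea concrete, here is a rough sketch of what such a decoding step does, written with only the standard library. The helper name decode_utf8 is hypothetical and is not part of Boost; it also skips some validation (overlong encodings, surrogates), so treat it as an illustration rather than a replacement for the iterator above.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not a Boost API): decode a UTF-8 string into code points.
// Simplified sketch: it does not reject overlong encodings or surrogate values.
std::vector<std::uint32_t> decode_utf8(const std::string& text) {
    std::vector<std::uint32_t> code_points;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow the lead byte

        if (lead < 0x80)                { cp = lead;        extra = 0; }  // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }  // 11110xxx
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= text.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(text[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }

        code_points.push_back(cp);
        i += extra + 1;
    }
    return code_points;
}
```

Calling decode_utf8 on the UTF-8 bytes of the euro sign ("\xE2\x82\xAC") should give a single element, 0x20AC, which is the same number you would see as its UTF-32 value.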
