How to use boost::spirit to parse UTF-8?

Question

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

#define BOOST_SPIRIT_UNICODE // We'll use unicode (UTF8) all throughout

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/qi_parse.hpp>
#include <boost/spirit/include/support_standard_wide.hpp>

void parse_simple_string()
{
    namespace qi = boost::spirit::qi;    
    namespace encoding  = boost::spirit::unicode;
    //namespace stw = boost::spirit::standard_wide;

    typedef std::wstring::const_iterator iterator_type;

    std::vector<std::wstring> result;
    std::wstring const input = LR"(12,3","ab,cd","G,G\"GG","kkk","10,\"0","99987","PPP","你好)";

    qi::rule<iterator_type, std::wstring()> key = +(qi::unicode::char_ - qi::lit(L"\",\""));
    qi::phrase_parse(input.begin(), input.end(),
                     key % qi::lit(L"\",\""),
                     encoding::space,
                     result);

    //std::copy(result.rbegin(), result.rend(), std::ostream_iterator<std::wstring, wchar_t>  (std::wcout, L"\n"));
    for(auto const &data : result) std::wcout<<data<<std::endl;
}

I studied the post "How to use Boost Spirit to parse Chinese(unicode utf-16)?" and followed its advice, but the parser still fails on the word "你好".

The expected result is:

12,3 ab,cd G,G\"GG kkk 10,\"0 99987 PPP 你好

but the actual result is:

12,3 ab,cd G,G\"GG kkk 10,\"0 99987 PPP

The OS is 64-bit Windows 7, and my editor saves the source file as UTF-8.

I'm confused. You are ... using UTF8? Why the wstring then? (UTF8 is an encoding single/double/triple byte-sequences, right). I don't feel qualified to explain better, but this is a mismatch in my perception — sehe, Dec 06 '12 at 10:47
1-4 bytes. But yes, that's a fairly glaring mismatch. Until `char8_t` is introduced, `char` is the UTF-8 type of choice for most. — Puppy, Dec 06 '12 at 11:25
What everyone said. `wstring` is just wrong when using UTF-8. If you want properly encoded UTF-8 literals, *especially* on Windows, the safest way is to either use the C++11 literals `u8"blah"` (which are not in Visual Studio yet) or use byte escapes with the right encoding directly, i.e. "\xE4\xBD\xA0\xE5\xA5\xBD" instead of "你好". — R. Martinho Fernandes, Dec 06 '12 at 11:29

Evgeny Panasyuk · Accepted Answer · 2015-04-16T00:06:45.187

If you have UTF-8 at input, then you may try to use Unicode Iterators from Boost.Regex.

For instance, use boost::u8_to_u32_iterator:

A Bidirectional iterator adapter that makes an underlying sequence of UTF8 characters look like a (read-only) sequence of UTF32 characters.


#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/range.hpp>
#include <iterator>
#include <iostream>
#include <ostream>
#include <cstdint>
#include <vector>

int main()
{
    using namespace boost;
    using namespace spirit::qi;
    using namespace std;

    auto &&utf8_text=u8"你好，世界！";
    u8_to_u32_iterator<const char*>
        tbegin(begin(utf8_text)), tend(end(utf8_text));

    vector<uint32_t> result;
    parse(tbegin, tend, *standard_wide::char_, result);
    for(auto &&code_point : result)
        cout << "&#" << code_point << ";";
    cout << endl;
}

Output (the trailing &#0; is the string literal's terminating NUL, which begin/end on the array include):

&#20320;&#22909;&#65292;&#19990;&#30028;&#65281;&#0;

score 2 · Answer 2 · answered Aug 24 '21 at 13:10

Although Evgeny Panasyuk's answer is correct, u8_to_u32_iterator can read past the end of the buffer if the input is not NUL-terminated. Consider the following example:

File foobar.cpp

#include "boost/regex/pending/unicode_iterator.hpp"
#include <iostream>

int main() {
    const char contents[] = {'H', 'e', 'l', 'l', 'o', '\xF1'};

    using utf8_iter = boost::u8_to_u32_iterator<const char *>;
    auto iter = utf8_iter{contents};
    auto end = utf8_iter{contents + sizeof(contents)};

    for (; iter != end; ++iter)
        std::cout << *iter << '\n';
}

When this is compiled with clang++ -g -fsanitize=address -std=c++17 -I path/to/boost/ -o foobar foobar.cpp and run, AddressSanitizer reports a stack-buffer-overflow. The error occurs because the last character in the buffer is the leading byte of a 4-byte UTF-8 sequence, so the iterator keeps reading the bytes after it, which lie outside the buffer.

If the buffer ends with a NUL instead, as in const char contents[] = "Hello\xF1";, the iterator detects the encoding error when it reads the NUL character and stops reading further. The result is an uncaught exception instead of undefined behavior.

In short, make sure the input is NUL-terminated before using boost::u8_to_u32_iterator, or you risk undefined behavior.
