Handling utf-8 in Boost.Spirit with utf-32 parser

Question

I have an issue similar to "How to use boost::spirit to parse UTF-8?" and "How to match unicode characters with boost::spirit?", but neither of those solves the problem I'm facing. I have a std::string containing UTF-8 text; I wrapped it with u8_to_u32_iterator and used unicode terminals like this:

BOOST_NETWORK_INLINE void parse_headers(std::string const & input, std::vector<request_header_narrow> & container) {
        using namespace boost::spirit::qi;
        u8_to_u32_iterator<std::string::const_iterator> begin(input.begin()), end(input.end());
        std::vector<request_header_narrow_utf8_wrapper> wrapper_container;
        parse(
            begin, end,
            *(
                +(alnum|(punct-':'))
                >> lit(": ")
                >> +((unicode::alnum|space|punct) - '\r' - '\n')
                >> lit("\r\n")
            )
            >> lit("\r\n")
            , wrapper_container
            );
        BOOST_FOREACH(request_header_narrow_utf8_wrapper header_wrapper, wrapper_container)
        {
            request_header_narrow header;
            u32_to_u8_iterator<request_header_narrow_utf8_wrapper::string_type::iterator> name_begin(header_wrapper.name.begin()),
                                                                                          name_end(header_wrapper.name.end()),
                                                                                          value_begin(header_wrapper.value.begin()),
                                                                                          value_end(header_wrapper.value.end());
            for(; name_begin != name_end; ++name_begin)
                header.name += *name_begin;
            for(; value_begin != value_end; ++value_begin)
                header.value += *value_begin;
            container.push_back(header);
       }
    }

The request_header_narrow_utf8_wrapper is defined and mapped to Fusion like this (don't mind the missing namespace declarations):

struct request_header_narrow_utf8_wrapper
{
    typedef std::basic_string<boost::uint32_t> string_type;
    std::basic_string<boost::uint32_t> name, value;
};

BOOST_FUSION_ADAPT_STRUCT(
    boost::network::http::request_header_narrow_utf8_wrapper,
    (std::basic_string<boost::uint32_t>, name)
    (std::basic_string<boost::uint32_t>, value)
    )

This works fine, but I was wondering: can I somehow make the parser assign directly to a struct containing std::string members, instead of running the for-each loop with the u32_to_u8_iterator? One way might be a wrapper for std::string that has an assignment operator taking boost::uint32_t, so that the parser could assign directly, but are there other solutions?

EDIT

After reading some more, I ended up with this:

namespace boost { namespace spirit { namespace traits {

    typedef std::basic_string<uint32_t> u32_string;

   /* template <>
    struct is_string<u32_string> : mpl::true_ {};*/

    template <> // <typename Attrib, typename T, typename Enable>
    struct assign_to_container_from_value<std::string, u32_string, void>
    {
        static void call(u32_string const& val, std::string& attr) {
            u32_to_u8_iterator<u32_string::const_iterator> begin(val.begin()), end(val.end());
            for(; begin != end; ++begin)
                attr += *begin;
        }
    };

} // namespace traits

} // namespace spirit

} // namespace boost

and this:

BOOST_NETWORK_INLINE void parse_headers(std::string const & input, std::vector<request_header_narrow> & container) {
        using namespace boost::spirit::qi;
        u8_to_u32_iterator<std::string::const_iterator> begin(input.begin()), end(input.end());
        parse(
            begin, end,
            *(
                as<boost::spirit::traits::u32_string>()[+(alnum|(punct-':'))]
                >> lit(": ")
                >> as<boost::spirit::traits::u32_string>()[+((unicode::alnum|space|punct) - '\r' - '\n')]
                >> lit("\r\n")
            )
            >> lit("\r\n")
            , container
            );
    }

Any comments or advice on whether this is the best I can get?

sehe · Accepted Answer · 2013-10-10T08:48:18.830

Another job for an attribute trait. I've simplified your datatypes for demonstration purposes:

typedef std::basic_string<uint32_t> u32_string;

struct Value 
{
    std::string value;
};

Now you can have the conversion happen "auto-magically" using:

namespace boost { namespace spirit { namespace traits {
    template <> // <typename Attrib, typename T, typename Enable>
        struct assign_to_attribute_from_value<Value, u32_string, void>
        {
            typedef u32_to_u8_iterator<u32_string::const_iterator> Conv;

            static void call(u32_string const& val, Value& attr) {
                attr.value.assign(Conv(val.begin()), Conv(val.end()));
            }
        };
}}}

Consider a sample parser that parses JSON-style strings in UTF-8, while also allowing Unicode escape sequences of the form \uXXXX (four hex digits, stored as 32-bit code points). It is convenient to have the intermediate storage be a u32_string for this purpose:

///////////////////////////////////////////////////////////////
// Parser
///////////////////////////////////////////////////////////////

namespace qi         = boost::spirit::qi;
namespace encoding   = qi::standard_wide;
//namespace encoding = qi::unicode;

template <typename It, typename Skipper = encoding::space_type>
    struct parser : qi::grammar<It, Value(), Skipper>
{
    parser() : parser::base_type(start)
    {
        string = qi::lexeme [ L'"' >> *char_ >> L'"' ];

        static qi::uint_parser<uint32_t, 16, 4, 4> _4HEXDIG;

        char_ = +(
                ~encoding::char_(L"\"\\")) [ qi::_val += qi::_1 ] |
                    qi::lit(L"\x5C") >> (                    // \ (reverse solidus)
                    qi::lit(L"\x22") [ qi::_val += L'"'  ] | // "    quotation mark  U+0022
                    qi::lit(L"\x5C") [ qi::_val += L'\\' ] | // \    reverse solidus U+005C
                    qi::lit(L"\x2F") [ qi::_val += L'/'  ] | // /    solidus         U+002F
                    qi::lit(L"\x62") [ qi::_val += L'\b' ] | // b    backspace       U+0008
                    qi::lit(L"\x66") [ qi::_val += L'\f' ] | // f    form feed       U+000C
                    qi::lit(L"\x6E") [ qi::_val += L'\n' ] | // n    line feed       U+000A
                    qi::lit(L"\x72") [ qi::_val += L'\r' ] | // r    carriage return U+000D
                    qi::lit(L"\x74") [ qi::_val += L'\t' ] | // t    tab             U+0009
                    qi::lit(L"\x75")                         // uXXXX                U+XXXX
                        >> _4HEXDIG [ qi::_val += qi::_1 ]
                );

        // entry point
        start = string;
    }

    private:
    qi::rule<It, Value(),  Skipper> start;
    qi::rule<It, u32_string()> string;
    qi::rule<It, u32_string()> char_;
};

As you can see, the start rule simply assigns the attribute value to the Value struct, which implicitly invokes our assign_to_attribute_from_value trait!
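
For the `\uXXXX` branch, the grammar relies on `qi::uint_parser<uint32_t, 16, 4, 4>`, which consumes exactly four hex digits. As a rough illustration outside Spirit, the same step can be sketched in plain C++. `parse_4hexdig` is a hypothetical helper, not part of the answer or of Boost:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: read exactly four hex digits from `s` starting at `pos`.
// On success, stores the resulting code point in `out` and returns true.
// Returns false if fewer than four characters remain or any is not a hex digit.
bool parse_4hexdig(std::string const& s, std::size_t pos, std::uint32_t& out) {
    if (pos > s.size() || s.size() - pos < 4) return false;
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        char const c = s[pos + i];
        std::uint32_t digit;
        if (c >= '0' && c <= '9')      digit = static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') digit = static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') digit = static_cast<std::uint32_t>(c - 'A' + 10);
        else return false;
        value = value * 16 + digit;  // accumulate base-16 digits
    }
    out = value;
    return true;
}
```

The resulting value is a single code point that can be appended to the UTF-32 buffer, which is why a 32-bit intermediate string is convenient here.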

A simple test program (live on Coliru) shows that it works:

#include <iostream>
#include <iterator>

// input assumed to be utf8
Value parse(std::string const& input) {
    auto first(begin(input)), last(end(input));

    // wrap the UTF-8 byte iterators so the parser sees UTF-32 code points
    typedef boost::u8_to_u32_iterator<decltype(first)> Conv2Utf32;
    Conv2Utf32 f(first), saved = f, l(last);

    static const parser<Conv2Utf32, encoding::space_type> p;

    Value parsed;
    if (!qi::phrase_parse(f, l, p, encoding::space, parsed))
    {
        // position is counted in code points, not bytes
        std::cerr << "whoops at position #" << std::distance(saved, f) << "\n";
    }

    return parsed;
}

int main()
{
    Value parsed = parse("\"Footnote: ¹ serious busineş\\u1e61\n\"");
    std::cout << parsed.value;
}

Now observe that the output is encoded in UTF-8 again:

$ ./test | tee >(file -) >(xxd)

Footnote: ¹ serious busineşṡ
/dev/stdin: UTF-8 Unicode text
0000000: 466f 6f74 6e6f 7465 3a20 c2b9 2073 6572  Footnote: .. ser
0000010: 696f 7573 2062 7573 696e 65c5 9fe1 b9a1  ious busine.....
0000020: 0a

The U+1E61 code-point has been correctly encoded as [0xE1,0xB9,0xA1].
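
To check those bytes by hand, here is a minimal sketch of the UTF-8 encoding step in plain C++. It is not how `u32_to_u8_iterator` is implemented, only an illustration of the bit layout; `encode_utf8` is a hypothetical helper, and it assumes the input is a valid code point (no surrogate or range checks):

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one code point as UTF-8.
// Assumes `cp` is a valid Unicode scalar value (<= 0x10FFFF, not a surrogate).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+1E61 this takes the three-byte branch: the top four bits give 0xE1, the middle six give 0xB9, and the low six give 0xA1, matching the hex dump above.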

Sorry for the long delay, but I need additional guidance - my parser (as in the example) always calls `assign_to_attribute_from_value` with template parameters `char` and `unsigned int` (which makes sense since), is that because of the grammar? Is there any way to force it to use `int32_t` internally and then assign a whole `u32_string` to `std::string`? I am definitely missing something. — Rudolfs Bundulis, Oct 14 '13 at 11:27
Yes, that is possible, it's what I do in my answer. For comparison, look at the `no_wide` branch of my toy JSON parser ([specifically `https://github.com/sehe/spirit-v2-json/blob/nowide/JSON.cpp#L48`](https://github.com/sehe/spirit-v2-json/blob/nowide/JSON.cpp#L48)) if you (want to) have it the other way. — sehe, Oct 14 '13 at 12:05
Yeah, I see, but another question - is this possible without actually creating a specific parser? In your answer you have made an explicit parser declaration, while in my example, the automatic parsing function is used. I'm not lazy or dumb, it is just 3rd party code and I want to understand what is the smallest set of modifications needed. Can I anyhow force the default char_parser to use the assignment structure that you have shown? Thanks for the help. — Rudolfs Bundulis, Oct 14 '13 at 13:30
Thanks again for all the info. I edited my question with the solution I arrived at, which seems to work and does not involve making a new parser. Can you give any hints as to whether that is the best I can do? — Rudolfs Bundulis, Oct 14 '13 at 16:35
I will have a look later (no time now). You could always answer your own question, so people can see that you found an answer. Later — sehe, Oct 14 '13 at 17:14
I'd suggest `static const qi::as<u32_string> as_u32string;` out-of-line so you don't complicate the grammar as much. Also, keep in mind this converts encodings twice, which is not optimal. For the rest, yes, this is how you'd apply my answer without explicit `qi::rule`s — sehe, Oct 14 '13 at 21:59
