Troubles with boost::spirit::lex & whitespace

Question

I'm trying to learn boost::spirit. To that end, I wanted to create some simple lexers, combine them, and then start parsing with Spirit. I tried modifying the example, but it doesn't run as expected (the result r isn't true).

Here's the lexer:

#include <boost/spirit/include/lex_lexertl.hpp>
#include <string>

namespace lex = boost::spirit::lex;

template <typename Lexer>
struct lexer_identifier : lex::lexer<Lexer>
{
    lexer_identifier()
        : identifier("[a-zA-Z_][a-zA-Z0-9_]*")
        , white_space("[ \\t\\n]+")
    {
        using boost::spirit::lex::_start;
        using boost::spirit::lex::_end;

        this->self = identifier;
        this->self("WS") = white_space;
    }
    lex::token_def<> identifier;
    lex::token_def<> white_space;
    std::string identifier_name;
};

And this is the example I'm trying to run:

#include "stdafx.h"

#include <boost/spirit/include/lex_lexertl.hpp>
#include "my_Lexer.h"

namespace lex = boost::spirit::lex;

int _tmain(int argc, _TCHAR* argv[])
{
    typedef lex::lexertl::token<char const*,lex::omit, boost::mpl::false_> token_type;
    typedef lex::lexertl::lexer<token_type> lexer_type;

    typedef lexer_identifier<lexer_type>::iterator_type iterator_type;

    lexer_identifier<lexer_type> my_lexer;

    std::string test("adedvied das934adf dfklj_03245");

    char const* first = test.c_str();
    char const* last = &first[test.size()];

    lexer_type::iterator_type iter = my_lexer.begin(first, last);
    lexer_type::iterator_type end = my_lexer.end();

    while (iter != end && token_is_valid(*iter))
    {
        ++iter;
    }

    bool r = (iter == end);

    return 0;
}

r is true as long as there is only one token inside the string. Why is this the case?

Regards Tobias

sehe · Accepted Answer · 2012-11-13T22:49:16.507

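You have created a second lexer state ("WS"), but never invoked it. The lexer iterates in the default state (INITIAL), where only identifier is defined, so the first space matches no token and token_is_valid fails.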

Simplify and profit:

For most cases, the easiest way to have the desired effect would be to use single-state lexing with a pass_ignore flag on the skippable tokens:

    this->self += identifier
                | white_space [ lex::_pass = lex::pass_flags::pass_ignore ];

Note that this requires an actor_lexer to allow for the semantic action:

typedef lex::lexertl::actor_lexer<token_type> lexer_type;

Full sample:

#include <boost/spirit/include/lex_lexertl.hpp>
#include <iostream>
#include <string>
namespace lex = boost::spirit::lex;

template <typename Lexer>
struct lexer_identifier : lex::lexer<Lexer>
{
    lexer_identifier()
        : identifier("[a-zA-Z_][a-zA-Z0-9_]*")
        , white_space("[ \\t\\n]+")
    {
        using boost::spirit::lex::_start;
        using boost::spirit::lex::_end;

        this->self += identifier
                    | white_space [ lex::_pass = lex::pass_flags::pass_ignore ];
    }
    lex::token_def<> identifier;
    lex::token_def<> white_space;
    std::string identifier_name;
};

int main(int argc, const char *argv[])
{
    typedef lex::lexertl::token<char const*,lex::omit, boost::mpl::false_> token_type;
    typedef lex::lexertl::actor_lexer<token_type> lexer_type;

    typedef lexer_identifier<lexer_type>::iterator_type iterator_type;

    lexer_identifier<lexer_type> my_lexer;

    std::string test("adedvied das934adf dfklj_03245");

    char const* first = test.c_str();
    char const* last = &first[test.size()];

    lexer_type::iterator_type iter = my_lexer.begin(first, last);
    lexer_type::iterator_type end = my_lexer.end();

    while (iter != end && token_is_valid(*iter))
    {
        ++iter;
    }

    bool r = (iter == end);
    std::cout << std::boolalpha << r << "\n";
}

Prints

true

"WS" as a Skipper state

It is also possible that you came across a sample that uses the second lexer state for the skipper (lex::tokenize_and_phrase_parse). Let me take a minute or 10 to create a working sample for that.

Update: It took me a bit more than 10 minutes (waaaah) :) Here's a comparative test showing how the lexer states interact, and how to use Spirit skipper parsing to invoke the second lexer state:

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <iostream>
#include <string>
namespace lex = boost::spirit::lex;
namespace qi  = boost::spirit::qi;

template <typename Lexer>
struct lexer_identifier : lex::lexer<Lexer>
{
    lexer_identifier()
        : identifier("[a-zA-Z_][a-zA-Z0-9_]*")
        , white_space("[ \\t\\n]+")
    {
        this->self       = identifier;
        this->self("WS") = white_space;
    }
    lex::token_def<> identifier;
    lex::token_def<lex::omit> white_space;
};

int main()
{
    typedef lex::lexertl::token<char const*, lex::omit, boost::mpl::true_> token_type;
    typedef lex::lexertl::lexer<token_type> lexer_type;

    typedef lexer_identifier<lexer_type>::iterator_type iterator_type;

    lexer_identifier<lexer_type> my_lexer;

    std::string test("adedvied das934adf dfklj_03245");

    {
        char const* first = test.c_str();
        char const* last = &first[test.size()];

        // cannot lex in just the WS state:
        bool ok = lex::tokenize(first, last, my_lexer, "WS");
        std::cout << "Starting state WS:\t" << std::boolalpha << ok << "\n";
    }

    {
        char const* first = test.c_str();
        char const* last = &first[test.size()];

        // cannot lex in just default state either:
        bool ok = lex::tokenize(first, last, my_lexer, "INITIAL");
        std::cout << "Starting state INITIAL:\t" << std::boolalpha << ok << "\n";
    }

    {
        char const* first = test.c_str();
        char const* last = &first[test.size()];

        bool ok = lex::tokenize_and_phrase_parse(first, last, my_lexer, *my_lexer.self, qi::in_state("WS")[my_lexer.self]);
        ok = ok && (first == last); // verify full input consumed
        std::cout << std::boolalpha << ok << "\n";
    }
}

The output is

Starting state WS:  false
Starting state INITIAL: false
true

Added the "WS" state approach with demo under **`"WS" as a Skipper state`**. Cheers — sehe, Nov 13 '12 at 20:03
Oops. I copied the wrong token_type declaration. It needs `mpl::true_` for [`HasState`](http://www.boost.org/doc/libs/1_49_0/libs/spirit/doc/html/spirit/lex/abstracts/lexer_primitives/lexer_token_values.html#spirit.lex.abstracts.lexer_primitives.lexer_token_values.the_anatomy_of_a_token) when dealing with stateful lexers -- obviously! ***Fixed*** — sehe, Nov 13 '12 at 22:51
First of all, thank you for your extensive example. I still have some questions though: what does lex::omit do? And regarding the tokenize_and_phrase_parse call: what are my_lexer.self and qi::in_state("WS")[my_lexer.self]? — Tobias Langner, Nov 14 '12 at 07:45
`my_lexer.self` is all tokens for the default lexer state (INITIAL) and `in_state("WS")[my_lexer.self]` means all tokens for the WS lexer state. Those were defined by _you_. The first expression is passed as the parser expression (simply: match any number of tokens) and the second is passed as the skipper (simply: skip any whitespace). — sehe, Nov 14 '12 at 08:10
The second sample in the Lex quickstart docs explains: ["Specifying omit as the token attribute type generates a token class not holding any token attribute at all (not even the iterator range of the matched input sequence), therefore optimizing the token"](http://www.boost.org/doc/libs/1_52_0/libs/spirit/doc/html/spirit/lex/tutorials/lexer_quickstart2.html#spirit.lex.tutorials.lexer_quickstart2.c1). There is more information there — sehe, Nov 14 '12 at 08:11
thank you again. Just for my understanding - if I call this->self("ID_BLA") = bla_token; then it would add a new lexer state called ID_BLA? — Tobias Langner, Nov 14 '12 at 10:40
@TobiasLangner Indeed. It turns out the documentation is a little thin on Lex, I suppose I found it in a sample (?) — sehe, Nov 14 '12 at 10:51
