How do you interface C++ flex with C++ Bison?

Question

I'm trying to interface a C++ Flex scanner with a C++ Bison parser and I'm stumped. The Bison 3.8.1 manual has an example of a C++ Bison parser with a C Flex scanner. The Flex 2.6.4 manual has no C++ example. The issue I'm trying to address is how to provide a pointer to a C++ Flex object to a C++ (or C) Bison parser. The best idea I have is to use YY_DECL to declare the scanner function, as in # define YY_DECL bison::symbol_type flx->yylex(), and to pass flx into Bison through the parser calling sequence, redefining 'parse'. Is this right, and is there a better way?

So, do you want C or C++ Bison? This makes a difference because, if I remember right, the C++ parser is always reentrant. - Piotr Siupa, Jun 20 '23 at 08:33
@PiotrSiupa I would like to interface a Bison-generated C++ file with a Flex-generated C++ file. The Flex and Bison object files should work with each other. I'm having difficulty with both scripts at the moment. The Flex manual is some 9 years old and does not describe the C++ options (in FlexLexer.h) or the other things necessary for the interface. Bison is similar, and their example is for a Flex-generated C lexer. One issue seen today is that the Flex C++ lexer references a Bison function as a static function. - lostbits, Jun 20 '23 at 22:18

Piotr Siupa · Answer 1 · 2023-07-04T05:42:23.743

While switching Flex and Bison to C++ is as easy as adding flags %option c++ and %language "c++" respectively, in both cases this causes the resulting code to become re-entrant, which, as you've noticed, interferes with interoperability between those two.

By default, in C, both Flex and Bison store their state in global variables. In C++, they are object-oriented instead: Flex has a class yyFlexLexer and Bison has a class yy::parser. This is a more natural approach in this language, and additionally it lets you run the parser multiple times by creating new objects of these classes. You can even run multiple parsers at once in a multi-threaded program.

There is a catch, however. While both lexer and parser are now C++ and re-entrant, each still assumes that its counterpart is default, non-re-entrant code. Because of that, they try to access global state variables that no longer exist. Fixing this requires some tinkering.

A minimal example

A complete example that can be copy-pasted as the base of a new program will be more useful than just an explanation.

Let's start with a minimal example that just shows how to make C++ Flex and Bison communicate. We'll write a short Flex-Bison program that expects input in format Hello X! and prints back Goodbye X!.

fooLexer.ll:

%{
    #include "FooLexer.hh"
    #include "fooParser.tab.hh"
    
    #undef  YY_DECL
    #define YY_DECL int FooLexer::yylex(std::string *const yylval)
%}

%option c++ noyywrap

%option yyclass="FooLexer"

%%

[[:space:]] ;
Hello { return yy::parser::token::HELLO; }
[[:alpha:]]+ { *yylval = std::string(yytext, yytext + yyleng); return yy::parser::token::WORLD; }
. { return yytext[0]; }

FooLexer.hh:

#pragma once

#include <string>
#if ! defined(yyFlexLexerOnce)
#include <FlexLexer.h>
#endif

class FooLexer : public yyFlexLexer
{
public:
    int yylex(std::string *const yylval);
};

These two files are our lexer. Instead of using the default lexer class, we define our own which inherits from it. We do it because the default implementation doesn't take arguments to the function yylex and we need one to pass yylval into it.

Let's break down the most interesting lines:

#undef YY_DECL - C++ Flex still makes heavy use of macros. YY_DECL stores the declaration of the function yylex that it will generate. We remove the default value, which is int FooLexer::yylex().
#define YY_DECL int FooLexer::yylex(std::string *const yylval) - Now, we replace the removed value with the function declaration we need.
%option c++ - We switch the output language to C++.
%option yyclass="FooLexer" - Finally, we set which class should be used by lexer instead of the yyFlexLexer. It will create the method yylex in this class.
#include <FlexLexer.h> - Unlike the C code, C++ code generated by Flex requires an external header FlexLexer.h. It should be installed in your system along with Flex.
#if ! defined(yyFlexLexerOnce) & #endif - We use the Flex mechanism of ensuring that the header <FlexLexer.h> is added only once. (This is a slightly non-standard solution, but it allows us to include the header multiple times if there is a need for that.)
int yylex(std::string *const yylval); - We do declare the function but the definition is provided by Flex.
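
The YY_DECL mechanism may be easier to follow outside of Flex. The sketch below is not Flex-generated code. SCANNER_DECL, BaseScanner and WordScanner are hypothetical names that only imitate the pattern: the "generated" function body is written behind a macro, and redefining that macro beforehand changes both the signature and the class the function ends up in.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for yyFlexLexer: owns the input and the read position.
class BaseScanner
{
protected:
    std::string input;
    std::size_t pos = 0;

public:
    explicit BaseScanner(std::string text) : input(std::move(text)) {}
    virtual ~BaseScanner() = default;
};

// Hypothetical stand-in for FooLexer: declares the scanning method
// with the extra argument, but does not define it.
class WordScanner : public BaseScanner
{
public:
    using BaseScanner::BaseScanner;
    int next(std::string *const value);
};

// The "generated" part below only ever writes SCANNER_DECL { ... }.
// Redefining the macro first is what turns that body into WordScanner::next.
#undef  SCANNER_DECL
#define SCANNER_DECL int WordScanner::next(std::string *const value)

// Returns 1 and stores the next word in *value, or returns 0 at end of input.
SCANNER_DECL
{
    while (pos < input.size() && std::isspace(static_cast<unsigned char>(input[pos])))
        ++pos;
    const std::size_t start = pos;
    while (pos < input.size() && !std::isspace(static_cast<unsigned char>(input[pos])))
        ++pos;
    if (start == pos)
        return 0;
    *value = input.substr(start, pos - start);
    return 1;
}
```

The class only declares next and the macro-driven body supplies the definition, which mirrors how FooLexer.hh declares yylex while Flex provides its body.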

fooParser.yy:

%require "3.2"
%language "c++"

%code requires {
    #include <string>
    #include "FooLexer.hh"
}

%define api.value.type {std::string}

%parse-param {FooLexer &lexer}

%header

%code {
    #define yylex lexer.yylex
}

%token HELLO
%token WORLD

%%

hello_world: HELLO WORLD '!' { std::cout << "Goodbye " << $WORLD << '!' << std::endl; }

%%

void yy::parser::error(const std::string &message)
{
    std::cerr << "Error: " << message << std::endl;
}

In the case of the parser, we do not create our own class. Bison is a little smarter about this, which makes adjusting the code much simpler. For example, it correctly guesses that it should take yylval as an argument, so we don't need to worry about that.

Still, there are a few notable changes:

%require "3.2" - This directive not only makes sure the installed version of Bison supports C++. It also prevents creation of a redundant result file stack.hh.
%language "c++" - We switch the output language to C++.
%parse-param {FooLexer &lexer} - This directive adds an additional argument to the constructor of parser class. We use it to pass a lexer to the parser.
#define yylex lexer.yylex - Parser still assumes that yylex is a global function. We use preprocessor to change that to a method of the lexer we're passing to the constructor.
void yy::parser::error(const std::string &message) - We no longer need to declare the error handler at the beginning of the file. However, we still need to define it. The definition points now to a namespace yy and class parser which is the default location of the parser class.
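
To see what %parse-param and #define yylex lexer.yylex amount to together, here is a hand-written analogue that needs neither Flex nor Bison. It is only a sketch: MiniLexer, MiniParser, parseToString and the token numbers are hypothetical, and the real generated parser is table-driven rather than a chain of if statements.

```cpp
#include <cctype>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical token numbers, chosen above 255 so they cannot collide with character tokens.
enum Token { END = 0, HELLO = 258, WORLD = 259 };

// Hypothetical hand-written lexer with the same yylex signature as FooLexer.
class MiniLexer
{
    std::string input;
    std::size_t pos = 0;

public:
    explicit MiniLexer(std::string text) : input(std::move(text)) {}

    int yylex(std::string *const yylval)
    {
        while (pos < input.size() && std::isspace(static_cast<unsigned char>(input[pos])))
            ++pos;
        if (pos == input.size())
            return END;
        if (!std::isalpha(static_cast<unsigned char>(input[pos])))
            return input[pos++];  // single-character token, like the "." rule
        const std::size_t start = pos;
        while (pos < input.size() && std::isalpha(static_cast<unsigned char>(input[pos])))
            ++pos;
        *yylval = input.substr(start, pos - start);
        return *yylval == "Hello" ? HELLO : WORLD;
    }
};

// From here on, a bare call yylex(...) is rewritten to lexer.yylex(...).
#define yylex lexer.yylex

// Hypothetical stand-in for yy::parser; the reference plays the role of %parse-param.
class MiniParser
{
    MiniLexer &lexer;

public:
    explicit MiniParser(MiniLexer &lexer) : lexer(lexer) {}

    // Accepts exactly HELLO WORLD '!'. Returns 0 on success, 1 on a syntax error.
    int parse(std::ostream &out)
    {
        std::string value;
        if (yylex(&value) != HELLO) return 1;
        if (yylex(&value) != WORLD) return 1;
        const std::string name = value;
        if (yylex(&value) != '!') return 1;
        if (yylex(&value) != END) return 1;
        out << "Goodbye " << name << "!\n";
        return 0;
    }
};

#undef yylex

// Convenience wrapper: returns what the parser printed, or an empty string on error.
inline std::string parseToString(const std::string &text)
{
    MiniLexer lexer(text);
    MiniParser parser(lexer);
    std::ostringstream out;
    return parser.parse(out) == 0 ? out.str() : std::string();
}
```

The macro works only because the parser object has a member that is literally named lexer, which is why the name given in %parse-param and the name used in the #define have to match.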

main.cc:

#include "FooLexer.hh"
#include "fooParser.tab.hh"

int main()
{
    FooLexer lexer;
    yy::parser parser(lexer);
    return parser();
}

Now we just need to create objects of the lexer and parser classes and we are ready. The parser class is a functor, so we can simply call it.

Bonus - makefile:

.RECIPEPREFIX = >

prog: main.o fooParser.tab.o lex.yy.o
> g++ $^ -o $@

main.o: main.cc FooLexer.hh fooParser.tab.hh
> g++ -c $< -o $@

lex.yy.o: lex.yy.cc FooLexer.hh fooParser.tab.hh
> g++ -c $< -o $@

fooParser.tab.o: fooParser.tab.cc FooLexer.hh
> g++ -c $< -o $@

lex.yy.cc: fooLexer.ll
> flex $<

fooParser.tab.hh fooParser.tab.cc fooParser.output: fooParser.yy
> bison $<

.PHONY: clean
clean:
> rm -f prog main.o lex.* fooParser.tab.* stack.hh

An extended example

Let's expand on this example to, on the one hand, see how to add or modify various aspects of a C++ parser and, on the other hand, turn it into code that is ready to use in a real application.

Currently, the lexer and parser are in different namespaces, so we will put both of them into the same one (foo). We will also change their names to ones we choose. (This includes the name of the original lexer class too, for technical reasons which are explained later.)

We will modify the constructor of the lexer to be able to pass a file to it, instead of reading stdin.

We will add locations to our parser, to track input line numbers and give more meaningful error messages.

We will also add the ability to print a debug log, to aid in writing complex parsers.

Finally, we will enable a few useful miscellaneous options and add some helper functions.

location_t.hh:

#pragma once

#include <cstddef>
#include <ostream>
#include <utility>

namespace foo
{
    using position_t = std::size_t;
    using location_t = std::pair<std::size_t, std::size_t>;
}

inline std::ostream& operator<<(std::ostream& os, const foo::location_t& loc)
{
    return os << "[" << loc.first << "-" << loc.second << "]";
}

To enable tracking of token locations in Bison, we can either use the default implementation of a location class or create our own. I find the default implementation a little lacking, so we've taken the second option.

Bison names the location-related types as follows:

"position" - a specific point in a file (default Bison implementation),
"location" - location of a token defined by its start and end position (default Bison implementation).

For consistency, we've used the same convention in our implementation.

This is a very simple implementation, where the position is just a single integer storing a line number. In a real program, I recommend tracking at least the line number and column, and maybe even an absolute position in the file.

We've also added an operator<< for our location. It is useful in general but in our case it is strictly required because Bison uses it in the debug logs (which we will enable).
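
As a rough idea of what tracking both line and column could look like, here is a sketch of a richer location type. Everything in it is hypothetical (the namespace sketch, the struct layout, the helpers advance and describe), and it is not wired into the lexer and parser shown in this answer.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

namespace sketch
{
    // A point in the input: 1-based line and column.
    struct position_t
    {
        std::size_t line = 1;
        std::size_t column = 1;
    };

    // A span of input: begin is the first character, end is one past the last one.
    struct location_t
    {
        position_t begin;
        position_t end;
    };

    // Moves a position past the given text, starting a new line on every '\n'.
    inline position_t advance(position_t pos, const std::string &text)
    {
        for (const char c : text)
        {
            if (c == '\n')
            {
                ++pos.line;
                pos.column = 1;
            }
            else
            {
                ++pos.column;
            }
        }
        return pos;
    }

    inline std::ostream &operator<<(std::ostream &os, const position_t &pos)
    {
        return os << pos.line << ':' << pos.column;
    }

    inline std::ostream &operator<<(std::ostream &os, const location_t &loc)
    {
        return os << '[' << loc.begin << '-' << loc.end << ']';
    }

    // Formats a location as text, e.g. for building an error message.
    inline std::string describe(const location_t &loc)
    {
        std::ostringstream os;
        os << loc;
        return os.str();
    }
}
```

Because these types live in the same namespace as their operator<<, argument-dependent lookup finds the operators from any namespace, which is convenient when generated code prints locations.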

fooLexer.ll:

%{
    #include "FooLexer.hh"
    #include "fooParser.tab.hh"
    
    using namespace foo;
    
    #undef  YY_DECL
    #define YY_DECL int FooLexer::yylex(std::string *const lval, location_t *const lloc)
    
    #define YY_USER_INIT yylval = lval; yylloc = lloc;
    
    #define YY_USER_ACTION copyLocation();
%}

%option c++ noyywrap debug

%option yyclass="FooLexer"
%option prefix="yy_foo_"

%%

%{
    using Token = FooBisonParser::token;
%}

\n { ++currentLine; }
[[:space:]] ;
Hello { return Token::HELLO; }
[[:alpha:]]+ { copyValue(); return Token::WORLD; }
. { return yytext[0]; }

FooLexer.hh:

#pragma once

#include <string>
#if ! defined(yyFlexLexerOnce)
#define yyFlexLexer yy_foo_FlexLexer
#include <FlexLexer.h>
#undef yyFlexLexer
#endif
#include "location_t.hh"

namespace foo
{
    class FooLexer : public yy_foo_FlexLexer
    {
        std::size_t currentLine = 1;
        
        std::string *yylval = nullptr;
        location_t *yylloc = nullptr;
        
        void copyValue(const std::size_t leftTrim = 0, const std::size_t rightTrim = 0, const bool trimCr = false);
        void copyLocation() { *yylloc = location_t(currentLine, currentLine); }
        
    public:
        FooLexer(std::istream &in, const bool debug) : yy_foo_FlexLexer(&in) { yy_foo_FlexLexer::set_debug(debug); }
        
        int yylex(std::string *const lval, location_t *const lloc);
    };
    
    inline void FooLexer::copyValue(const std::size_t leftTrim, const std::size_t rightTrim, const bool trimCr)
    {
        std::size_t endPos = yyleng - rightTrim;
        if (trimCr && endPos != 0 && yytext[endPos - 1] == '\r')
            --endPos;
        *yylval = std::string(yytext + leftTrim, yytext + endPos);
    }
}

There are a lot of changes in our lexer. Most of them enable locations, a few adjust namespaces and names, and the rest are just for our future convenience:

using namespace foo; - We cannot put the entire code of the lexer into a namespace, so this is the next best option. (This is considered a bad practice but I think in this particular case it is rather harmless.)
#define YY_DECL int FooLexer::yylex(std::string *const lval, location_t *const lloc) - We've added an argument lloc to yylex, which receives the location pointer passed by the parser. The value argument is now named lval, because yylval has become a field of the class.
#define YY_USER_INIT yylval = lval; yylloc = lloc; - We cannot write our own implementation of yylex, but YY_USER_INIT lets us insert some additional code at the beginning of the default implementation. Flex runs this code once, before the first scan, so it relies on the parser passing the same pointers on every call. We've used it to save the function arguments into fields of our object. This will let us easily access them from other methods.
#define YY_USER_ACTION copyLocation(); - YY_USER_ACTION is inserted in front of every action in the lexer. We've used it to copy the location of each token into yylloc.
%option prefix="yy_foo_" - We've changed the default prefix yy used by Flex to yy_foo_. Effectively, this will change the name of the internal lexer class (the one we inherit from) to yy_foo_FlexLexer. This is necessary if we need more than one lexer in our program. In that case, each lexer needs a different prefix in order to avoid name collisions.
using Token = FooBisonParser::token; - This just lets us write Token in action instead of the full FooBisonParser::token.
\n { ++currentLine; } - We still don't emit tokens on any whitespaces but we need to increase our internal line counter every time we encounter a line break.
#define yyFlexLexer yy_foo_FlexLexer & #undef yyFlexLexer - Not all the code of the lexer is generated. We are also including the header file that has no idea that we've changed the lexer prefix. This trick fixes that problem. (If you have multiple lexers, you need to include this header multiple times, with different #defines.)
std::size_t currentLine = 1; - An internal field that we use to track the current line number for yylloc.
std::string *yylval = nullptr; & location_t *yylloc = nullptr; - Fields with copies of the pointers passed by the parser to yylex. They are here for easier access to these pointers in other methods of the class.
void copyValue(const std::size_t leftTrim = 0, const std::size_t rightTrim = 0, const bool trimCr = false); - A convenience method that lets us easily copy the current contents of yytext into yylval. We can use it in actions. I found that the option to cut off a few characters from the beginning and the end of the string is very useful, for example when we have matched a string literal and only want to copy its contents, without the quotes. An option to remove a trailing '\r' also has its uses.
void copyLocation() - A convenient method to save the location of the current token into yylloc. It will become more complicated if there are multiline tokens in the grammar.
FooLexer(std::istream &in, const bool debug) : yy_foo_FlexLexer(&in) { yy_foo_FlexLexer::set_debug(debug); } - We've added more arguments to the constructor, which let us choose the input source, as well as turn on debug logs in the lexer.
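
The trimming done by copyValue can be tried out in isolation. The free function below is a hypothetical stand-alone variant (trimmedCopy is not part of the lexer above), where text and length play the role of yytext and yyleng. As an extra assumption of this sketch, it clamps oversized trim values instead of trusting the caller.

```cpp
#include <cstddef>
#include <string>

// Returns text[leftTrim, length - rightTrim), optionally dropping one trailing '\r'.
// Unlike the member function above, trims larger than the text yield an empty
// string instead of an invalid range.
inline std::string trimmedCopy(const char *text, const std::size_t length,
                               const std::size_t leftTrim = 0,
                               const std::size_t rightTrim = 0,
                               const bool trimCr = false)
{
    std::size_t endPos = rightTrim < length ? length - rightTrim : 0;
    if (trimCr && endPos != 0 && text[endPos - 1] == '\r')
        --endPos;
    const std::size_t beginPos = leftTrim < endPos ? leftTrim : endPos;
    return std::string(text + beginPos, text + endPos);
}
```

For a matched string literal, an action would pass 1 for both trims, which drops the opening and closing quote and keeps everything in between.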

fooParser.yy:

%require "3.2"
%language "c++"

%code requires {
    #include <string>
    #include "location_t.hh"
    #include "FooLexer.hh"
}

%define api.namespace {foo}
%define api.parser.class {FooBisonParser}
%define api.value.type {std::string}
%define api.location.type {location_t}

%locations
%define parse.error detailed
%define parse.trace

%header
%verbose

%parse-param {FooLexer &lexer}
%parse-param {const bool debug}

%initial-action
{
    #if YYDEBUG != 0
        set_debug_level(debug);
    #endif
};

%code {
    namespace foo
    {
        template<typename RHS>
        void calcLocation(location_t &current, const RHS &rhs, const std::size_t n);
    }
    
    #define YYLLOC_DEFAULT(Cur, Rhs, N) calcLocation(Cur, Rhs, N)
    #define yylex lexer.yylex
}

%token HELLO
%token WORLD

%expect 0

%%

hello_world: HELLO WORLD '!' { std::cout << "Goodbye " << $WORLD << '!' << std::endl; }

%%

namespace foo
{
    template<typename RHS>
    inline void calcLocation(location_t &current, const RHS &rhs, const std::size_t n)
    {
        current = location_t(YYRHSLOC(rhs, 1).first, YYRHSLOC(rhs, n).second);
    }
    
    void FooBisonParser::error(const location_t &location, const std::string &message)
    {
        std::cerr << "Error at lines " << location << ": " << message << std::endl;
    }
}

Bison's interface is a little more user-friendly than Flex's when it comes to the changes we're about to make, but adding custom locations still requires a significant amount of code.

%define api.namespace {foo} - We've instructed Bison to put all its code into a namespace foo instead of the default yy.
%define api.parser.class {FooBisonParser} - We've instructed Bison to name its parser class FooBisonParser instead of the default parser.
%define api.location.type {location_t} - We've instructed Bison to use our location type instead of the default one.
%locations - We've instructed Bison to generate the code required to handle locations. This causes declarations of a few methods to get an additional parameter - the location. (This includes yylex.) We will also need to write a new function that calculates the location of a token that is composed of multiple smaller tokens.
%define parse.error detailed - We've instructed Bison to generate more detailed error messages than just "syntax error".
%define parse.trace - We've instructed Bison to generate code that can print debug log during execution.
%verbose - We've instructed Bison to generate an additional output file fooParser.output which contains a human-readable description of the generated state machine. It is very useful as a reference for interpreting the debug log.
%parse-param {const bool debug} - We've added an additional parameter to the parser's constructor.
set_debug_level(debug); - We've used the value of the new constructor parameter, inside %initial-action, to decide whether to print debug logs.
#if YYDEBUG != 0 & #endif - This is an additional fail-safe that allows compilation if there is no %define parse.trace, in which case YYDEBUG is 0.
void calcLocation(location_t &current, const RHS &rhs, const std::size_t n); - This is a function that will get locations of all sub-tokens of a bigger token and it will calculate its location. In our case, we just take the start position of the first token and the end position of the last one.
#define YYLLOC_DEFAULT(Cur, Rhs, N) calcLocation(Cur, Rhs, N) - We've instructed Bison to use our function for calculating locations.
%expect 0 - This line makes sure there are no conflicts in the grammar. It is useful for keeping track of how many conflicts we already know of and allow.
void FooBisonParser::error(const location_t &location, const std::string &message) - The function that prints error messages is now required to also take the location of the error.
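
One case calcLocation does not cover is a rule with an empty right-hand side, where n is 0 and there is no first symbol to read. As far as I can tell from the Bison manual's description of YYLLOC_DEFAULT, YYRHSLOC(rhs, 0) then refers to the symbol just before the reduction. The sketch below models that behaviour without Bison. The function mergeLocations and the use of a std::vector for rhs are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using location_t = std::pair<std::size_t, std::size_t>;

// rhs[0] models the location of the symbol just before the rule,
// rhs[1] .. rhs[n] model the locations of the rule's own symbols.
inline location_t mergeLocations(const std::vector<location_t> &rhs, const std::size_t n)
{
    // Empty rule: collapse to the end of the preceding symbol.
    if (n == 0)
        return location_t(rhs[0].second, rhs[0].second);
    // Otherwise span from the start of the first symbol to the end of the last one.
    return location_t(rhs[1].first, rhs[n].second);
}
```

The grammar in this answer has no empty rules, so the simpler calcLocation is sufficient here. The extra branch only matters once optional or empty productions are added.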

main.cc:

#include <cstring>
#include <iostream>
#include "FooLexer.hh"
#include "fooParser.tab.hh"

int main(int argc, char* argv[])
{
    const bool debug = argc > 1 && std::strcmp(argv[1], "--debug") == 0;
    foo::FooLexer lexer(std::cin, debug);
    foo::FooBisonParser parser(lexer, debug);
    return parser();
}

The main change in our main function is that it checks whether the program was called with the flag --debug and passes this information to the lexer and parser.

We also explicitly pass std::cin as the lexer's input. This doesn't change anything in comparison to the previous example, but we can easily swap it for an std::istream that reads a file or even some internal stream in the program.
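
Choosing between std::cin and a file could look roughly like the sketch below. The names Options, parseOptions and withInput are hypothetical, and the command-line convention ("--debug" plus an optional file name) is an assumption that goes slightly beyond what main.cc above does.

```cpp
#include <cstring>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

struct Options
{
    bool debug = false;
    std::string inputPath;  // empty means "use the fallback stream"
};

// Parses "[--debug] [file]"; any argument other than --debug is taken as the input path.
inline Options parseOptions(const int argc, const char *const argv[])
{
    Options options;
    for (int i = 1; i < argc; ++i)
    {
        if (std::strcmp(argv[i], "--debug") == 0)
            options.debug = true;
        else
            options.inputPath = argv[i];
    }
    return options;
}

// Calls consume(stream) on the selected input.
// Returns false, without calling consume, if the file cannot be opened.
template<typename Consumer>
bool withInput(const Options &options, std::istream &fallback, Consumer consume)
{
    if (options.inputPath.empty())
    {
        consume(fallback);
        return true;
    }
    std::ifstream file(options.inputPath);
    if (!file)
        return false;
    consume(file);
    return true;
}
```

In main, the fallback would be std::cin and the consumer would be a lambda that constructs foo::FooLexer and foo::FooBisonParser from the stream it receives and then runs the parser. Passing a std::istringstream instead is a handy way to feed the lexer from a test.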

Bonus - makefile:

.RECIPEPREFIX = >

prog: main.o fooParser.tab.o lex.yy_foo_.o
> g++ $^ -o $@

main.o: main.cc FooLexer.hh fooParser.tab.hh location_t.hh
> g++ -c $< -o $@

lex.yy_foo_.o: lex.yy_foo_.cc FooLexer.hh fooParser.tab.hh location_t.hh
> g++ -c $< -o $@

fooParser.tab.o: fooParser.tab.cc FooLexer.hh location_t.hh
> g++ -c $< -o $@

lex.yy_foo_.cc: fooLexer.ll
> flex $<

fooParser.tab.hh fooParser.tab.cc fooParser.output: fooParser.yy
> bison $<

.PHONY: clean
clean:
> rm -f prog main.o lex.* fooParser.tab.* fooParser.output
