1

I'm writing a parser which has some tokens that are concatenated from multiple smaller rules, using yymore(). If it reaches EOF before the end of this composite token, I need it to return a special error-token to the parser. This is the same problem as in this question. The answer there suggests to convert the parser to a "push parser" to solve this.

The Bison manual makes it pretty clear how to make a push parser part but I cannot find a similar instruction on how the lexer should look.

Let's take the following lexer:

%option noyywrap

%{
    #include <string.h>
    
    // Stub of the parser header file:
    #define GOOD_STRING 1000
    #define BAD_STRING 1001
    char *yylval;
%}

%x STRING

%%

\"           { BEGIN(STRING); yymore(); }
<STRING>{
    \"       { BEGIN(INITIAL); yylval = strdup(yytext); return GOOD_STRING; }
    .|\n     { yymore(); }
    <<EOF>>  { BEGIN(INITIAL); yylval = strdup(yytext); return BAD_STRING; }
}

.|\n         { return yytext[0]; }

%%

void parser_stub()
{
    int token;
    while ((token = yylex()) > 0) {
        if (token < 1000) {
            printf("%i '%c'\n", token, (char)token);
        } else {
            printf("%i \"%s\"\n", token, yylval);
            free(yylval);
        }
    }
}

int main(void)
{
    parser_stub();
}

It doesn't work as a pull-parser because it continues parsing after encountering EOF, which ends in an error: fatal flex scanner internal error--end of buffer missed. (It works if yymore() is not used but it still technically is an undefined behavior.) In the rule <<EOF>> it needs to emit 2 tokens: BAD_STRING and 0.

How do you convert a lexer into one suitable for a push-parser?

I'm guessing it involves replacing returns with something that pushes a token to the parser without ending yylex() but I haven't found a mention of such function / macro. Is this just a case of having to implement it manually, without any support built-in into Flex?

Piotr Siupa
  • 3,929
  • 2
  • 29
  • 65

1 Answers1

1

Instead of setting yylval and returning the token id, you call the push parser with the token id and semantic value as arguments. That's it. Flex provides nothing to help you write the return or the call, but that doesn't really seem like a big deal to me.

Sometimes it's convenient to use your own macro in both cases, to reduce boilerplate. I tend to do that. For push parsing, the main issue is error handling, since the push parser might return an error on any call.

By the way, even in stripped-down example code, you should never pass yytext directly to the parser. That's probably the number one cause of mysterious lexing bugs. Also, Bison assigns token numbers for you. Putting your own definitions in the lexer implementation will almost inevitably lead to grief.

Here are some examples of answers I've written over the years with push lexers:

This example uses Flex with Lemon (another push parser):

There are probably more :-)

Piotr Siupa
  • 3,929
  • 2
  • 29
  • 65
rici
  • 234,347
  • 28
  • 237
  • 341
  • I expected that answer but better to ask before implementing own solutions. My real lexer is build with the `%option c++` so adding custom "macros" to handle the boiler plate will be pretty easy. I'll just add methods to my lexer class. (and yeah, the rough handling of `yytext` and the token constants is only in the example. It's hard to balance good code vs making it as short as possible.) – Piotr Siupa Sep 11 '22 at 17:35
  • @NO_NAME: Since it's known that people copy code from SO into their incipient projects, it's always wise to avoid usages which shouldn't be copied.. `strdup(yytext)` only adds 8 keystrokes :-) – rici Sep 11 '22 at 20:56
  • They should know better than to copy from the question but ok, I did write the code works. I've fixed the problems you've mentioned and I hope I haven't introduced any memory leaks in the process. Btw, it turns up Bison doesn't support push-parsers for C++ :-( – Piotr Siupa Sep 12 '22 at 09:32
  • @NO_NAME: that's true, although it does generate them in C, D and Java. C++ has been on the TODO list for some time, marked as an easy task, but apparently no-one has come forward to do it. Still, you never know. – rici Sep 12 '22 at 18:08
  • @NO_NAME: alternatively, you should be able to invert control with C++ coroutines. See, for example, [this answer](https://stackoverflow.com/a/70803344/1566221) – rici Sep 12 '22 at 18:34
  • That's a neat trick but a little complicated and I'm using C++17. I'll opt for a simple boolean flag, as you described in [this answer](https://stackoverflow.com/a/39666639/3052438) and hope C++ support for push-parser will be added sooner rather than later. – Piotr Siupa Sep 12 '22 at 18:55