How can I disable the maximal munch rule in Lex?

Question

Suppose I want to handle certain patterns and pass the rest of the text (VHDL code) through to the output file unchanged.

For that purpose I would need to write a catch-all rule at the end, like this:

(MY_PATTERN)   {
    // do something with my pattern
}

(.*)   {
    return TOK_VHDL_CODE;
}

The problem with this strategy is that MY_PATTERN becomes useless: by the maximal munch rule, .* produces the longer match and swallows the text that MY_PATTERN should have matched.

So how can I get this functionality?

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

In theory, it's possible to find a regular expression which will match a string not containing a pattern, but except in the case of very simple patterns, it is neither easy nor legible.
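To give a feel for how quickly this gets unreadable, here is a minimal sketch for about the simplest case there is: strings that do not contain the two-character sequence "ab". The function name `lacks_ab` and the choice of POSIX `regcomp`/`regexec` are my own assumptions for illustration; in a lex file you would write the same expression without the `^` and `$` anchors.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if s does NOT contain "ab".
 * The pattern reads: any mix of non-'a' characters and runs of 'a'
 * followed by something other than 'a' or 'b', then an optional
 * trailing run of 'a'. */
bool lacks_ab(const char *s) {
    regex_t re;
    int rc = regcomp(&re, "^([^a]|a+[^ab])*a*$", REG_EXTENDED | REG_NOSUB);
    assert(rc == 0);  /* the pattern is a fixed, valid ERE */
    bool ok = regexec(&re, s, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

Even for a two-character pattern the expression has to spell out every way an 'a' can be followed by something harmless, which is why this approach does not scale to real patterns.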

If all you want to do is search for (and react to) specific patterns, you could use a default rule which matches one character and does nothing:

{Pattern1}   { /* Do something with the pattern */ }
{Pattern2}   { /* Do something with the pattern */ }
.|\n         /* Default rule does nothing */

If, on the other hand, you want to do something with the not-otherwise-matched strings (as in your example), you'll need the default rule to accumulate the strings, and the pattern rules to "send" (return) the accumulated token before acting on the token which they matched. That means that some actions will need to send two tokens, which is a bit awkward with the standard "parser calls scanner for a token" architecture, because it requires the scanner to maintain some state.

If you have a not-too-ancient version of bison, you could use a "push parser" instead, which allows the scanner to call the parser. That makes it easy to send two tokens in a single action. Otherwise, you need to build a kind of state machine into your scanner.
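For the pull-style alternative, the state in question can be as small as a single "pending" flag. The following is a hand-written sketch in plain C rather than generated lex code, and it assumes the interesting pattern is one fixed string; `struct scanner`, `next_token` and the token names are hypothetical and only illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum token { TOK_EOF, TOK_TEXT, TOK_KEYWORD };

struct scanner {
    const char *cur;      /* next unread input character */
    const char *keyword;  /* the one "interesting" pattern (a fixed string here) */
    int pending;          /* nonzero: a keyword token is waiting to be returned */
};

/* Returns the next token; *start and *len describe its text. */
enum token next_token(struct scanner *s, const char **start, size_t *len) {
    size_t klen = strlen(s->keyword);
    assert(klen > 0);
    if (s->pending) {               /* second half of a two-token step */
        s->pending = 0;
        *start = s->cur;
        *len = klen;
        s->cur += klen;
        return TOK_KEYWORD;
    }
    *start = s->cur;
    /* Accumulate one character at a time until the keyword or end of input. */
    while (*s->cur != '\0' && strncmp(s->cur, s->keyword, klen) != 0)
        s->cur++;
    *len = (size_t)(s->cur - *start);
    if (*s->cur != '\0') {          /* stopped in front of a keyword */
        if (*len > 0) {             /* send the text now, the keyword next call */
            s->pending = 1;
            return TOK_TEXT;
        }
        *len = klen;
        s->cur += klen;
        return TOK_KEYWORD;
    }
    return *len > 0 ? TOK_TEXT : TOK_EOF;
}
```

The parser keeps calling `next_token` as usual; the flag is what lets one "match" turn into two successive tokens. In a real lex scanner the same flag would live in a static or in the scanner's extra data.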

Below is a simple example (which needs pattern definitions, among other things) using a push-parser.

%{
  #include <stdlib.h>
  #include <string.h>
  #include "parser.tab.h"
  /* Since the lexer calls the parser and we call the lexer,
   * we pass through a parser (state) to the lexer. This is
   * how you change the `yylex` prototype:
   */
  #define YY_DECL static int yylex(yypstate* parser)
%}

pattern1   ...
pattern2   ...

/* Standard "avoid warnings" options */
%option noyywrap noinput nounput nodefault

%%
  /* Indented code before the first pattern is inserted at the beginning
   * of yylex, perfect for local variables.
   */
  size_t vhdl_length = 0;
  /* These are macros because they do out-of-sequence return on error. */
  /* If you don't like macros, please accept my apologies for the offense. */
  #define SEND_(toke, leng) do { \
    size_t leng_ = leng; \
    char* text = memmove(malloc(leng_ + 1), yytext, leng_); \
    text[leng_] = 0; \
    int status = yypush_parse(parser, toke, &text); \
    if (status != YYPUSH_MORE) return status; \
  } while(0);
  #define SEND_TOKEN(toke) SEND_(toke, yyleng)
  #define SEND_TEXT do if(vhdl_length){ \
    SEND_(TEXT, vhdl_length); \
    yytext += vhdl_length; yyleng -= vhdl_length; vhdl_length = 0; \
  } while(0);

{pattern1}   { SEND_TEXT; SEND_TOKEN(TOK_1); }
{pattern2}   { SEND_TEXT; SEND_TOKEN(TOK_2); }
  /* Default action just registers that we have one more char and
   * calls yymore() to keep accumulating the token.
   */
.|\n      { ++vhdl_length; yymore(); }
  /* In the push model, we're responsible for sending EOF to the parser */
<<EOF>>   { SEND_TEXT; return yypush_parse(parser, 0, 0); }

%%

/* In this model, the lexer drives everything, so we provide the
 * top-level interface here.
 */

int parse_vhdl(FILE* in) {
  yyin = in;
  /* Create a new pure push parser */
  yypstate* parser = yypstate_new();
  int status = yylex(parser);
  yypstate_delete(parser);
  return status;
}

To actually get that to work with bison, you need to provide a couple of extra options:

parser.y

%code requires {
  /* requires blocks get copied into the tab.h file */
  /* Don't do this if you prefer a %union declaration, of course */
  #define YYSTYPE char*
}
%code {
  #include <stdio.h>
  void yyerror(const char* msg) { fprintf(stderr, "%s\n", msg); }
}

%define api.pure full
%define api.push-pull push

score 1 · Accepted Answer · answered Dec 30 '14 at 23:44

The easy way is to get rid of the * in your default rule at the end and just use

.    { append_to_buffer(*yytext); }

so your default rule takes all the stuff that isn't matched by the previous rules and stuffs it into a buffer somewhere, to be dealt with by someone else.
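The answer does not show `append_to_buffer`, so here is one possible sketch, assuming a single global growable buffer. The companion `take_buffer` is a hypothetical helper of my own: the pattern rules (and the end-of-file rule) would call it to hand the accumulated text on as one piece.

```c
#include <stdlib.h>

static char  *buf = NULL;   /* accumulated unmatched text */
static size_t buf_len = 0;  /* characters stored, excluding the terminator */
static size_t buf_cap = 0;  /* allocated size in bytes */

/* Append one character, keeping the buffer NUL-terminated. */
void append_to_buffer(char c) {
    if (buf_len + 2 > buf_cap) {                 /* room for c plus '\0' */
        size_t new_cap = buf_cap ? buf_cap * 2 : 64;
        char *p = realloc(buf, new_cap);
        if (p == NULL) abort();                  /* out of memory */
        buf = p;
        buf_cap = new_cap;
    }
    buf[buf_len++] = c;
    buf[buf_len] = '\0';
}

/* Hypothetical companion: hand the accumulated text to the caller
 * (who must free it) and reset. Returns NULL if nothing was buffered. */
char *take_buffer(void) {
    char *out = buf;
    buf = NULL;
    buf_len = buf_cap = 0;
    return out;
}
```

Because the capacity doubles, appending is cheap on average, and the parser sees one token per run of unmatched text rather than one per character. Note that `.` in lex does not match a newline, so the default rule would need to be `.|\n` if line breaks should end up in the buffer as well.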

answered Dec 30 '14 at 23:44

Chris Dodd


But Chris, don't you think the parsing will be too slow in this case, as it will have to treat each and every character as a valid token and pass it on to the grammar to deal with the whole text? – Ankur Gautam Jan 04 '15 at 09:06
I tried collecting the text word by word, i.e. by detecting spaces. But that fails when two valid tokens are not separated by a space (which is possible, as the input comes from the user's side). – Ankur Gautam Jan 04 '15 at 09:09
