I am writing a flex program to deal with string constants.

I want to return an ERROR token when the input file meets EOF inside a string.

I got the following error after the file meets EOF and "ERROR" is printed:

fatal flex scanner internal error--end of buffer missed

Here is my code: (a simplified version which can reproduce this error)

%option noyywrap
    #define ERROR 300
    #define STRING 301
    char *text;
%x str

%%

\"            {BEGIN(str); yymore();}
<str>\"       {BEGIN(INITIAL); text=yytext; return STRING;}
<str>.        {yymore();}
<str><<EOF>>  {BEGIN(INITIAL); return ERROR;}

%%

int main(){
    int token;
    while((token=yylex())!=0){
        if(token==STRING)
            printf("string:%s\n",text);
        else if(token==ERROR)
            printf("ERROR\n");
    }
    return 0;
}

When I delete the yymore() function calls, the error disappears and the program exits normally after printing "ERROR".

I wonder why this happens and I want to solve it without removing yymore().

splash
zzh1996
  • I am actually not getting the error you said for the given reproducible example. Can you mention which version of flex you were using? – Sourav Kannantha B Dec 07 '22 at 20:08
  • @sourav: You probably just didn't provide the same input. OP's example only produces the fatal flex error if the input terminates inside a string *without a newline*. Since the newline is not recognised by any rule in the `<str>` start condition, the default rule is triggered. The default rule does not call `yymore()`, so yymore is not active when EOF is reached. The fatal flex error occurs only if yymore is active when EOF is reached (and the EOF action is invoked twice). – rici Dec 07 '22 at 21:54

1 Answer

You cannot continue the lexical scan after you receive an EOF indication, so your <str><<EOF>> rule is incorrect, and that is what the error message indicates.

As with any undefined behaviour, there are circumstances in which the error may lead to arbitrary behaviour, including working as you incorrectly assumed it would work. (With your flex version, this happens if you don't use yymore, for example.)

You need to ensure that the scanner loop is not reentered after the EOF is received. You could, for example, return an error code which indicates that no more tokens can be read (as opposed to a restartable error indication, if needed.) Or you could set a flag for the lexer which causes it to immediately return 0 after an unrecoverable error.
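The first strategy could be sketched roughly as follows. This is only an illustration under some assumptions: `FATAL_ERROR` is a hypothetical token code (it is not in the original program), `run_scanner` is a made-up name for the driver loop, and `stub_scanner` is a stand-in so the loop can be tried without flex. Printing of the string text is left out for brevity.

```c
#include <stdio.h>

/* Token codes. ERROR and STRING come from the question; FATAL_ERROR is a
 * hypothetical extra code meaning "no more tokens can be read". */
enum { ERROR = 300, STRING = 301, FATAL_ERROR = 302 };

/* Drives a scanner until it reports EOF (0) or a fatal error.
 * next_token has the same signature as flex's default yylex().
 * Returns 0 on a clean EOF and -1 if the scan ended with a fatal error. */
int run_scanner(int (*next_token)(void)) {
    int token;
    while ((token = next_token()) != 0) {
        if (token == FATAL_ERROR) {
            printf("ERROR\n");
            return -1;  /* stop here: the scanner is not called again */
        }
        if (token == STRING)
            printf("string\n");
        else if (token == ERROR)
            printf("ERROR\n");
    }
    return 0;
}

/* Stand-in scanner (assumption, for illustration only): yields one STRING,
 * then FATAL_ERROR. stub_calls counts how many times it was invoked. */
int stub_calls = 0;
int stub_scanner(void) {
    static const int tokens[] = { STRING, FATAL_ERROR, 0 };
    int i = stub_calls < 2 ? stub_calls : 2;
    stub_calls++;
    return tokens[i];
}
```

In the flex file itself, `main` would call `run_scanner(yylex)`, and the `<str><<EOF>>` action would return the fatal code in place of `ERROR`. The point of the design is that the decision to stop lives in the caller, so the scanner needs no extra state.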

Here's an example of the second strategy (just the rules, since nothing else changes):

%%
              /* Indented code before the first pattern is inserted
               * at the beginning of yylex, allowing declaration of
               * variables. The fatal_error flag is declared static,
               * since this is not a reentrant lexer. If it were
               * reentrant, we'd put the flag in the lexer context
               * (as part of the "extra data"), which would be a lot cleaner.
               */
              static int fatal_error = 0;
              /* If the error we last returned was fatal, we do
               * not re-enter the scanner loop; we just return EOF
               */
              if (fatal_error) {
                  fatal_error = 0; /* reset the flag for reuse */
                  return 0;
              }

\"            {BEGIN(str); yymore();}
<str>\"       {BEGIN(INITIAL); text=yytext; return STRING;}
<str>.        {yymore();}
<str><<EOF>>  {BEGIN(INITIAL);
               fatal_error = 1; /* Set the fatal error flag */
               return ERROR;}

%%

Another possible solution is to use a "push parser", where yylex calls the parser with each token, instead of the other way round. bison supports this style, and it's often a lot more convenient; in particular, it allows an action to send more than one token to the parser, which in this case would obviate the need for a static local flag.

rici
  • I am actually not getting the said error for the given reproducible example. Is this question and answer still valid?? – Sourav Kannantha B Dec 07 '22 at 20:09
  • @Sourav: As I say in the answer, undefined behaviour is just undefined, not a particular error message. The precise circumstances under which each particular behaviour manifests may be complicated, inconsistent and/or dependent on external circumstances. The OP is not really a [mre] because it fails to specify the precise parser input which led to the error, but as it happens, nothing has changed since this question was asked and had you correctly guessed the input, you would have received the same result. – rici Dec 07 '22 at 21:59
  • Try running OP's program (here called `jam`) like this: `printf %s '"' | ./jam`. Perhaps I should edit that into the question, for posterity. – rici Dec 07 '22 at 22:01
  • Thanks, I was able to reproduce the error. But surprisingly, error won't occur if I use `yy_scan_string("\"\0\0");`. (I used this before I commented above). – Sourav Kannantha B Dec 07 '22 at 22:42
  • @Sourav: With that input, the quoted string is terminated (which also finishes the `yymore`), so I don't see why you find it surprising that there is no error. (The NUL characters, like any other character outside a quoted string, are not explicitly handled by OP's lexer, but the default fallback does not return anything to yylex's caller, so no error token is produced.) – rici Dec 07 '22 at 23:41
  • Also, if you were thinking that you needed to provide two NULs to satisfy yylex's input requirements, that's only true for `yy_scan_buffer()`, which avoids making a copy of the provided string by requiring the idiosyncratic double NUL termination. `yy_scan_string` and `yy_scan_bytes` add the double-NUL terminator themselves when they copy the provided argument. – rici Dec 07 '22 at 23:45