18

Another simple question : is there any way to tell flex to prefer a rule that matches a short thing over a rule that matches a longer thing ? I can't find any good documentation about that.

Here is why I need that : I parse a file for a pseudo language that contains some keywords corresponding to control instructions. I'd like them to be the absolute priority so that they're not parsed as parts of an expression. I actually need this priority thing because I don't have to write a full grammar for my project (that would be totally overkill in my case since I perform structural analysis on the program parsed, I don't need to know the details...), so I can't use a fine grammar tuning to be sure that those blocks won't be parsed into an expression.

Any help will be appreciated.

Here is an example of a file parsed :

If a > 0 Then read(b); Endif
c := "If I were...";
While d > 5 Do d := d + 1 Endwhile

I just want to collect info on the Ifs, Thens, Endifs etc... The rest doesn't matter to me. That's why I'd like the Ifs, Thens etc... related rules to be prioritized without to have to write a grammar.

Lesmana
  • 25,663
  • 9
  • 82
  • 87
m09
  • 7,490
  • 3
  • 31
  • 58
  • Would you please show an example file? How do your pseudo-language and its "control instructions" look like? What dou you mean by "as parts of an expression"? What do you do if you find a "control instruction"? And what do you do with the rest of file? Are the files to be parsed text files or binary files? – kol Dec 07 '11 at 00:53

1 Answers1

16

From the Dragon Book 2nd edition, Section 3.5.3 "Conflict Resolution in Lex":

We have alluded to the two rules that Lex uses to decide on the proper lexeme
to select, when several prefixes of the input match one or more patterns:
    1. Always prefer a longer prefix to a shorter prefix.
    2. If the longest possible prefix matches two or more patterns, prefer the
       pattern listed first in the Lex program.

The rule above also applies to Flex. Here is what the Flex manual says (Chapter 7: How the input is matched.)

When the generated scanner is run, it analyzes its input looking for strings 
which match any of its patterns. If it finds more than one match, it takes the 
one matching the most text (for trailing context rules, this includes the length 
of the trailing part, even though it will then be returned to the input). If it 
finds two or more matches of the same length, the rule listed first in the flex 
input file is chosen.

If I understood correctly, your lexer treats keywords like Endif as an identifier, so it will be considered as part of an expression afterwards. If this is your problem, simply put the rules of keywords on top of your specification, such as the following: (suppose each word in uppercase is a predefined enum corresponding to a token)

"If"                      { return IF;         }
"Then"                    { return THEN;       }
"Endif"                   { return ENDIF;      }
"While"                   { return WHILE;      }
"Do"                      { return DO;         }
"EndWhile"                { return ENDWHILE;   }
\"(\\.|[^\\"])*\"         { return STRING;     }
[a-zA-Z_][a-zA-Z0-9_]*    { return IDENTIFIER; }

Then the keywords will always matched before the identifier due to Rule No. 2.

EDIT:

Thank you for your comment, kol. I forgot to add the rule for string. But I don't think my solution is wrong. for example, if an identifier called If_this_is_an_identifier, rule 1 will apply, thus the identifier rule will take effect (Since it matches the longest string). I wrote a simple test case and saw no problem in my solution. Here is my lex.l file:

%{
  #include <iostream>
  using namespace std;
%}

ID       [a-zA-Z_][a-zA-Z0-9_]*

%option noyywrap
%%

"If"                      { cout << "IF: " << yytext << endl;         }
"Then"                    { cout << "THEN: " << yytext << endl;       }
"Endif"                   { cout << "ENDIF: " << yytext << endl;      }
"While"                   { cout << "WHILE: " << yytext << endl;      }
"Do"                      { cout << "DO: " << yytext << endl;         }
"EndWhile"                { cout << "ENDWHILE: " << yytext << endl;   }
\"(\\.|[^\\"])*\"         { cout << "STRING: " << yytext << endl;     }
{ID}                      { cout << "IDENTIFIER: " << yytext << endl; }
.                         { cout << "Ignore token: " << yytext << endl; }

%%

int main(int argc, char* argv[]) {
  ++argv, --argc;  /* skip over program name */
  if ( argc > 0 )
    yyin = fopen( argv[0], "r" );
  else
    yyin = stdin;

  yylex();
}

I tested my solution with the following test case:

If If_this_is_an_identifier > 0 Then read(b); Endif
    c := "If I were...";
While While_this_is_also_an_identifier > 5 Do d := d + 1 Endwhile

and it gives me the following output (other output not relevant to the problem you mentioned is ignored.)

IF: If
IDENTIFIER: If_this_is_an_identifier
......
STRING: "If I were..."
......
WHILE: While
IDENTIFIER: While_this_is_also_an_identifier

The lex.l program is modified base on an example from the flex manual: (which use the same method to match keyword out of identifiers)

Also have a look at the ANSI C grammar, Lex specification.

I also used this approach in my personal project, and so far I didn't find any problem.

rici
  • 234,347
  • 28
  • 237
  • 341
Lei Mou
  • 2,562
  • 1
  • 21
  • 29
  • This does not work. For example, the "If" pattern will be found not just in case of the "If" keyword, but also in identifiers and strings which contain the substring "If". – kol Dec 07 '11 at 05:16
  • +1 I deleted my answer, because it was unnecessarily complicated. You helped me understand that adding a rule for identifiers can be useful even if you need to identify keywords only - thanks. – kol Dec 07 '11 at 09:54
  • 1
    thanks for the time you took to write this answer but 1) lex won't prefer the earliest to the longest, it will prefer the earliest OF THE LONGEST matched, that's the meaning of rule2. 2) it's shown in your test cases. That's precisely what I want to avoid : I'd want the If in your indentifiers and strings to be returned as Ifs. – m09 Dec 07 '11 at 09:57
  • @Mog Then I need to improve my English. :-). I'll try to figure it out later. – Lei Mou Dec 07 '11 at 10:19
  • @Mog I was a little confused when reading your post again. What do you want to achieve exactly? Collecting all the occurrence of keywords (and probably something more) and then analysis the structure of your program? Could you please provide an example, showing under what circumstances does your program fails? – Lei Mou Dec 07 '11 at 12:45
  • 1
    Well actually it was more of a "want to know if possible" question since I solved my problem by adding a requirement to expressions (no space) so that my pseudo language isn't hard to parse. I admit that the example I gave when kol asked one was really bad, so sorry about that. I'll just give you the bounty and let this question die, it doesn't seem possible anyway. Thanks for your time ! – m09 Dec 07 '11 at 12:54