Suppose I need a simple grammar that describes a language like

foo 2
bar 21

but not

foo1

Using JFlex, I wrote something like

WORD=[a-zA-Z]+
NUMBER=[0-9]+
WHITE_SPACE_CHAR=[\ \n\r\t\f]

%state AFTER_WORD
%state AFTER_WORD_SEPARATOR

%%
<YYINITIAL>{WORD}               { yybegin(AFTER_WORD); return TokenType.WORD; }        
<AFTER_WORD>{WHITE_SPACE_CHAR}+ { yybegin(AFTER_WORD_SEPARATOR); return TokenType.WHITE_SPACE; }        
<AFTER_WORD_SEPARATOR>{NUMBER}  { yybegin(YYINITIAL); return TokenType.NUMBER; }        

{WHITE_SPACE_CHAR}+             { return TokenType.WHITE_SPACE; }

But I don't like the extra states, which exist only to say that there must be whitespace between the word and the number. How can I simplify my grammar?

Stan Kurilin

2 Answers

You shouldn't need white space tokens when parsing at all.

Get rid of TokenType.WHITE_SPACE, and when you get white space in the lexer, just ignore it instead of returning anything.

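To prevent 'foo1', add another rule for [A-Za-z0-9]+, placed after the rules for words and numbers, and another token type for it that doesn't appear in the grammar. Because the lexer prefers the longest match, 'foo1' becomes a single token of that type, and then it's a syntax error.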
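
A minimal sketch of that suggestion, reusing the macros from the question. `BAD_WORD` and `TokenType.BAD_WORD` are hypothetical names introduced here for illustration; they are not in the original spec. The sketch relies on JFlex choosing the longest match and, when two rules match the same length, the rule listed first.

```jflex
WORD=[a-zA-Z]+
NUMBER=[0-9]+
WHITE_SPACE_CHAR=[\ \n\r\t\f]
// Hypothetical catch-all macro: any run of letters and digits.
BAD_WORD=[a-zA-Z0-9]+

%%
{WORD}               { return TokenType.WORD; }
{NUMBER}             { return TokenType.NUMBER; }
// Only wins when it matches more than WORD or NUMBER would, e.g. "foo1".
{BAD_WORD}           { return TokenType.BAD_WORD; }
{WHITE_SPACE_CHAR}+  { /* skip whitespace: no token is returned */ }
```

With these rules, `foo 2` should produce WORD followed by NUMBER, while `foo1` should produce a single BAD_WORD token, which the grammar never accepts.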

AdrieanKhisbe
user207421
  • That seems true, but I actually need those whitespace tokens: I am also developing an IDE plugin, and all elements are valuable. – Stan Kurilin Jan 30 '13 at 11:35
  • Which IDE are you targeting? – Bastien Jansen Jan 30 '13 at 12:43
  • Then you should not get rid of TokenType.WHITE_SPACE as @EJP suggested, because it is needed in your `ParserDefinition`. The JFlex snippet I suggested in my answer should work. Then in your parser you will write the logic that checks if an identifier is followed by a number. I suggest you take a look at this very nice tutorial, if you haven't done so yet: http://confluence.jetbrains.com/display/IntelliJIDEA/Custom+Language+Support There you will learn how to write a parser using Grammar-Kit, which is a very helpful tool :) – Bastien Jansen Jan 30 '13 at 12:56
  • @Nebelmann OK, thanks. Of course I saw that tutorial. But since I also need a parser that is separate from IDEA, I decided to make the lexer more powerful. It was a mistake. – Stan Kurilin Jan 30 '13 at 13:28

From what I know of JFlex, if you are recognizing whitespace correctly (which seems to be the case), you don't have to use extra states. Just make a rule for "identifiers", and another one for "numbers".

%%
{WORD}    { return TokenType.WORD; }
{NUMBER}  { return TokenType.NUMBER; }

If your language requires each line to consist of exactly one identifier, one space and one number, this should be checked by syntactic analysis (i.e. by a parser), not lexical analysis.
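
As a rough illustration of moving that check into the parser, here is a small Java sketch that validates the token sequence of one line. The `LineChecker` class and its `TokenType` enum are hypothetical stand-ins for whatever your lexer and parser actually use, and the check accepts any single whitespace token rather than exactly one space character.

```java
import java.util.List;

public class LineChecker {
    // Hypothetical token kinds, mirroring the TokenType constants in the question.
    enum TokenType { WORD, NUMBER, WHITE_SPACE }

    /** Returns true if one line's tokens are exactly WORD, WHITE_SPACE, NUMBER. */
    static boolean isValidLine(List<TokenType> tokens) {
        return tokens.size() == 3
                && tokens.get(0) == TokenType.WORD
                && tokens.get(1) == TokenType.WHITE_SPACE
                && tokens.get(2) == TokenType.NUMBER;
    }

    public static void main(String[] args) {
        // "foo 2" lexes to WORD, WHITE_SPACE, NUMBER with the stateless rules.
        System.out.println(isValidLine(
                List.of(TokenType.WORD, TokenType.WHITE_SPACE, TokenType.NUMBER)));
        // "foo1" lexes to WORD, NUMBER: the missing whitespace token is a syntax error.
        System.out.println(isValidLine(
                List.of(TokenType.WORD, TokenType.NUMBER)));
    }
}
```

The lexer stays stateless and only classifies characters; the rule that a word must be followed by whitespace and then a number lives in one place, the parser.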

Bastien Jansen