5

For languages with keywords, some special trickery needs to happen to prevent for example "if" from being interpreted as an identifier and "ifSomeVariableName" from becoming keyword "if" followed by identifier "SomeVariableName" in the token stream.

For recursive descent and Lex/Yacc, I've simply taken the approach (as per helpful instruction) of transforming the token stream between the lexer and the parser.

However, FParsec doesn't really seem do a separate lexer step, so I'm wondering what the best way to deal with this is. Speaking of, it seems like Haskell's Parsec supports a lexer layer, but FParsec does not?

pad
  • 41,040
  • 7
  • 92
  • 166
Hans
  • 2,230
  • 23
  • 23

2 Answers2

6

I think, this problem is very simple. The answer is that you have to:

  1. Parse out an entire word ([a-z]+), lower case only;
  2. Check if it belongs to a dictionary; if so, return a keyword; otherwise, the parser will fall back;
  3. Parse identifier separately;

E.g. (just a hypothetical code, not tested):

let keyWordSet =
    System.Collections.Generic.HashSet<_>(
        [|"while"; "begin"; "end"; "do"; "if"; "then"; "else"; "print"|]
    )
let pKeyword =
   (many1Satisfy isLower .>> nonAlphaNumeric) // [a-z]+
   >>= (fun s -> if keyWordSet.Contains(s) then (preturn x) else fail "not a keyword")

let pContent =
    pLineComment <|> pOperator <|> pNumeral <|> pKeyword <|> pIdentifier

The code above will parse a keyword or an identifier twice. To fix it, alternatively, you may:

  1. Parse out an entire word ([a-z][A-Z]+[a-z][A-Z][0-9]+), e.g. everything alphanumeric;
  2. Check if it's a keyword or an identifier (lower case and belonging to a dictionary) and either
    1. Return a keyword
    2. Return an identifier

P.S. Don't forget to order "cheaper" parsers first, if it does not ruin the logic.

Be Brave Be Like Ukraine
  • 7,596
  • 3
  • 42
  • 66
  • I like your second process. It basically does the same thing as the lexer post-processor trick, but is just inline instead. In 20/20 hindsight, that's the most obvious solution :). Thanks – Hans Feb 04 '14 at 00:01
  • 1
    The definition of `pKeyword` in the above answer is confusing me. My type inference shows that it is a `Parser` which is not (IMHO) what you want - you want to return a `Parser` or a failure wrapped in a `Reply` type and I can't see how to achieve that with the `|>>` operator? – Sam Aug 11 '15 at 10:29
  • @Sam, thanks for pointing this out. I have updated the guard rule. Depending on the domain-specifics, one may also need wrapping the parser(s) into `attempt`. Hope this helps. – Be Brave Be Like Ukraine Aug 11 '15 at 21:53
0

You can define a parser for whitespace and check if keyword or identifier is followed by it. For example some generic whitespace parser will look like

let pWhiteSpace = pLineComment <|> pMultilineComment <|> pSpaces

this will require at least one whitespace

let ws1 = skipMany1 pWhiteSpace

then if will look like

let pIf = pstring "if" .>> ws1
sdgfsdh
  • 33,689
  • 26
  • 132
  • 245
EugeneK
  • 2,164
  • 1
  • 16
  • 21