
I've seen two approaches to building parsers in Scala.

The first is to extend RegexParsers and define your own lexical patterns. The issue I see with this is that I don't really understand how it deals with keyword ambiguities. For example, if a keyword matches the same pattern as identifiers, it gets parsed as an identifier.

To counter that, I've seen posts like this one that show how to use StandardTokenParsers to specify keywords. But then I don't understand how to specify regexp patterns! Yes, StandardTokenParsers comes with "ident", but it doesn't come with the other tokens I need (complex floating point number representations, specific string literal patterns and rules for escaping, etc.).

How do you get both the ability to specify keywords and the ability to specify token patterns with regular expressions?

Michael Tiller

2 Answers


I've written only RegexParsers-derived parsers, but what I do is something like this:

val name: Parser[String] = "[A-Z_a-z][A-Z_a-z0-9]*".r

val kwIf: Parser[String]    = "if\\b".r
val kwFor: Parser[String]   = "for\\b".r
val kwWhile: Parser[String] = "while\\b".r

val reserved: Parser[String] = ( kwIf | kwFor | kwWhile )

val identifier: Parser[String] = not(reserved) ~> name
Randall Schulz
  • I had seen this suggestion before and tried it, but had problems where it seemed to be consuming the token qualified with the not(...). But, I just tried it again and it does work. Thanks! – Michael Tiller Sep 22 '10 at 16:04
  • What is the point of the "\b" in the regexps? Surely you don't encode backspaces in your input language?!? – Michael Tiller Sep 22 '10 at 16:05
  • Corrected. I meant to specify a word boundary. Otherwise you match (pseudo-) keywords that appear as the prefix of legitimate identifiers. – Randall Schulz Sep 22 '10 at 16:09
  • OK, here is another update...based on my testing the definition of "reserved" isn't even necessary! It seems as though just defining the parsers for keywords (e.g. kwIf) does *something* (probably inside the implicit def) to change the tokenizing?!? Odd, but I've confirmed this quite explicitly. Can anybody explain this? – Michael Tiller Sep 22 '10 at 16:29
  • You'll have to be more explicit. Perhaps start a new question with the code that illustrates the phenomenon you're seeing. Or edit this one, if you think that makes more sense. But keep in mind that everything in a combinator parser is top-down. There's no state machine built from a spec, either at the lexical / regular level or at the level of the CFG productions. – Randall Schulz Sep 22 '10 at 17:21
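To make the word-boundary point from the comments concrete, here is a minimal, self-contained sketch of the parsers from this answer. The wrapper object `KeywordDemo` and the helper `isIdentifier` are hypothetical names added for illustration. It assumes the scala-parser-combinators module is on the classpath (it has shipped separately from the standard library since Scala 2.11); the `//> using dep` line only matters if you run it with scala-cli, and the version shown is just one that should work.

```scala
//> using dep "org.scala-lang.modules::scala-parser-combinators:2.3.0"
import scala.util.parsing.combinator.RegexParsers

// Hypothetical wrapper object; the parser definitions mirror the answer above.
object KeywordDemo extends RegexParsers {
  val name: Parser[String] = "[A-Z_a-z][A-Z_a-z0-9]*".r

  // \b is a regex word boundary: "if\\b" matches "if" but not the
  // first two characters of "iffy".
  val kwIf: Parser[String]    = "if\\b".r
  val kwFor: Parser[String]   = "for\\b".r
  val kwWhile: Parser[String] = "while\\b".r

  val reserved: Parser[String] = kwIf | kwFor | kwWhile

  // not(p) succeeds only when p fails, and it never consumes input,
  // so `name` still sees the whole word.
  val identifier: Parser[String] = not(reserved) ~> name

  // Illustrative helper: does the whole string parse as an identifier?
  def isIdentifier(s: String): Boolean = parseAll(identifier, s).successful
}
```

With these definitions, `KeywordDemo.isIdentifier("iffy")` and `KeywordDemo.isIdentifier("format")` should be true, while `"if"`, `"for"` and `"while"` are rejected. Without the `\b`, the `if` keyword regex would match the prefix of `iffy`, and `not(reserved)` would wrongly reject it.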

This is similar to the answer from @randall-schulz, but uses an explicit negative lookahead in the regular expression itself.

Here, empty is a keyword but empty? should be an identifier. The negative lookahead fails the match (without consuming the characters) if empty is followed by anything in nameCharsRE. The kw helper function is used for multiple such keywords:

  val nameCharsRE = "[^\\s\",'`()\\[\\]{}|;#]"

  private def kw(kw: String, token: Token) = positioned {
    (s"${kw}(?!${nameCharsRE})").r ^^ { _ => token }
  }
  private def empty        = kw("empty", EMPTY_KW())
  private def and          = kw("and", AND())
  private def or           = kw("or", OR())
bwbecker
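The difference between the negative lookahead and a plain `\b` can be checked with the standard library alone. The sketch below is not from either answer: `LookaheadDemo`, `kwRegex`, `matchesKeyword` and `matchesWithBoundary` are hypothetical names, and only `nameCharsRE` is copied from the code above. It uses `Regex.findPrefixOf`, which anchors the match at the start of the input, much as RegexParsers does at the current position.

```scala
import scala.util.matching.Regex

// Hypothetical demo object; only nameCharsRE is taken from the answer above.
object LookaheadDemo {
  // Any character that is allowed to continue a name.
  val nameCharsRE = "[^\\s\",'`()\\[\\]{}|;#]"

  // Keyword followed by a negative lookahead: the match fails if a name
  // character comes next, and the lookahead itself consumes nothing.
  def kwRegex(kw: String): Regex = (s"${kw}(?!${nameCharsRE})").r

  // True if `input` starts with the keyword as a complete token.
  def matchesKeyword(kw: String, input: String): Boolean =
    kwRegex(kw).findPrefixOf(input).isDefined

  // The \b variant, for comparison: '?' is not a word character, so
  // "empty\\b" still matches the front of "empty?".
  def matchesWithBoundary(kw: String, input: String): Boolean =
    (kw + "\\b").r.findPrefixOf(input).isDefined
}
```

Here `matchesKeyword("empty", "empty?")` is false, so `empty?` is left for the identifier rule, whereas `matchesWithBoundary("empty", "empty?")` is true. That is why the lookahead is the better fit when identifiers may contain characters outside `\w`.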