3

I'm writing a grammar for parsing a computer language, that can be used with Parse::Eyapp. This is a Perl package that simplifies writing parsers for regular languages. It is similar to yacc and other LALR parser generators, but has some useful extensions, like defining tokens in terms of regular expressions.

The language I want to parse uses keywords to denote sections and describe control flow. It also supports identifiers that serve as placeholders for data. An identifier can never have the same name as a keyword.

Now, here comes the tricky part: I need to separate keywords from identifiers, but they may look similar, so I need a regular expression pattern that matches an identifier case-insensitively, and nothing else.

The solution I came up with is the following:

  1. Each keyword is identified by a token of the following form: /((?i)keyword)(?!\w)/
    • (?i) will apply case-insensitive matching for the following subpattern
    • (?!\w) will not accept any word characters (a-z, 0-9, etc.) after the keyword
    • those characters will not be part of the match
  2. Keywords that are the same as the beginning of another keyword are listed after the longer keyword, so they match first
  3. The token for matching identifiers comes last so it will only match when no keyword is recognized

The token definitions and part of the grammar I came up with work well so far, but there is still a lot to do. However, that is not my question.

What I wanted to ask is, am I on the right track here; are there better, simpler regular expressions for matching those keywords? Should I stop and use a different approach for language parsing altogether?

The idea of using the tokenizer to match whole strings instead of single characters came from the Parse::Eyapp documentation, by the way. I started with a character-by-character grammar first, but that approach wasn't very elegant and seems to contradict the flexible nature of the parser generator. It was very cumbersome to write, too.

nicael
  • 18,550
  • 13
  • 57
  • 90
onitake
  • 1,369
  • 7
  • 14
  • This seems to belong rather on [codereview.se](http://codereview.stackexchange.com/) – Martin Ender Jul 01 '13 at 15:19
  • Probably, but I'm not asking for a code review here. I'd rather have some hints on developing a good tokenizer and (maybe) grammar for a programming language. I guess I should make that clearer. – onitake Jul 01 '13 at 15:22
  • It seems like you're on the right track. When I use lex/flex, I come up with similar patterns for my keywords. The main thing is to make sure to mark "word boundaries" around your keyword (which you're doing) and to match all your keyword tokens before general identifiers. – Joe Z Jul 01 '13 at 15:26
  • Not sure, but maybe you could use `\b` instead of `(?!\w)`. – HamZa Jul 01 '13 at 15:30
  • Thanks for the hint, @HamZa. I guess that could work too (and would be more elegant). – onitake Jul 01 '13 at 15:35

1 Answers1

2

If you would like to parse a language, Marpa maybe much better suited for you. Here's a tutorial. You could also use regexp grammars.

Steve P.
  • 14,489
  • 8
  • 42
  • 72
user1126070
  • 5,059
  • 1
  • 16
  • 15
  • Wow, those both look very powerful. I've spent a few days getting my hands dirty with yacc/yapp, but it's not too late to switch. – onitake Jul 01 '13 at 16:06
  • Marpa has become much easier to use and more powerful since that tutorial was written. More up to date tutorials are http://marpa-guide.github.io/index.html, http://marpa-guide.github.io/index.html and http://marpa-guide.github.io/index.html. – Jeffrey Kegler Jul 01 '13 at 23:45
  • I tried `Regexp::Grammars` so far. Its syntax is more versatile than `Parse::Eyapp` and I like that it basically enhances Perl's regexes. Unfortunately, I hit a [big showstopper](https://rt.cpan.org/Public/Bug/Display.html?id=79149). Let's have a look at Marpa, then. – onitake Jul 02 '13 at 23:21
  • Marpa looks really promising, except that it requires a lot more explicit rules in my case due to the restriction of quantifiers to single-statement rules. In line with my original question, can someone give me a hint how I could implement case-insensitive keyword matching in Marpa? String matches are out, but character classes aren't really helpful either. Do I really have to write something like [kK][eE][yY][wW][oO][rR][dD]? – onitake Jul 08 '13 at 15:30
  • I turned my last comment into [a question](http://stackoverflow.com/questions/17530973/case-insensitive-matching-in-marpa). – onitake Jul 08 '13 at 15:56