I'm writing a grammar for parsing a computer language that can be used with Parse::Eyapp. This is a Perl package that simplifies writing parsers for context-free languages. It is similar to yacc and other LALR parser generators, but has some useful extensions, like defining tokens in terms of regular expressions.
The language I want to parse uses keywords to denote sections and describe control flow. It also supports identifiers that serve as placeholders for data. An identifier can never have the same name as a keyword.
Now, here comes the tricky part: I need to separate keywords from identifiers, but they may look similar, so I need a regular expression pattern that matches a keyword case-insensitively, and nothing else.
The solution I came up with is the following:
- Each keyword is identified by a token of the following form:
  /((?i)keyword)(?!\w)/
  - (?i) will apply case-insensitive matching for the following subpattern
  - (?!\w) will not accept any word characters (a-z, 0-9, etc.) after the keyword
  - those characters will not be part of the match
- Keywords that are a prefix of another keyword are listed after the longer keyword, so the longer keyword is tried first
- The token for matching identifiers comes last so it will only match when no keyword is recognized
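
To make the scheme above concrete, here is a minimal sketch in plain Perl that does not use Parse::Eyapp at all. The keyword list (`endif`, `end`, `if`), the identifier pattern and the `tokenize` helper are made up for illustration and are not my real token definitions; only the `/((?i)keyword)(?!\w)/` shape and the ordering follow the rules listed above.

```perl
use strict;
use warnings;

# Hypothetical keywords, for illustration only.
# A keyword that is a prefix of another one ("end") comes after the longer one ("endif").
my @keywords = qw(endif end if);

# Ordered list of [token name, pattern] pairs.
my @token_defs;
for my $kw (@keywords) {
    # (?i) makes the keyword case-insensitive; (?!\w) rejects a following
    # word character without consuming it.
    push @token_defs, [ uc($kw), qr/((?i)\Q$kw\E)(?!\w)/ ];
}

# The identifier token comes last, so it is only tried when no keyword matched.
# The identifier syntax here is an assumption.
push @token_defs, [ 'IDENT', qr/([A-Za-z_]\w*)/ ];

# Hypothetical helper: split a string into [name, text] pairs,
# trying the token patterns in order at the current position.
sub tokenize {
    my ($input) = @_;
    my @tokens;
    pos($input) = 0;
    TOKEN: while (1) {
        $input =~ /\G\s+/gc;                      # skip whitespace
        last if pos($input) >= length($input);
        for my $def (@token_defs) {
            my ($name, $re) = @$def;
            if ($input =~ /\G$re/gc) {
                push @tokens, [ $name, $1 ];
                next TOKEN;
            }
        }
        die "Unexpected input at offset " . pos($input) . "\n";
    }
    return @tokens;
}

# Print the token stream for a small sample input.
for my $tok (tokenize("IF ifx EndIf end")) {
    print "$tok->[0]: $tok->[1]\n";
}
```

In this sketch `ifx` falls through to `IDENT`, because the `(?!\w)` lookahead stops `if` from matching just the first two characters. The same lookahead also keeps `end` from matching the start of `endif`, so the order among the keywords seems to matter less than keeping the identifier token last.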
The token definitions and part of the grammar I came up with work well so far, but there is still a lot to do. However, that is not my question.
What I wanted to ask is, am I on the right track here; are there better, simpler regular expressions for matching those keywords? Should I stop and use a different approach for language parsing altogether?
The idea of using the tokenizer to match whole strings instead of single characters came from the Parse::Eyapp documentation, by the way. I started with a character-by-character grammar first, but that approach wasn't very elegant and seems to contradict the flexible nature of the parser generator. It was very cumbersome to write, too.