I'm working on parsing a language that will have user-defined function calls. At parse time, each of these identifiers will already be known. My goal is to tokenize each instance of a user-defined identifier during the lexical analysis stage. To do so, I've used a method similar to the one in this answer with the following changes:
// Lexer.g4
USER_FUNCTION : [a-zA-Z0-9_]+ {IsUserDefinedFunction()}?;
// Lexer.g4.cs
bool IsUserDefinedFunction()
{
foreach (string function in listOfUserDefinedFunctions)
{
if (this.Text == function)
{
return true;
}
}
return false;
}
However, I've found that just having the semantic predicate {IsUserDefinedFunction()}?
makes parsing extremely slow (~1-20 ms without, ~2 sec with). Defining IsUserDefinedFunction()
to always return false
had no impact, so I'm positive the issue is in the parser. Is there anyway to speed up the parsing of these keywords?
A major issue with the language being parsed is that it doesn't use a lot of whitespace between tokens, so a user defined function might begin with a language defined keyword.
For example: Given the language defined keyword GOTO
and a user-defined function GOTO20Something
, a typical piece of program text could look like:
GOTO20
GOTO30
GOTO20Something
GOTO20GOTO20Something
and should be tokenized as GOTO NUMBER GOTO NUMBER USER_FUNCTION GOTO NUMBER USER_FUNCTION
Edit to clarify:
Even rewriting IsUserDefinedFunction()
as:
bool IsUserDefinedFunction() { return false; }
I still get the same slow performance.
Also, to clarify, my performance baseline is compared with "hard-coding" the dynamic keywords into the Lexer like so:
// Lexer.g4 - Poor Performance (2000 line input, ~ 2 seconds)
USER_FUNCTION : [a-zA-Z0-9_]+ {IsUserDefinedFunction()}?;
// Lexer.g4 - Good Performance (2000 line input, ~ 20 milliseconds)
USER_FUNCTION
: 'ActualUserKeyword'
| 'AnotherActualUserKeyword'
| 'MoreKeywords'
...
;
Using the semantic predicate provides the correct behavior, but is terribly slow since it has to be checked for every alphanumeric character. Is there another way to handle tokens added at runtime?