
I'm working on parsing a language that will have user-defined function calls. At parse time, each of these identifiers will already be known. My goal is to tokenize each instance of a user-defined identifier during the lexical analysis stage. To do so, I've used a method similar to the one in this answer with the following changes:

// Lexer.g4
USER_FUNCTION : [a-zA-Z0-9_]+ {IsUserDefinedFunction()}?;


// Lexer.g4.cs
// listOfUserDefinedFunctions is populated at runtime, before lexing begins.
bool IsUserDefinedFunction()
{
    foreach (string function in listOfUserDefinedFunctions)
    {
        if (this.Text == function)
        {
            return true;
        }
    }
    return false;
}

However, I've found that merely having the semantic predicate {IsUserDefinedFunction()}? makes parsing extremely slow (~1-20 ms without it, ~2 sec with it). Defining IsUserDefinedFunction() to always return false had no impact, so I'm positive the slowdown comes from the mere presence of the predicate. Is there any way to speed up the parsing of these keywords?

A major issue with the language being parsed is that it uses very little whitespace between tokens, so a user-defined function might begin with a language-defined keyword.

For example: given the language-defined keyword GOTO and a user-defined function GOTO20Something, a typical piece of program text could look like:

GOTO20
GOTO30
GOTO20Something
GOTO20GOTO20Something

and should be tokenized as GOTO NUMBER GOTO NUMBER USER_FUNCTION GOTO NUMBER USER_FUNCTION

Edit to clarify:

Even rewriting IsUserDefinedFunction() as:

bool IsUserDefinedFunction() { return false; }

I still get the same slow performance.

Also, to clarify: my performance baseline is "hard-coding" the dynamic keywords into the lexer, like so:

// Lexer.g4 - Poor Performance (2000 line input, ~ 2 seconds)
USER_FUNCTION : [a-zA-Z0-9_]+ {IsUserDefinedFunction()}?;

// Lexer.g4 - Good Performance (2000 line input, ~ 20 milliseconds)
USER_FUNCTION
    :   'ActualUserKeyword'
    |   'AnotherActualUserKeyword'
    |   'MoreKeywords'
    ...
    ;

Using the semantic predicate provides the correct behavior, but is terribly slow, since the predicate has to be evaluated after every alphanumeric character. Is there another way to handle tokens added at runtime?

  • Have you tried a hashtable not a list search? – Terence Parr Apr 16 '14 at 20:35
  • @TheANTLRGuy, thanks but, as I mentioned, changing the body of `IsUserDefinedFunction()` to `return false` caused the exact same slowdown, leading me to believe it's caused by the presence of the semantic predicate – Harrison Paine Apr 16 '14 at 20:44

2 Answers


Edit: Given that there are no other identifiers in this language, I would take a different approach.

  1. Use the original grammar, but remove the semantic predicate altogether. This means both valid and invalid user-defined function identifiers will result in USER_FUNCTION tokens.
  2. Use a listener or visitor after the parse is complete to validate instances of USER_FUNCTION in the parse tree, and report an error at that time if the code uses a function that has not been defined (see the sketch below).

This strategy results in better error messages, greatly improves the ability of the lexer and parser to recover from these types of errors, and produces a usable parse tree for the file (even though it's not completely semantically valid, it can still be used for analysis, reporting, and potentially to support IDE features down the road).
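A minimal sketch of the validation pass in step 2, assuming a generated parser named MyLangParser with a functionCall rule that contains the USER_FUNCTION token (the grammar, rule, and accessor names here are illustrative, not from the original question):

using System;
using System.Collections.Generic;
using Antlr4.Runtime.Tree;

// Walks the finished parse tree and reports USER_FUNCTION tokens that
// don't name a known user-defined function.
class FunctionValidationListener : MyLangBaseListener
{
    private readonly HashSet<string> knownFunctions;

    public FunctionValidationListener(HashSet<string> knownFunctions)
    {
        this.knownFunctions = knownFunctions;
    }

    public override void ExitFunctionCall(MyLangParser.FunctionCallContext context)
    {
        ITerminalNode name = context.USER_FUNCTION(); // generated accessor (assumed)
        if (name != null && !knownFunctions.Contains(name.GetText()))
        {
            Console.Error.WriteLine("line {0}: undefined function '{1}'",
                name.Symbol.Line, name.GetText());
        }
    }
}

After parsing, walk the tree with ParseTreeWalker.Default.Walk(new FunctionValidationListener(knownFunctions), tree).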


Original answer assuming that identifiers which are not USER_FUNCTION should result in IDENTIFIER tokens.

The problem is that the predicate is executed after every letter, digit, and underscore during the lexing phase. You can improve performance by declaring USER_FUNCTION as a token (and removing the USER_FUNCTION rule from the grammar):

tokens {
  USER_FUNCTION
}

Then, in the Lexer.g4.cs file, override the Emit() method to perform the test and override the token type if necessary.

public override IToken Emit() {
  // The token text is complete at this point, so the check runs once per
  // token instead of once per character.
  if (_type == IDENTIFIER && IsUserDefinedFunction())
    _type = USER_FUNCTION;

  return base.Emit();
}
Sam Harwell
  • The language doesn't actually match `IDENTIFIER`s, the only valid ones are known at runtime. The basic tokens are keywords and numeric values, and `USER_FUNCTION`s, which should behave as keywords. The input format is not necessarily delimited (by a token separator or whitespace), so the issue is finding a valid identifier when, for example "GOTO" is a keyword but "GOTO20Something" is a pre-defined identifier. I'm trying to find a solution short of treating every `[a-zA-Z0-9_]` as a token. – Harrison Paine Apr 17 '14 at 02:12
  • Matching all identifier-like strings still gives me issues with the language's lack of spacing. The line `GOTO20something` should be tokenized as `GOTO NUMBER 'something'` normally and `USER_FUNCTION` if 'GOTO20Something' is in the list of pre-defined functions. The `IDENTIFIER` rule is too greedy in this case. I'm going to rework the initial question to make this a little clearer. – Harrison Paine Apr 17 '14 at 12:10

My solution for this specific language was to use a System.Text.RegularExpressions.Regex to surround all instances of user-defined functions in the input string with a special character (I chose the § (\u00A7) character).
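A sketch of that preprocessing step (the helper name is mine; Regex.Escape guards against function names containing regex metacharacters):

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static string MarkUserFunctions(string input, IEnumerable<string> userFunctions)
{
    // Longest names first, so 'GOTO20Something' wins over any shorter prefix.
    string pattern = string.Join("|",
        userFunctions.OrderByDescending(f => f.Length).Select(Regex.Escape));

    // Wrap every occurrence in § (U+00A7) so the lexer can match it unambiguously.
    return Regex.Replace(input, pattern, m => "\u00A7" + m.Value + "\u00A7");
}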

Then the lexer defines:

USER_FUNCTION : '\u00A7' [a-zA-Z0-9_]+ '\u00A7';

In the parser listener, I strip the surrounding § characters from the function name.
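For example (the context accessor name is illustrative):

string functionName = context.USER_FUNCTION().GetText().Trim('\u00A7');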

Harrison Paine