The most efficient lookahead substitute for jflex

Question

I am writing a tokenizer in JFlex. I need to match words like interferon-a as one token, and words like interferon-alpha as three.

The obvious solution would be lookaheads, but they do not work in JFlex. For a similar task, I wrote a function that matches one additional wildcard character after the matched pattern, checks in Java code whether it is whitespace, and pushes it back with or without part of the matched string.

REGEX = [:letter:]+\-[:letter:]\.

From the string interferon-alpha it would match interferon-al. Then, in the Java code section, it would check whether the last character of the match is whitespace. It is not, so -al would be pushed back and interferon returned.

In the case of interferon-a, the whitespace would be pushed back and interferon-a returned.

However, this function does not work if nothing follows the matched string. It also seems quite clunky. Hence, I was wondering whether there is a 'nicer' way of ensuring that the following character is whitespace without actually matching and returning it.

rici · Answer 1 · 2019-07-24T15:49:17.840

JFlex certainly has a lookahead facility, the same as (f)lex. Unlike Java regex lookahead assertions, the JFlex lookahead can only be applied at the end of a match, but it is otherwise similar. It is described in the Semantics section of the JFlex manual:

In a lexical rule, a regular expression r may be followed by a look-ahead expression. A look-ahead expression is either $ (the end of line operator) or / followed by an arbitrary regular expression. In both cases the look-ahead is not consumed and not included in the matched text region, but it is considered while determining which rule has the longest match…

So you could certainly write the rule:

[:letter:]+\-[:letter:]/\s

However, you cannot put such a rule in a macro definition (REGEX = …), as the manual also mentions (in the section on macros):

The regular expression on the right hand side must be well formed and must not contain the ^, / or $ operators.

So the lookahead operator can only be used in a pattern rule.

Note that \s matches any whitespace character, including newline characters, while . does not match any newline character. I think that's what led to your comment that REGEX = [:letter:]+\-[:letter:]\. "does not work if matched string does not have anything succeeding" (I'm guessing that you meant "does not have anything succeeding it on the same line", and also that you intended to write . rather than \.).
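
The difference between . and \s can be checked quickly outside the scanner. The following is only a sketch using java.util.regex, which is a different engine from the one JFlex generates, but its default treatment of . and \s with respect to newlines is the same as described above. The class name DotVersusWhitespace, the helper matches, and the use of \p{L} as a stand-in for [:letter:] are assumptions made for illustration.

```java
import java.util.regex.Pattern;

final class DotVersusWhitespace {
    // Hypothetical helper: true if `regex` matches the whole of `input`.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // \p{L} stands in for JFlex's [:letter:] (an approximation).
        String stem = "\\p{L}+-\\p{L}";

        // A space follows the suffix: the wildcard matches it.
        System.out.println(matches(stem + ".", "interferon-a "));    // true
        // A newline follows: `.` does not match it by default...
        System.out.println(matches(stem + ".", "interferon-a\n"));   // false
        // ...but \s does.
        System.out.println(matches(stem + "\\s", "interferon-a\n")); // true
        // Nothing follows at all: there is no character for `.` to match.
        System.out.println(matches(stem + ".", "interferon-a"));     // false
    }
}
```

The last case shows that any pattern requiring one more character also fails at the very end of the input, so that situation may need a rule of its own.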

Rather than testing for following whitespace, you might (depending on your language) prefer to test for a non-word character:

[:letter:]+\-[:letter:]/\W

or to craft a more precise specification as a set of Unicode properties, as in the definition of \W (also found in the linked section of the JFlex manual).

Having said all that, I'd like to repeat the advice from my previous answer to a similar question of yours: put more specific patterns first. For example, using the following pair of patterns will guarantee that the first one picks up words with a single-letter suffix, while avoiding the need to push anything back explicitly.

[:letter:]+(-[:letter:])?   { /* matches 'interferon' or 'interferon-a' */ }
[:letter:]+/-[:letter:]+    { /* matches only 'interferon' from 'interferon-alpha' */ }

Of course, in this case you could easily avoid the collision between the second pattern and the first pattern by using {2,} instead of + for the second repetition, but it's perfectly OK to rely on pattern ordering since it's often inconvenient to guarantee that patterns don't overlap.
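
To see why the pair of patterns above behaves as described, here is a rough Java sketch of the selection rule quoted from the manual: the lookahead counts towards the match length, the longest match wins, and on a tie the earlier rule wins. This is not how JFlex is implemented (it generates a DFA), and java.util.regex is greedy rather than longest-match in general, although for these two simple patterns the results coincide. RuleOrderSketch, RULES and nextToken are hypothetical names, and \p{L} approximates [:letter:].

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RuleOrderSketch {
    // Group 1 is the token text; group 2 is the lookahead (empty for rule 1).
    // The list order mirrors the order of the rules in the specification.
    private static final List<Pattern> RULES = List.of(
        Pattern.compile("(\\p{L}+(?:-\\p{L})?)()"),  // [:letter:]+(-[:letter:])?
        Pattern.compile("(\\p{L}+)(-\\p{L}+)"));     // [:letter:]+/-[:letter:]+

    // Returns the token at the start of `input`, or null if no rule matches.
    static String nextToken(String input) {
        String best = null;
        int bestLength = -1;
        for (Pattern rule : RULES) {
            Matcher m = rule.matcher(input);
            // m.end() covers token plus lookahead; the strict comparison
            // means an earlier rule keeps the win when lengths are equal.
            if (m.lookingAt() && m.end() > bestLength) {
                bestLength = m.end();
                best = m.group(1);  // the lookahead is not part of the token
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(nextToken("interferon-a"));      // interferon-a
        System.out.println(nextToken("interferon-alpha"));  // interferon
        System.out.println(nextToken("interferon"));        // interferon
    }
}
```

For interferon-a both rules cover twelve characters, so the first rule wins and the whole word is returned. For interferon-alpha the second rule covers all sixteen characters, so it wins and only interferon is returned as the token.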

I am not sure if this is Maven problem, but I am not able to compile anything with / or $ - e.g. your regular expression causes a build error. This is also why I thought that JFlex does not support lookaheads. — matwasilewski, Jul 24 '19 at 15:34
@santiagonasar: You're right, my mistake. JFlex doesn't allow `/` in macros; I fixed the example and added a note. Also see the new note at the end of the answer. — rici, Jul 24 '19 at 15:44
