
I need a regex in JFlex that matches a string literal made of some characters, followed by a hyphen, followed by a word. However, there are a few hardcoded exceptions. My JFlex version is 1.6.1.

My regexes are:

SUFFIXES = labeled|deficient
ALPHANUMERIC = [:letter:]|[:digit:]
AVOID_SUFFIXES = {SUFFIXES} | !({ALPHANUMERIC}+)
WORD = ({ALPHANUMERIC}+([\-\/\.]!{AVOID_SUFFIXES})*)

The string "MXs12-labeled" should be tokenized into 'MXs12', '-', 'labeled' (the hyphen is caught by a different regex later), and "MXs12-C123" into 'MXs12-C123', as C123 is not on the list of suffixes.

However, the token I obtain is "MXs12-labele", one letter short of the forbidden suffix.

An obvious solution would be to include an additional non-{ALPHANUMERIC} character in the regex, but that would add this character to the match too.

Another solution seemed to be a negative lookahead, but every attempt to use one produces a syntax error; JFlex does not seem to support it (see also "Flex seems do not support a regex lookahead assertion (the fast lex analyzer)").

Does anyone know how to solve this in jFlex?

Seki
matwasilewski
  • Please take a moment to step back and assess the understandability of your question from our perspective. It is very difficult to understand what you want without a sample input and expected output. It sounds like you've tried some regex so please share what you've tried. Thank you. – MonkeyZeus Jul 17 '19 at 17:22
  • How do you expect `MXs12-labeledblack` to be tokenized? (There are *always* corner cases. Getting patterns right means trying to think about all possibilities.) – rici Jul 17 '19 at 19:46

2 Answers

As you've observed, it's much easier to work with positive matches than with negative matches. (Clearly, labele does not match labeled, and furthermore it's the longest prefix of labeled which doesn't match labeled, so it's logical that if you try to match a word which is !labeled, you'll get labele as a match.)
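To make the "longest prefix" argument concrete, here is a small Java sketch. It is not JFlex code; it only imitates what a longest-match scanner does when asked for "an alphanumeric run that is not one of the suffixes". The class and method names are made up for illustration, and the input is assumed to be purely alphanumeric already.

```java
import java.util.Set;

class NegationSketch {
    // Hypothetical helper: returns the longest non-empty prefix of `rest`
    // that is NOT one of the forbidden suffixes. A longest-match scanner
    // applied to a negated pattern behaves roughly like this.
    static String longestAllowedPrefix(String rest, Set<String> forbidden) {
        for (int end = rest.length(); end > 0; end--) {
            String prefix = rest.substring(0, end);
            if (!forbidden.contains(prefix)) {
                return prefix; // first hit is the longest, since we scan downwards
            }
        }
        return ""; // every non-empty prefix was forbidden
    }

    public static void main(String[] args) {
        Set<String> suffixes = Set.of("labeled", "deficient");
        System.out.println(longestAllowedPrefix("labeled", suffixes)); // labele
        System.out.println(longestAllowedPrefix("C123", suffixes));    // C123
    }
}
```

The full word "labeled" is rejected, so the scanner settles for the next-longest candidate, which is exactly the truncated token from the question.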

JFlex does not implement negative lookahead assertions, which are slightly different but still problematic. A negative lookahead assertion would certainly reject the suffix in MXs12-labeled, but it would also reject the suffix in MXs12-labeledblack, which would be a bit surprising, I think.

If you rephrase this with positive matches, though, it's really simple. The idea is to specify what needs to be done with every positive match. In this case, what we'll want to do with the positive match of -labeled is to put it back into the input stream, which can be done with yypushback. That would suggest rules something like this:

{ALPHANUMERIC}+ ({DELIMITER}{ALPHANUMERIC}+)* "-labeled"    { yypushback(8);  /* return the WORD token */ }
{ALPHANUMERIC}+ ({DELIMITER}{ALPHANUMERIC}+)* "-deficient"  { yypushback(10); /* return the WORD token */ }
{ALPHANUMERIC}+ ({DELIMITER}{ALPHANUMERIC}+)*               { /* return the WORD token */ }

Note that order is important, since the sequence relies on the first two patterns having higher precedence than the last pattern. (Inputs which match one of the first two patterns will also match the last pattern, but with the rules in the order indicated the last pattern will not win.)

That might or might not be what you really want. It will handle MXs12-labeled and MXs12-C123 as indicated in your question. MXs12-labeledblack and MXs12-labeled-black will both be reported as single tokens; it's not at all clear to me what your expectations are on these inputs.
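If you want to check those four inputs without generating a scanner, the following Java sketch imitates the rule selection: longest match wins, and on a tie the earlier rule wins. It is only an approximation. It uses java.util.regex (a backtracking engine) rather than JFlex's DFA, it assumes {DELIMITER} means the question's [\-\/\.] class, it approximates [:letter:] and [:digit:] with \p{L} and \p{Nd}, and the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RulePrioritySketch {
    // Assumed translation of {ALPHANUMERIC}+ ({DELIMITER}{ALPHANUMERIC}+)*
    private static final String WORD =
            "[\\p{L}\\p{Nd}]+(?:[-/.][\\p{L}\\p{Nd}]+)*";

    // The three rules, in the same order as in the answer.
    private static final Pattern[] RULES = {
        Pattern.compile(WORD + "-labeled"),
        Pattern.compile(WORD + "-deficient"),
        Pattern.compile(WORD),
    };
    // Characters each rule hands back, mirroring the yypushback() arguments.
    private static final int[] PUSHBACK = {8, 10, 0};

    // Returns the text of the first token, or null if no rule matches.
    static String firstToken(String input) {
        int bestEnd = -1;
        int bestRule = -1;
        for (int i = 0; i < RULES.length; i++) {
            Matcher m = RULES[i].matcher(input);
            // Strict '>' keeps the earlier rule when two matches tie in length.
            if (m.lookingAt() && m.end() > bestEnd) {
                bestEnd = m.end();
                bestRule = i;
            }
        }
        if (bestRule < 0) {
            return null;
        }
        return input.substring(0, bestEnd - PUSHBACK[bestRule]);
    }

    public static void main(String[] args) {
        System.out.println(firstToken("MXs12-labeled"));       // MXs12
        System.out.println(firstToken("MXs12-C123"));          // MXs12-C123
        System.out.println(firstToken("MXs12-labeledblack"));  // MXs12-labeledblack
        System.out.println(firstToken("MXs12-labeled-black")); // MXs12-labeled-black
    }
}
```

In the last two inputs the general rule produces a strictly longer match than the "-labeled" rule, so it wins regardless of order; only in the exact-suffix case do the matches tie, and that is where rule order decides.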

rici
  • Were I to use your code but separated regular expressions and return sections, would precedence matter? First two matches would always be longer than the last one, hence being preferred by jflex if I understand it correctly. – matwasilewski Jul 18 '19 at 11:15
  • @mateusz: no, because the general pattern will also match the forbidden suffix. They're not separated. – rici Jul 18 '19 at 12:14

Rici's answer solved the problem: yypushback() was exactly what I needed. As of now:

  1. JFlex catches all strings, with or without a suffix.
  2. There is an additional Java regex in the action for ACRONYMS, which checks whether the string has a suffix and calls yypushback() if so.

With the additional Java regex I can cover the edge cases mentioned. For example, "\\-labeled$" ensures that the suffix is at the end of the matched string, so MXs12-labeled-black is returned as one token, whereas MXs12-labeled is returned as three. Thank you very much!
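For anyone who wants to do the same, the check can be sketched roughly as below. This is a simplified illustration rather than my exact code: the class and method names are made up, and the suffix list is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SuffixPushbackSketch {
    // The '$' anchor means a suffix only counts at the very end of the match.
    private static final Pattern SUFFIX =
            Pattern.compile("-(?:labeled|deficient)$");

    // Hypothetical helper: how many trailing characters (hyphen included)
    // should be handed back to the scanner; 0 if there is no suffix.
    static int pushbackLength(String matched) {
        Matcher m = SUFFIX.matcher(matched);
        return m.find() ? matched.length() - m.start() : 0;
    }

    public static void main(String[] args) {
        System.out.println(pushbackLength("MXs12-labeled"));       // 8
        System.out.println(pushbackLength("MXs12-deficient"));     // 10
        System.out.println(pushbackLength("MXs12-labeled-black")); // 0
        System.out.println(pushbackLength("MXs12-C123"));          // 0
    }
}
```

Inside the JFlex action the result would be used along the lines of yypushback(pushbackLength(yytext())) before returning the token, so that the hyphen and the suffix are scanned again by the later rules.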

matwasilewski
  • This will work but it does unnecessary work. I suspect your intuitions tell you that giving JFlex more patterns makes JFlex do proportionately more work. But that's not true: the number of patterns has practically no impact on the execution time. (It does affect compile time. But that's not usually important.) The reasons for this are not going to fit into a comment; search out Russ Cox's excellent *Regular Expression Matching can be Easy and Fast* for a readable explanation. – rici Jul 18 '19 at 14:48