How can I disable all of BNFC's built-in rules, such as Ident and Integer, and the use of spaces to separate tokens?
I find them useless and annoying, since they interfere with the parsers I'm trying to write.
I already tried to redefine them, but it seems the lexer continues to generate rules for them. I could manually delete them from the generated files, but I'm completely against modifying machine-generated code.
Long version of why they are annoying:
I'm just starting to learn how to use BNFC. The first thing I tried was to convert an earlier project of mine from Alex to BNFC. In particular, I want to match only "good" Roman numerals. I thought it would be quite simple: a Roman numeral can be seen as a sequence like
<thousand-part> <hundred-part> <tens-part> <unit-part>
where the parts cannot all be empty. So a numeral either has a non-empty thousand-part and can be anything in the rest, or it has an empty thousand-part, in which case the hundred-, tens-, or unit-part must be non-empty. The same reasoning can be repeated down to the base case of units.
So I came up with this, which is more or less a direct translation of what I did in Alex:
N1. Numeral ::= TokThousands HundredNumber ;
N2. Numeral ::= HundredNumberNE ; --NE = Not Empty
N3. HundredNumber ::= ;
N4. HundredNumber ::= HundredNumberNE ;
N5. HundredNumberNE ::= TokHundreds TensNumber ;
N6. HundredNumberNE ::= TensNumberNE ;
N7. TensNumber ::= ;
N8. TensNumber ::= TensNumberNE ;
N9. TensNumberNE ::= TokTens UnitNumber ;
N10. TensNumberNE ::= UnitNumberNE ;
N11. UnitNumber ::= ;
N12. UnitNumber ::= UnitNumberNE ;
N13. UnitNumberNE ::= TokUnits ;
token TokThousands ({"MMM"} | {"MM"} | {"M"}) ; -- No x{m,n} in BNFC regexes?
token TokHundreds ({"CM"} | {"DCCC"} | {"DCC"} | {"DC"} | {"D"} | {"CD"} | {"CCC"} | {"CC"} | {"C"}) ;
token TokTens ({"IC"} | {"XC"} | {"LXXX"} | {"LXX"} | {"LX"} | {"L"} | {"IL"} | {"XL"} | {"XXX"} | {"XX"} | {"X"}) ;
token TokUnits ({"IX"} | {"VIII"} | {"VII"} | {"VI"} | {"V"} | {"IV"} | {"III"} | {"II"} | {"I"}) ;
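For reference, the language these rules are meant to accept can be sketched independently of BNFC. The following Python snippet is only an illustration of the intended behaviour, not of how BNFC tokenizes or parses: the helper names `strip_part` and `is_numeral` are hypothetical, and the alternatives are copied from the token definitions above. It strips at most one thousands, hundreds, tens, and units part, in that order, and accepts the input if it was non-empty and nothing is left over.

```python
# Token alternatives copied from the token definitions above.
THOUSANDS = ["MMM", "MM", "M"]
HUNDREDS = ["CM", "DCCC", "DCC", "DC", "D", "CD", "CCC", "CC", "C"]
TENS = ["IC", "XC", "LXXX", "LXX", "LX", "L", "IL", "XL", "XXX", "XX", "X"]
UNITS = ["IX", "VIII", "VII", "VI", "V", "IV", "III", "II", "I"]


def strip_part(text, alternatives):
    """Remove the longest alternative that is a prefix of text, if any."""
    for alt in sorted(alternatives, key=len, reverse=True):
        if text.startswith(alt):
            return text[len(alt):]
    return text  # this part is empty


def is_numeral(text):
    """True if text is <thousand> <hundred> <tens> <unit>, not all empty."""
    if not text:
        return False
    rest = text
    for part in (THOUSANDS, HUNDREDS, TENS, UNITS):
        rest = strip_part(rest, part)
    return rest == ""


print(is_numeral("MMI"))      # True
print(is_numeral("MCMXCIV"))  # True
print(is_numeral("IIII"))     # False
```

This is the behaviour I would like the generated parser to have on a single numeral written without spaces.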
Now, the problem is that when I build this parser and give it an input like:
MMI
or, in general, a numeral with more than one non-empty *-part, the parser gives an error, because BNFC cannot match MMI with a single one of my tokens and thus uses the built-in Ident rule. Since that rule doesn't appear in the grammar, it raises a parsing error, although the input string is perfectly fine according to the grammar I defined. It's the bogus Ident rule that's in the way.
Note: I verified that if I separate the different parts with spaces, the input is parsed correctly, but later on I want to use spaces to separate whole numbers, not their tokens.
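The behaviour described above is consistent with a longest-match lexer that also carries an identifier rule. The Python sketch below is a toy model of that idea, not the lexer BNFC generates. It makes three assumptions: that the identifier rule is "a letter followed by letters, digits, _ or '", that my own tokens win ties of equal length, and that whitespace is skipped between tokens. The function name `tokenize` is hypothetical.

```python
import re

# Alternatives copied from the token definitions in the grammar.
USER_TOKENS = {
    "TokThousands": ["MMM", "MM", "M"],
    "TokHundreds": ["CM", "DCCC", "DCC", "DC", "D", "CD", "CCC", "CC", "C"],
    "TokTens": ["IC", "XC", "LXXX", "LXX", "LX", "L", "IL", "XL", "XXX", "XX", "X"],
    "TokUnits": ["IX", "VIII", "VII", "VI", "V", "IV", "III", "II", "I"],
}
# Assumed shape of the built-in identifier rule.
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9_']*")


def tokenize(text):
    """Longest-match tokenizer; user tokens beat Ident on ties (assumption)."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace only separates tokens
            continue
        # Candidates are (length, priority, name); max() picks the longest.
        candidates = []
        for name, alternatives in USER_TOKENS.items():
            for alt in alternatives:
                if text.startswith(alt, pos):
                    candidates.append((len(alt), 1, name))
        match = IDENT.match(text, pos)
        if match:
            candidates.append((match.end() - pos, 0, "Ident"))
        if not candidates:
            raise ValueError(f"no token at position {pos}")
        length, _, name = max(candidates)
        tokens.append((name, text[pos:pos + length]))
        pos += length
    return tokens


print(tokenize("MMI"))   # [('Ident', 'MMI')]
print(tokenize("MM I"))  # [('TokThousands', 'MM'), ('TokUnits', 'I')]
```

In this model, MMI becomes a single Ident because the identifier rule matches three characters while the best numeral token matches only two. With a space in between, each piece is matched by one of my tokens, which agrees with what I observe.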