Context

I'm trying to generate a parser for BCP47 Language-Tag values, which are specified in ABNF (Augmented Backus–Naur form). I'm doing this in Haskell and would like to use the robust BNFC tool-chain, which expects LBNF (Labeled Backus–Naur form). I searched for tooling to do this conversion automatically and found none, so I'm attempting to write the LBNF grammar by hand, using the ABNF as a reference.

Attempted so far

I've done a lot of searching, and I think this question may be useful, but I can't get bnfc to accept any use of ε; it always reports a syntax error at that character. For example,

Convert every option [ E ] to a fresh non-terminal X and add

X = ε | E.

Following that, I wrote:

-- ABNF option:
-- foo = [ E ]

-- Fresh X
Foo. Foo ::= X ;

-- add
X. X ::= ε | E ;

E. E ::= "e" ;

bnfc rejects this with:

syntax error at line 8, column 10 due to lexer error
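
For comparison, here is a minimal sketch of how the same option might be written without ε. It assumes that LBNF expresses an empty alternative as a rule with nothing between `::=` and `;`, with each alternative given its own labelled rule. The labels `XNone`, `XSome` and `EE` are made up for illustration.

```lbnf
-- ABNF option:
-- foo = [ E ]

Foo.   Foo ::= X ;

-- X = ε | E, written as one labelled rule per alternative
XNone. X ::= ;      -- the empty alternative: nothing before the semicolon
XSome. X ::= E ;

EE.    E ::= "e" ;
```

Under that assumption, `XNone` plays the role of ε, and no special symbol is needed in the grammar file.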

Giving up on that, I tried to get something even simpler working:

language = 2*ALPHA

I could not.

I've seen some BNF documentation (sorry, I've lost the link) with an example for digits that looked like:

number ::= digit
number ::= number digit

This makes sense to me, so I tried the following:

LanguageISO2. Language ::= ALPHA ALPHA ;

token ALPHA ( letter ) ;

This fails to parse "en", but does parse "e n". It's clear why, but what is the right way to do what I'm intending?

I can make things kind of work by abusing token:

LanguageISO2. Language ::= ALPHA_TWO ;

token ALPHA_TWO ( letter letter ) ;

But this will quickly get out of hand as I handle 2*3ALPHA, 5*8alphanum, etc.
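
A sketch of how a bounded repetition might be folded into a single token rather than one token per length: it assumes BNFC's token regular expressions accept a postfix `?` (optional) operator alongside sequencing. The names `ALPHA_TWO_THREE` and `LanguageShort` are hypothetical. The token name deliberately avoids a trailing digit, since BNFC reads trailing digits on a category name as a precedence level.

```lbnf
-- 2*3ALPHA: two required letters, then an optional third
token ALPHA_TWO_THREE ( letter letter letter? ) ;

LanguageShort. Language ::= ALPHA_TWO_THREE ;
```

This shortens the definitions but does not address overlap. As far as I understand, the generated lexer assigns token kinds without knowing the parser's context, so two token definitions that match the same text (a two-letter language and a two-letter region, say) cannot be told apart at the lexing stage.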

Specific Question

Could someone convert the following to LBNF so I can see the right approach to these things?

   langtag       = (language
                    ["-" script]
                    ["-" region]
                    *("-" variant))

   language      = (2*3ALPHA [ extlang ])

   extlang       = *3("-" 3ALPHA)         ; reserved for future use

   script        = 4ALPHA                 ; ISO 15924 code

   region        = 2ALPHA                 ; ISO 3166 code
                 / 3DIGIT                 ; UN M.49 code

   variant       = 5*8alphanum            ; registered variants
                 / (DIGIT 3alphanum)

   alphanum      = (ALPHA / DIGIT)       ; letters and numbers

Thanks very much in advance.

pbrisbin