6

I want to write a grammar for a file format whose content can contain characters other than US-ASCII ones. Since I am used to ABNF, I try to use it...

However, neither RFC 5234 nor RFC 7405 is very friendly towards people who do NOT use US-ASCII.

In fact, I'm looking for a version of ABNF (and possibly some basic rules as well) that is character-oriented rather than byte-oriented; the only thing RFC 5234 has to say about this is in section 2.4:

2.4.  External Encodings

   External representations of terminal value characters will vary
   according to constraints in the storage or transmission environment.
   Hence, the same ABNF-based grammar may have multiple external
   encodings, such as one for a 7-bit US-ASCII environment, another for
   a binary octet environment, and still a different one when 16-bit
   Unicode is used.  Encoding details are beyond the scope of ABNF,
   although Appendix B provides definitions for a 7-bit US-ASCII
   environment as has been common to much of the Internet.

   By separating external encoding from the syntax, it is intended that
   alternate encoding environments can be used for the same syntax.

That doesn't really clarify matters.

Is there a version of ABNF somewhere which is code point oriented rather than byte oriented?

fge

2 Answers

5

Refer to section 2.3 of RFC 5234, which says:

Rules resolve into a string of terminal values, sometimes called characters. In ABNF, a character is merely a non-negative integer. In certain contexts, a specific mapping (encoding) of values into a character set (such as ASCII) will be specified.

Unicode scalar values are just the non-negative integers U+0000 through U+10FFFF minus the surrogate range U+D800 through U+DFFF, and various RFCs use ABNF accordingly. An example is RFC 3987.
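To make the "a character is merely a non-negative integer" reading concrete, here is a minimal Python sketch. The rule name `unicode-scalar` and the helpers `matches_rule` and `all_match` are made up for illustration and come from no RFC. The point is only that the ranges are compared against code point values, not against the bytes of any particular encoding.

```python
# Hypothetical rule, which would be written in ABNF as:
#   unicode-scalar = %x00-D7FF / %xE000-10FFFF
UNICODE_SCALAR = [(0x0000, 0xD7FF), (0xE000, 0x10FFFF)]


def matches_rule(value, ranges):
    """True if the integer terminal value falls inside one of the ranges."""
    return any(lo <= value <= hi for lo, hi in ranges)


def all_match(text, ranges):
    """Check a string one code point at a time, independent of any encoding."""
    return all(matches_rule(ord(ch), ranges) for ch in text)


text = "Bj\u00f6rn"  # five code points, one of them outside US-ASCII
print(all_match(text, UNICODE_SCALAR))      # grammar level: code points
print(len(text), len(text.encode("utf-8")))  # 5 code points, 6 UTF-8 bytes
print(matches_rule(0xD800, UNICODE_SCALAR))  # a surrogate is rejected
```

The same grammar would apply unchanged if the text were stored as UTF-16 or UTF-32; only the decoding step in front of it would differ, which is what section 2.4 seems to be getting at.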

Björn Höhrmann
  • An example that I just wrote: `unescaped-normal-char = %x00-5B / %x5D-7A / %x7C / %x7E-D7FF / %xE000-10FFFF`. Just don’t forget to have pity on the poor humans that will read it, and add a comment like this: `; any Unicode code point except for "\", "{" and "}"`. (And check to make sure that the range you exclude is in fact correct, too!) – Chris Morgan Feb 25 '19 at 14:00
  • Heh, I just came here and went to write a comment correcting the previous comment, only to notice I was the one that wrote that comment! Well, the correction is that the comment should read “any Unicode *scalar value*”, not “any Unicode code point”; U+D800–U+DFFF are valid Unicode code points, but not valid Unicode scalar values, and unless you’re dealing with the menace that is UTF-16 and accessing it by code points (avoid doing so!), it’s scalar values that you care about. – Chris Morgan Jun 01 '21 at 08:17
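The advice to check the excluded range can be followed mechanically. The sketch below is an assumption-laden illustration, not part of any ABNF tool: the ranges are transcribed by hand from a rule meant to allow every Unicode scalar value except `\`, `{` and `}`, and the helper `excluded` is a made-up name. It lists every value the rule does not cover, so an off-by-one or a forgotten range shows up immediately.

```python
# Ranges transcribed by hand from a hypothetical rule:
#   unescaped-normal-char = %x00-5B / %x5D-7A / %x7C / %x7E-D7FF / %xE000-10FFFF
RANGES = [
    (0x00, 0x5B),
    (0x5D, 0x7A),
    (0x7C, 0x7C),
    (0x7E, 0xD7FF),
    (0xE000, 0x10FFFF),
]


def excluded(ranges, limit=0x10FFFF):
    """Return every value in 0..limit that none of the ranges covers."""
    return [
        v for v in range(limit + 1)
        if not any(lo <= v <= hi for lo, hi in ranges)
    ]


missing = excluded(RANGES)

# ASCII characters left out by the rule; these should be the intended three.
ascii_missing = [chr(v) for v in missing if v < 0x80]
print(ascii_missing)

# Everything else left out should be a surrogate (U+D800 through U+DFFF).
non_ascii_missing = [v for v in missing if v >= 0x80]
print(all(0xD800 <= v <= 0xDFFF for v in non_ascii_missing))
```

A brute-force scan over all 1,114,112 values is slow but simple, which is arguably what you want from a sanity check.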
1

If the ABNF you're writing is intended for human reading, then I'd say just use the normal syntax and refer to code points instead of bytes. You could take a look at various language specifications that allow Unicode in source text, e.g. C#, Java, PowerShell, etc. They all have a grammar, and they all have to define Unicode characters somewhere (e.g. for identifiers).

E.g. the PowerShell grammar has lines like this:

double-quote-character:
       " (U+0022)
       Left double quotation mark (U+201C)
       Right double quotation mark (U+201D)
       Double low-9 quotation mark (U+201E)

Or in the Java specification:

UnicodeInputCharacter:
       UnicodeEscape
       RawInputCharacter

UnicodeEscape:
       \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

UnicodeMarker:
       u
       UnicodeMarker u

RawInputCharacter:
       any Unicode character

HexDigit: one of
       0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

The \, u, and hexadecimal digits here are all ASCII characters.
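As a rough illustration of how little machinery the UnicodeEscape rule above needs, here is a Python sketch that recognizes a backslash, one or more `u`, and exactly four hex digits, and replaces the match with the corresponding character. The names `UNICODE_ESCAPE` and `translate_escapes` are mine, and the sketch only covers the grammar as quoted; it ignores any further conditions the Java specification places on which backslashes may start an escape.

```python
import re

# \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit,
# where UnicodeMarker is one or more "u" characters.
UNICODE_ESCAPE = re.compile(r"\\u+([0-9a-fA-F]{4})")


def translate_escapes(source):
    """Replace each escape matching the quoted grammar with its character."""
    return UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), source)


# The input is pure ASCII; the output contains non-ASCII characters.
print(translate_escapes(r"caf\u00e9 \uu201Cquoted\u201D"))
```

The grammar itself stays entirely within ASCII, yet the text it describes does not, which is the same separation between notation and described characters that the question is after.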

Note that there is surrounding text explaining the intent – which is always better than just dumping a heap of grammar on someone.

If it's for automatic parser generation, you may be better off finding a tool that allows you to specify a grammar both in Unicode and ABNF-like form and publish that instead. People writing parsers should be expected to understand either, though.

Joey
  • Well, I do write parsers (I'm the maintainer of grappa); but I'd rather not invent Yet Another Grammar Language when there is a good one already defined, except for it being unfriendly towards i18n! – fge Mar 11 '15 at 07:34
  • In that case I'd say just use normal ABNF and make it clear that when specifying character data for terminals you're using their Unicode code points and not ASCII values. But that makes specifying terminals for entire Unicode character *classes* rather ... cumbersome. This may not be official in a way, but people should be able to understand it. – Joey Mar 11 '15 at 07:38
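One way to take some of the pain out of writing whole character classes by hand is to generate the ranges. The following is only a sketch under my own assumptions: `to_abnf_ranges` is a made-up helper, and the example class (uppercase letters below U+0100, selected with Python's `unicodedata.category`) is picked purely for illustration. The resulting list depends on the Unicode version of the Python build once you go beyond long-stable blocks.

```python
import unicodedata


def to_abnf_ranges(code_points):
    """Collapse code points into ABNF alternatives such as %x41-5A / %x5F."""
    runs = []
    start = prev = None
    for cp in sorted(set(code_points)):
        if prev is not None and cp == prev + 1:
            prev = cp  # extend the current run
            continue
        if start is not None:
            runs.append((start, prev))  # close the previous run
        start = prev = cp
    if start is not None:
        runs.append((start, prev))
    return " / ".join(
        f"%x{lo:X}" if lo == hi else f"%x{lo:X}-{hi:X}" for lo, hi in runs
    )


# Example: every code point below U+0100 with general category Lu.
uppercase = [
    cp for cp in range(0x100) if unicodedata.category(chr(cp)) == "Lu"
]
print("upper-latin1 = " + to_abnf_ranges(uppercase))
```

The generated rule would still deserve a human-readable comment next to it saying which class it was derived from and from which Unicode version.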