Using Perl look-ahead assertion to find individual list

Question

Given a list like this:

direct_SQL_statement ::=
  directly_executable_statement semicolon

directly_executable_statement ::=
    direct_SQL_data_statement
  | SQL_schema_statement
  | SQL_transaction_statement
  | SQL_connection_statement
  | SQL_session_statement
  | direct_implementation_defined_statement

direct_SQL_data_statement ::=
    delete_statement__searched
  | direct_select_statement__multiple_rows
  | insert_statement
  | update_statement__searched
  | truncate_table_statement
  | merge_statement
  | temporary_table_declaration

direct_implementation_defined_statement ::=
  "!! See the Syntax Rules."

apostrophe ::=
  "'"
/*
5.2     token and separator

Function

Specify lexical units (tokens and separators) that participate in SQL language.


Format
*/
token ::=
    nondelimiter_token
  | delimiter_token

identifier_part ::=
    identifier_start
  | identifier_extend
/*
identifier_start ::=
  "!! See the Syntax Rules."
identifier_extend ::=
  "!! See the Syntax Rules."
*/
large_object_length_token ::=
  digit+ multiplier

Is it possible to use Perl's look-ahead assertion to break it up into individual definitions?

I tried,

perl -0777ne 'print "$&\n^^\n\n" while /(?=\w+\s*::=)\w+\s*::=\s*.+/gs;'

but it just returned the whole thing (as if the look-ahead assertion is not working at all), while

perl -0777ne 'print "$&\n^^\n\n" while /(?=\w+\s*::=)\w+\s*::=\s*.+?/gs;'

comes up just too short:

direct_SQL_statement ::=
  d
^^

directly_executable_statement ::=
    d
^^

direct_SQL_data_statement ::=
    d
^^

direct_implementation_defined_statement ::=
  "
^^

I need to break it up into individual BNF definition chunks for further processing, like this for the initial test data:

direct_SQL_statement ::=
  directly_executable_statement semicolon
^^


directly_executable_statement ::=
    direct_SQL_data_statement
  | SQL_schema_statement
  | SQL_transaction_statement
  | SQL_connection_statement
  | SQL_session_statement
  | direct_implementation_defined_statement
^^


direct_SQL_data_statement ::=
    delete_statement__searched
  | direct_select_statement__multiple_rows
  | insert_statement
  | update_statement__searched
  | truncate_table_statement
  | merge_statement
  | temporary_table_declaration
^^


direct_implementation_defined_statement ::=
  "!! See the Syntax Rules."
^^

Notes:

The above output is from the initial test data.
The whole A ::= B construct is called a BNF definition.
The "^^" is only a visual indication that the separation was done properly.
The apostrophe and the following token are different BNF definitions and should be treated as such.
The /* ... */ comments should be filtered out of the output.
Comments may come without empty lines surrounding them. That's the reason I need to rely on the look-ahead assertion instead of paragraph mode.
The question is a follow-up to "How can EBNF or BNF be parsed?", whose solution is: "W3C EBNF doesn't end a production with a semicolon because a ::= operator comes after the LHS symbol of a new production."
The whole file can be found at github.com/ronsavage/SQL/blob/master/sql-2016.ebnf

zdim · Accepted Answer · 2022-01-23T20:48:27.807

With possible comments (/* ... */) that need to be omitted:

perl -0777 -wnE'say for m{(.*?::=.*?)\n (?: \n+ | (?:/\*.*?\*/) | \z)}gsx' bnf.txt

This captures a line with ::= and all that follows it up to: more newlines, or /*...*/ comment, or end-of-string.

The modifier /s makes . match newlines as well, which it normally doesn't, so that .*? can match multiline text. With /x literal spaces in the pattern are ignored and can be used for readability.

Or, first remove the comments and then split the input string on runs of two or more newlines:

perl -0777 -wnE's{ (?: /\* .*? \*/ ) }{\n}gsx; say for split /\n\n+/;' bnf.txt

I don't see a need for lookaheads.

The original version of this post used a paragraph mode, via -00, or a regex that splits the whole input by multiple newlines.

That was exceedingly simple and clean -- with the input from the original version of the question, that is, which had no comments. The comments that were then added may have empty lines and reading in paragraphs doesn't fly anymore since spurious ones would be introduced.

I'm restoring it below since it's been deemed useful --

If there's always an empty line separating chunks of interest then you can process the input in paragraphs

perl -00 -wne'print' file

This retains the empty line, which you appear to want to keep anyway. If not, it can be removed.

(Then, curiously, you can even do simply perl -00 -pe'1' file)

Otherwise, you can break that string on runs of more than one newline

perl -0777 -wnE'@chunks = split /\n\n+/; say for @chunks' file

or, if you indeed need to just output them

perl -0777 -wnE'say for split /\n\n+/' file

Empty lines between chunks are now removed.

I don't see a reason to go for a lookahead.

I realize that a "BNF definition" may be the line(s) after the one with ::=. In that case, one way is

perl -0777 -wnE'say for /(.+?::=.*?)\n(?:\n+|\z)/gs' file

However, with possible comments (/* ... */) that need to be omitted:

perl -0777 -wnE'say for m{(.*?::=.*?)\n (?: \n+ | (?:/\*.*?\*/) | \z)}gsx' bnf.txt

A reminder: all revisions to posts can be seen via the link which is right under a post, with the text of the last-edit timestamp.

I know the paragraph mode, but that cannot be used -- I just put in more lines to prove that. Your last line works, would you explain it please, as my Perl knowledge stops at version 5.4. — xpt, Jan 13 '22 at 21:39
@xpt Yes, I see the addition ... can you explain the added stuff? You can still parse in paragraphs but then skip whatever doesn't start with `... ::=` (for example, or select in some other way) — zdim, Jan 13 '22 at 21:41
@xpt The last line: it matches a line with `::=` and captures everything that follows (in `(.*?)`) up to a newline followed by either more newlines or end-of-string. It keeps going through the string doing this, due to `/g` modifier. The `/s` modifier makes `.` match a newline also, which it otherwise wouldn't. Should I explain this better and add to the post? (Here a "line" is really all of it) — zdim, Jan 13 '22 at 21:43
Haha, now I get it zdim, that's clever (I'll leave the look-ahead assertion in the question title as I'm thinking this might apply to other cases when people think that the look-ahead assertion is the only solution). Thanks!!! — xpt, Jan 13 '22 at 22:20
Hi zdim, I came back here for the regex that splits the whole input by multiple newlines, for https://github.com/ronsavage/SQL/blob/master/sql-92.bnf, but found it gone. I.e., all your sample code covers various cases which are very useful, thus no need to hide it in the revisions IMHO. Would you like to unhide it, or is it OK for me to do it for you? — xpt, Jan 23 '22 at 18:17
@xpt Alright! Restored. (Copied from older revisions, since I've edited the main text in the meanwhile so I couldn't merely roll back.) I'm glad that it's of some use :) — zdim, Jan 23 '22 at 21:40
