I'm getting back into language design/specification (via BNF/EBNF grammars) after 20+ years of leaving the topic alone (since my CS undergrad degree).
I only vaguely recall the various related terms in this space, like LR(1), LALR, etc. I've been attempting to refresh via some googling and reading, but it's coming slowly (probably because I didn't fully understand this stuff back in school). So I'm probably doing things quite roughly.
I decided I would describe a toy language with a grammar, and then try to analyze and possibly optimize it, as a part of my re-learning.
NOTE: all the snippets below can also be found in a gist here.
I started with an EBNF representation (as processed/validated by this tool):
Program := WhSp* (StmtSemi WhSp*)* StmtSemiOpt? WhSp*;
Stmt := AStmt | BStmt | CStmt | DStmt;
StmtSemi := Stmt? (WhSp* ";")+;
StmtSemiOpt := Stmt? (WhSp* ";")*;
WhSp := "_";
AStmt := "a";
BStmt := "b";
CStmt := "c";
DStmt := "d";
Here are some valid matches for this language (one match per line):
_____
;;;;;
_;_;_
a
__a__
a;
a;b;
a;_b;
_a;_b;_
_a_;_b_;_
__a__;;
_;_a;_b;c;;;__;;__d;___a___
And here are some values that wouldn't be in the language (again, one per line):
ab
a_b
a;_b_c
I then hand-converted this to the following BNF form (as processed/analyzed by this tool):
Program -> StmtSemi FinalStmtSemiOpt .
StmtSemi -> WhSp StmtSemiOpt | StmtSemiOpt .
FinalStmtSemiOpt -> StmtOpt SemiOpt WhSpOpt | WhSpOpt .
Stmt -> AStmt | BStmt | CStmt | DStmt .
StmtOpt -> Stmt | .
StmtSemiOpt -> StmtOpt Semi | StmtOpt Semi WhSpOpt StmtSemiOpt | .
Semi -> WhSpOpt ; | WhSpOpt ; Semi .
SemiOpt -> Semi | .
WhSp -> _ | _ WhSp .
WhSpOpt -> WhSp | .
AStmt -> a .
BStmt -> b .
CStmt -> c .
DStmt -> d .
That tool's analysis says my grammar is ambiguous. I guess that's not surprising or necessarily a bad outcome, but I know ambiguous grammars limit some kinds of analysis and automatic conversion or parser generation.
So... finally, here are my questions:
Is this a context-free grammar? What specifically makes it so, or would make it non-CFG?
[Edit: "Yes", see @rici's answer]
Can the language I'm describing be specified in a non-ambiguous grammar (BNF or EBNF)? Or is it just inherently ambiguous?If it's inherently ambiguous, what specific aspects of the language make it so? In other words, what minimally would I have to change/remove to arrive at a language that had a non-ambiguous grammar?Are there meaningful ways my BNF grammar form could be simplified and still describe the same language as the EBNF?Does the BNF grammar currently have left-recursion, right-recursion, or both? I'm having trouble convincing myself of the answer. Could the BNF be re-arranged to avoid one or the other, and what would the impacts (performance, etc) of that be?
[Edit: I believe the updated BNF only has right-recursion, according to the analysis tool.]
Apologies if I'm fumbling around with incorrect terminology or asking imprecise questions. Thanks for any insight you might be able to offer.
[EDIT: Here's a new BNF that I believe is equivalent but isn't ambiguous -- thanks to @rici for confirming it was possible. I didn't use any particular algorithm/strategy for this, just keep trial-n-error fiddling.]
Leading -> WhSp Leading Program | Semi Leading Program | Program .
Program -> Stmt | Stmt WhSp | Stmt WhSpOptSemi Program | Stmt WhSpOptSemi WhSp Program | .
Stmt -> AStmt | BStmt | CStmt | DStmt .
WhSpOptSemi -> Semi | WhSp Semi | Semi WhSpOptSemi | WhSp Semi WhSpOptSemi .
WhSp -> _ | _ WhSp .
Semi -> ; .
AStmt -> a .
BStmt -> b .
CStmt -> c .
DStmt -> d .
So that seems to answer questions (2), (3), and partly (4) I guess.