
I've written a very low-performance recursive descent parser for a general-purpose language (open source, for EBNF grammars), and I want to fix its performance by rewriting the parser.

I've read about lexical analysis and about LL, LR, and LALR parsers and modifications like LL(*). I've read the first three chapters of the Dragon Book (about lexers and parsers), and I've explored open-source projects like ANTLR and others.

And I want to know why the algorithm below isn't described anywhere. Maybe it's the wrong way to go, I don't know. Or maybe I've reinvented the wheel.

Assume we have this grammar (e: end of file):

A: B? B 1? e
B: 0 | 1

The grammar after transformation:

A: B B 1 e | B B e | B e
B: 0 | 1

Possible scenarios:

[01] [01] [1] [e]
[01] [01] [e]
[01] [e]

We can build something like an FSM:

Symbol #0:

[01]: continue

Symbol #1:

[01]: continue
[e]: parse as "B e"

Symbol #2:

[1]: parse as "B B 1 e"
[e]: parse as "B B e"

It will parse the token stream in O(N). For a real grammar it can be modified to be more than a simple FSM, but still O(N).
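Here is roughly what I mean, as a minimal sketch in Swift (my language; the type and function names are made up just for illustration):

enum Token { case zero, one, eof }

// Which alternative of A matched.
enum Production { case bB1e, bBe, be }

// One table-style decision per input symbol, so the whole scan is O(N).
func recognize(_ tokens: [Token]) -> Production? {
    // Symbol #0 must be a B, i.e. 0 or 1.
    guard tokens.count >= 2, tokens[0] != .eof else { return nil }
    // Symbol #1: e means "B e"; 0 or 1 means continue.
    switch tokens[1] {
    case .eof:
        return .be
    case .zero, .one:
        // Symbol #2 decides between "B B 1 e" and "B B e".
        guard tokens.count >= 3 else { return nil }
        switch tokens[2] {
        case .one where tokens.count == 4 && tokens[3] == .eof:
            return .bB1e
        case .eof where tokens.count == 3:
            return .bBe
        default:
            return nil
        }
    }
}

For example, recognize([.zero, .one, .one, .eof]) gives .bB1e, and recognize([.one, .eof]) gives .be.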

So I have these questions:

  1. Can this approach give positive results?

  2. Is it related to LL, LR, and other parsers? At the moment I don't have enough understanding of those algorithms, and I haven't tried any of them.

  3. Which parsing algorithm is fastest for correct input strings? I'm interested only in parsing correct input, because I'm making a code generation tool for use with an IDE, which can report errors itself. So I need the fastest algorithm for this very specific task.

Thanks.

UPD:

I ended up with ANTLR v4; I found the target and runtime for my language (Swift), and I'm more than satisfied.

artyom.razinov
  • You forgot `B 1 e` in your transformed grammar. But it's not clear what you are asking. It's a well-known fact that the set of context-free languages is a superset of the set of regular languages. Are you asking how to recognize if a given grammar does, in fact, recognize a regular language? http://stackoverflow.com/questions/559763/regular-vs-context-free-grammars – chepner Apr 01 '16 at 20:53
  • If you want somebody to tell you whether you are re-inventing an algorithm, you should try asking on the Computer Science branch of Stack Exchange. FWIW, LR parsing uses state machines generalized to push-down automata. – Ira Baxter Apr 01 '16 at 20:56

1 Answer


LALR(k) is O(N) and can be lightning fast if you reduce it to machine code for "branch on token in state to next state, stack the token value". (See this paper: https://web.archive.org/web/20170809025652id_/http://www.genesishistory.org/content/ProfPapers/VF-LRParsing.pdf)
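In skeleton form, the hot loop of such a table-driven parser looks something like this (a sketch only; the table encoding and the names are illustrative, not the machine-code scheme from the paper):

// Sketch of a table-driven LR parse loop; table layout is illustrative.
enum Action {
    case shift(toState: Int)
    case reduce(popCount: Int, lhs: Int)  // pop the RHS, then GOTO on the LHS
    case accept
    case error
}

func parse(tokens: [Int],
           action: [[Action]],      // action[state][token]
           goto: [[Int]]) -> Bool { // goto[state][nonterminal]
    var stack = [0]  // state stack; 0 is the start state
    var i = 0
    while true {
        // The hot path: branch on (current state, current token).
        switch action[stack.last!][tokens[i]] {
        case .shift(let s):
            stack.append(s)
            i += 1  // each token is consumed exactly once: O(N) overall
        case .reduce(let pop, let lhs):
            stack.removeLast(pop)
            stack.append(goto[stack.last!][lhs])
        case .accept:
            return true
        case .error:
            return false
        }
    }
}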

It is unclear what you'd gain by trying to develop your idea further; how would it be faster than that?

[What tends to matter most isn't parsing; it is usually the rate at which you can build lexemes, especially eliminating white space.]
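As an example of that point, the lexer's hottest loop is often just a byte-level white-space skip like the following sketch (my own naming, not code from any particular tool):

// Sketch: skipping white space is often the lexer's hottest loop.
// Scanning UTF-8 bytes keeps it to one compare-and-advance per character.
func skipWhitespace(_ bytes: [UInt8], from start: Int) -> Int {
    var i = start
    while i < bytes.count {
        switch bytes[i] {
        case 0x20, 0x09, 0x0A, 0x0D:  // space, tab, LF, CR
            i += 1
        default:
            return i
        }
    }
    return i
}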

If you are serious about building a tool, you should work on the tool and take the best technologies you can get so you don't have to invent them.

If you insist on inventing new technologies, you'll end up having to patch/enhance/tune them, and never get around to building the tool. Maybe you have a great idea. You'll have to invest more energy to find out.

Just be sure you know which goal you are trying to achieve.

Ira Baxter
  • I thought that parsing should be slower than tokenizing. Wow. That changes everything. So I don't have to optimize the parser; I have to use one of the good algorithms (LALR(k), for example) and optimize the lexer. – artyom.razinov Apr 01 '16 at 20:58
  • Parsing often is a bit slower than lexing in terms of effort per entity processed (lexers don't push nonterminals onto a stack). The trouble is that the lexer processes character entities while the parser processes lexical entities. As a sloppy measure, you can imagine the parser does one entity action for every 10 done by the lexer. If both take roughly the same amount of effort per entity, it's the lexer that dominates the cost. – Ira Baxter Apr 01 '16 at 21:01
  • I'll rewrite my parser to use LALR, but I'm just curious: is it theoretically possible to generate a state machine without stacking nonterminals? If we have N tokens and M nonterminals, the complexity of LALR can be O(N*M); is it theoretically possible to get O(N*1)? – artyom.razinov Apr 01 '16 at 21:08
  • You can do this thought experiment: for every path through a stacking parser, generate a separate branch. For finite input you can do this, and you produce a probably finite set of branches. For any grammar with a list of entities, you probably get exponential blowup in the number of branches, so it doesn't seem like a good way to go. If you *could* do that, you would have a very fast parser... but then you'll discover there are other things your compiler has to do, and having the parser be that fast probably doesn't buy you much in overall performance. So: exponential space, small programs only, no gain... nah. – Ira Baxter Apr 01 '16 at 21:13
  • @artyom.razinov I've been thinking about a bidirectional lexer, so instead of stacking non-terminals you can run the lexer backward on a failed match. – Nick Strupat May 15 '20 at 02:30
  • @NickStrupat: If your program has only one (lexical) error in it, that might work. That frankly doesn't seem like a lot of help; our lexers produce a special token UNRECOGNIZED on encountering an error with precise location information (line/column start and end); this is almost always enough to make the fix rather obvious. If the program has multiple lexical errors, where do you start the backward scan? – Ira Baxter May 15 '20 at 14:57
  • @NickStrupat: So this is tricky. See my SO answer about parallel parsing which discusses McKeeman's approach where he has to start lexing in multiple places: https://stackoverflow.com/a/51983427/120163 – Ira Baxter May 15 '20 at 14:59