I'm trying to parse SQL INSERT statements like:
INSERT INTO `Album` (`Title`, `ArtistId`) VALUES ('Blue Moods', 89);
using the following grammar written for Nearley:
main -> statement:*
statement -> insert_clause values_clause ";"
insert_clause -> "INSERT INTO" identifier parenthesis
values_clause -> "VALUES" parenthesis
parenthesis -> "(" list ")"
list -> value ("," value):*
value -> identifier | literal
identifier -> %IDENTIFIER | %QUOTED_IDENTIFIER
literal -> %STRING | %NUMBER
I have removed all the post-processor code for clarity as I think it shouldn't have much of a performance impact. The most expensive thing I'm doing there is calling Array.flat() in post-processors of main
and list
rules.
The generated parser takes over 20 seconds to parse 1.6 MB of SQL. In contrast, a recursive-descent parser which I wrote manually to do the same thing, takes under a second. The parser also gets slower and slower as I give it more input, so it's definitely not linear time.
I found an issue from Nearley Github which mentions 3 possible culprits for performance problems with Nearley:
- Not using a Tokenizer - I do use a tokenizer, so not a problem.
- Having an ambiguous grammar - originally I had problems with ambiguity, but I've since fixed them and the grammar in here shouldn't IMHO have any issues.
- Using right-recursive rules.
The last one I'm not sure of. The list
rule seems right-recursive to me, so I tried rewriting it as:
list -> (value ","):* value
But this had zero effect on performance.
I'm out of ideas now.
I'm pretty inexperienced with parser generators, so hopefully this is some sort of silly mistake that I'm making here.
PS. Ignore the "INSERT INTO" token. I know it should really be two separate tokens.