2

Given a lexer definition file, a grammar file (say, postgresql .y,.l flex and bison programs from it's source tree), and a file defined by those lexer and parser (say, an SQL query) to get the AST in some standard form (say, JSON of XML).

The most important aspect of this tool is - flexibility of the input format. In my example, I could recreate postgres SQL grammar in ANTLR - but I don't want to. I'd rather just use whatever postgres is using. So even though .y file contains more than the parsing rules - the tool that I'm looking for will be able to understand them with minor modifications.

Is there a generic tool that does that?

Here's a command line session with my imaginary tool ly2xml:

$ git clone git://postgres-git-url pg
$ find pg -iname *.[yl] -exec cp '{}' ~/ \;
$ echo 'SELECT * FROM (SELECT 1)'|ly2xml -parser=*.y -lexer=*.l - -O-
<SELECT>
  <ARGS>*</ARGS>
  <FROM>
    <SELECT><ARGS>1</ARGS></SELECT>
  </FROM>
</SELECT>

(note that - means it reads from standard input, and -O- means it writes to standard output).

Elazar Leibovich
  • 32,750
  • 33
  • 122
  • 169
  • What do you intend to do with XML for a wide variety of languages? – Ira Baxter Mar 29 '11 at 00:02
  • @Ira, once I get the AST in a nice form, it is much easier for me to reason about it. So continuing my example, I can find all queries that uses column `C` from table `T`, or all queries with more than three nested subqueries (to refactor them, they're too complicated). Without the query's AST I can only use bad heuristics for that. – Elazar Leibovich Mar 29 '11 at 05:53
  • @Elazer: It sounds like all you want is a SQL query parser, not a general language parser. ( If you want to reason about general language instances, you might find that the general parsers actually have pretty good pattern matchers.) – Ira Baxter Mar 29 '11 at 05:56
  • @Ira, sorry, no, it's just an example. today I want a Postgres SQL parser, tomorrow I want MySQL SQL parser, the next day I wonder which of my python scripts are using a certain language construct, and the next week I want to a list of al makefile with a certain pattern rule. This is just a use case example, I want general form of AST because it is easier to reason about. – Elazar Leibovich Mar 29 '11 at 08:22
  • @Elazer: agreed, much easier to reason about programs as structure than to reason about them as text. That's why we built DMS. – Ira Baxter Mar 29 '11 at 14:24
  • Elazar, did you ever find an answer? I want to preprocess Postgres' queries and have the exact same need – alex Aug 24 '12 at 09:39
  • @alex, I heard there are some vendors who sells parsers of Postgres queries, but I haven't found anything good enough. – Elazar Leibovich Aug 24 '12 at 12:31

1 Answers1

3

Nice thought. You're assuming one or more of:

 a) that each tool that has a grammar, uses a canonical parsing engine type (e.g., everybody uses bison)
 b) that there is some parsing tool that understands the zillion grammar specification schemes that exist
 c) that whatever the parser is, it will parse language fragments (perhaps well formed).

a) is clearly false. I've never seen b). Practically none of the parsing engines do c); they can only parse "full programs".

Your only hope IMHO is to use a parser generator that has a large number of well tested language definitions.

ANTLR is arguably one; it certainly has a long list of contributed language definitions. And they're all sort of findable in one place. Doesn't do language fragments, though, that I know of. Doubt if it has XML export for all parse trees.

Bison is arguably one; there are lots and lots of language processors built using Bison. But the definitions are scattered everywhere and it will be very hard to collect them. Also doesn't do language fragments. Pretty sure it doesn't have XML export.

Our DMS Software Reengineering Toolkit is arguably one. Has lots of language definitions. They're all collected in one place (our company). It does produce ASTs for every parse, and does have built-in XML export. DMS also can parse any language nonterminal for any language it knows.

DMS can simulate your example pretty well, given a DMS .lex, .atg ("attributed grammar") and a compatible source file.

What follows is a DMS lexer/parser-build and run, with XML export, for the Algebra grammar found at Algebra as DMS Domain (the ++XML halfway down the example is the parsing step being told to export XML):

C:\DMS\Domains\Algebra\Tools\Parser\Source>make
perl /cygdrive/c/DMS/Executables/MakeDMSTool Algebra -lexer
MakeDMSTool: Selected domain "Algebra".
LexerGenerator V2.1a
Copyright (c) 1999-2010 Semantic Designs, Inc.; All Rights Reserved
Parsing lexical specification ...
Processing mode Algebra ...
Exiting with final status 0
perl /cygdrive/c/DMS/Executables/MakeDMSTool Algebra -tool %Temporaries
MakeDMSTool: Selected domain "Algebra".
Using attribute grammar in "/cygdrive/c/DMS/Domains/Algebra/Tools/Parser/Source/Syntax/Algebra.atg"
AttributeEvaluatorGenerator V3.0
Copyright (c) 1999-2010 Semantic Designs, Inc.; All Rights Reserved
Parsing attribute grammar ...
Generating attribute evaluator(s) ...
Exiting with final status 0

rm -rf /cygdrive/c/DMS/Domains/Algebra/Tools/%Temporaries
perl /cygdrive/c/DMS/Executables/MakeDMSTool Algebra -prettyprinter
MakeDMSTool: Selected domain "Algebra".
PrettyPrinterGenerator V2.0
Copyright (c) 1999-2010 Semantic Designs, Inc.; All Rights Reserved

Parsing pretty printer specification ...
Generating pretty printer ...
Exiting with final status 0

AttributeEvaluatorGenerator V3.0
Copyright (c) 1999-2010 Semantic Designs, Inc.; All Rights Reserved
Parsing attribute grammar ...
Generating attribute evaluator(s) ...
......................

Exiting with final status 0
cd /cygdrive/c/DMS/Domains/Algebra/Tools/Parser/Source/\%Generated; \
    perl /cygdrive/c/DMS/Executables/MakeDMSTool Algebra -weave-preserve-productions %PreserveProductions.*.par
MakeDMSTool: Selected domain "Algebra".
perl /cygdrive/c/DMS/Executables/MakeDMSTool Algebra -parser
MakeDMSTool: Selected domain "Algebra".
export PARLANSEINCLUDEDIRECTORIES=`perl -e '($_ = $ARGV[0].";/cygdrive/c/DMS/Domains/PARLANSE/Library/Arrays;/cygdrive/c/DMS/Domains
/PARLANSE/Library/Bags;/cygdrive/c/DMS/Domains/PARLANSE/Library/HashTables;/cygdrive/c/DMS/Domains/PARLANSE/Library/Pipes;/cygdrive/
c/DMS/Domains/PARLANSE/Library/Sequences;/cygdrive/c/DMS/Domains/PARLANSE/Library/Sets;/cygdrive/c/DMS/Domains/PARLANSE/Library/Stac
ks;/cygdrive/c/DMS/Domains/PARLANSE/Library/Utilities;/cygdrive/c/DMS/Domains/PARLANSE/Library/Algorithms/Source;/cygdrive/c/DMS/Dom
ains/PARLANSE/Library/Booleans/Source;/cygdrive/c/DMS/Domains/PARLANSE/Library/Characters/Source;/cygdrive/c/DMS/Domains/PARLANSE/Li
brary/Graphics/Source;/cygdrive/c/DMS/Domains/PARLANSE/Library/HashTrees/Source;/cygdrive/c/DMS/Domains/PARLANSE/Library/Numbers/Sou
rce;/cygdrive/c/DMS/Domains/PARLANSE/Library/References/Source;/cygdrive/c/DMS/Domains/PARLANSE/Library/SQL/Source;/cygdrive/c/DMS/D
omains/PARLANSE/Library/Streams/Source;/cygdrive/c/DMS/Domains/PARLANSE/Library/SuffixTrees/Source;/cygdrive/c/DMS/Domains/PARLANSE/
Library/System/Source;/cygdrive/c/DMS/Domains/PARLANSE/Library/Search/Source;/cygdrive/c/DMS/Domains/PARLANSE/Library/TestSupport/So
urce") =~ s!//(.)/!$1:/!g; $_ =~ s!/cygdrive/(.)/!$1:/!g; print $_' "/cygdrive/c/DMS/Domains/Algebra/Tools/Parser/Source;/cygdrive/c
/DMS/Domains/Algebra/Tools/Parser/Source/Components;/cygdrive/c/DMS/Domains/Algebra/Tools/Parser/Source/%Generated;/cygdrive/c/DMS/D
omains/DMSStringGrammar/Tools/DomainParser/Source;/cygdrive/c/DMS/Domains/Algebra/Tools/Lexer/Source;/cygdrive/c/DMS/Domains/Algebra
/Tools/Lexer/Source/%Generated;/cygdrive/c/DMS/Domains/DMSLexical/Tools/DomainLexer/Source;/cygdrive/c/DMS/Infrastructure/HyperGraph
/Source;/cygdrive/c/DMS/Domains"`; \
    cd `echo /cygdrive/c/DMS/Domains/Algebra/Tools/Parser/Source`; \
    nice /cygdrive/c/DMS/Domains/PARLANSE/Tools/Compiler/p0c.exe  DomainParser.par
PARLANSE0 Compiler V19.16.40
Semantic Designs, Inc. *** Confidential Information
128/485/133408 smallest/average/largest activation record/grain stack space required.
Largest stack space required by function at Line    1533
 in file FFIModule.par
89 grains.
3775 functions/procedures.
223447 lines of source code read.
7160772 bytes of object code.
No errors detected.
mv -f /cygdrive/c/DMS/Domains/Algebra/Tools/Parser/Source/DomainParser.P0B /cygdrive/c/DMS/Domains/Algebra/Tools/Parser/DomainParser
.P0B

C:\DMS\Domains\Algebra\Tools\Parser\Source>run ../DomainParser ++XML C:\DMS\Domains\Algebra\Tools\Lexer\TestCase\algebraformula.txt
Domain Parser for Algebra 2.3.3
Copyright (C) Semantic Designs 1996-2010; All Rights Reserved
31 tree nodes in tree.
<DMSForest>
 <tree node="formula" type="1" domain="1" id="10qx0" parents="0" line="1" column="1" file="1">
  <tree node="product" type="4" domain="1" id="10qwx" line="1" column="1" file="1">
   <tree node="term" type="10" domain="1" id="10qwy" line="1" column="1" file="1">
<tree node="'D'" type="19" domain="1" id="10qw5" literal="0" line="1" column="1" file="1"/>
<tree node="'['" type="20" domain="1" id="10qw6" literal="0" line="1" column="2" file="1"/>
<tree node="formula" type="1" domain="1" id="10qwt" line="1" column="4" file="1">
 <tree node="product" type="4" domain="1" id="10qws" line="1" column="4" file="1">
  <tree node="term" type="9" domain="1" id="10qwr" line="1" column="4" file="1">
   <tree node="'('" type="17" domain="1" id="10qw7" literal="0" line="1" column="4" file="1"/>
   <tree node="formula" type="3" domain="1" id="10qwp" line="1" column="5" file="1">
    <tree node="formula" type="2" domain="1" id="10qwk" line="1" column="5" file="1">
     <tree node="formula" type="1" domain="1" id="10qwf" line="1" column="5" file="1">
      <tree node="product" type="5" domain="1" id="10qwe" line="1" column="5" file="1">
       <tree node="product" type="4" domain="1" id="10qwa" line="1" column="5" file="1">
    <tree node="term" type="7" domain="1" id="10qw9" line="1" column="5" file="1">
     <tree node="VARIABLE" type="15" domain="1" id="10qw8" line="1" column="5" file="1">
      <literal>x</literal>
     </tree>
    </tree>
       </tree>
       <tree node="'*'" type="13" domain="1" id="10qwb" literal="0" line="1" column="7" file="1"/>
       <tree node="term" type="8" domain="1" id="10qwd" line="1" column="8" file="1">
    <tree node="NUMBER" type="16" domain="1" id="10qwc" literal="23" line="1" column="8" file="1"/>
       </tree>
      </tree>
     </tree>
     <tree node="'+'" type="11" domain="1" id="10qwg" literal="0" line="1" column="10" file="1"/>
     <tree node="product" type="4" domain="1" id="10qwj" line="1" column="12" file="1">
      <tree node="term" type="7" domain="1" id="10qwi" line="1" column="12" file="1">
       <tree node="VARIABLE" type="15" domain="1" id="10qwh" line="1" column="12" file="1">
    <literal>y</literal>
       </tree>
      </tree>
     </tree>
    </tree>
    <tree node="'-'" type="12" domain="1" id="10qwl" literal="0" line="1" column="13" file="1"/>
    <tree node="product" type="4" domain="1" id="10qwo" line="1" column="14" file="1">
     <tree node="term" type="7" domain="1" id="10qwn" line="1" column="14" file="1">
      <tree node="VARIABLE" type="15" domain="1" id="10qwm" line="1" column="14" file="1">
       <literal>z</literal>
      </tree>
     </tree>
    </tree>
   </tree>
   <tree node="')'" type="18" domain="1" id="10qwq" literal="0" line="1" column="15" file="1"/>
  </tree>
 </tree>
</tree>
<tree node="','" type="21" domain="1" id="10qwu" literal="0" line="1" column="16" file="1"/>
<tree node="VARIABLE" type="15" domain="1" id="10qwv" line="1" column="18" file="1">
 <literal>x</literal>
</tree>
<tree node="']'" type="22" domain="1" id="10qww" literal="0" line="1" column="19" file="1"/>
   </tree>
  </tree>
 </tree>
 <FileIndex>
  <File index="1">C:/DMS/Domains/Algebra/Tools/Lexer/TestCase/algebraformula.txt</File>
 </FileIndex>
 <DomainIndex>
  <Domain index="1">Algebra</Domain>
 </DomainIndex>
</DMSForest>
Exiting with final status 0

C:\DMS\Domains\Algebra\Tools\Parser\Source>

If you really wanted an engine that understood many grammar notations, it might be easiest to build such an engine with DMS. Simply define each of the grammar formalisms (e.g., ANTLR or bison) as a DSL to DMS, parse a specific grammar formalism instance (e.g., ANLTR bnf instance) using DMS, apply DMS rewrite rules to transform that to a DMS grammar, and then build a DMS parser. (You'd have to do the same with the lexer, too.).

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • About `a`, I believe that there is a very limited number of automatic parser generators. Even if you only support bison and lex, you'll cover a good ground of the languages. Note that you don't have to re-implement all `bison` - you just have to implement something that reads a bison `.y` files and produce an executable that exports the AST to XML from the grammar. This is easier than the whole bison. – Elazar Leibovich Mar 29 '11 at 05:58
  • @Elazer: Regarding "very limited number": Wikipedia http://en.wikipedia.org/wiki/Comparison_of_parser_generators shows ~~ 100 and there are many more. Regarding bison, you won't be able to find the grammars easily. I did update my answer to say "convert bison/ANTLR/..." to an arbitrary standard and then drive a parser from that. – Ira Baxter Mar 29 '11 at 14:30
  • the question is no how many parser generators there are, but how many parser generators there are, which are used in important real world product. But it might be the case that there are too many parsers as well. – Elazar Leibovich Mar 29 '11 at 15:55
  • @Elazar: My point is you are fooling yourself if you believe you have a chance of getting a collection of grammars from all those real world products in a usable form. (An example: a clearly important real world product is GCC, *and it isn't even driven by a parser generator*, rather it is recursive descent). – Ira Baxter Mar 29 '11 at 16:00