To decide whether you have the right answer, you ideally have to compare against some kind of standard. That is hard for computer languages.
Comparing ASTs is going to be hard, because there is no standard for them. Each parser that builds ASTs builds one whose structure was designed by whoever coded the parser.
That means if you build an AST-producing parser and then get somebody else's AST-producing parser, you'll discover that the AST nodes you have chosen don't match the other parser's. Now you have to build a mapping from your AST to the other one (and how will you know the mapping is valid?). You can try to make your parser generate the same AST as another parser, but what you will discover is that the AST you produce is influenced by the parsing technology you use.
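To make the mismatch concrete, here is a tiny sketch, with every class name invented for illustration, of how two parsers might shape the very same expression `b * c` differently:

```java
// Two invented AST designs for the same Java expression "b * c".

// Parser A: one generic node type; the operator is just a string.
class BinaryExpr {
    final String op;
    final Object left, right;
    BinaryExpr(String op, Object left, Object right) {
        this.op = op; this.left = left; this.right = right;
    }
}
class Ident {
    final String name;
    Ident(String name) { this.name = name; }
}

// Parser B: one node class per operator, with children typed as Expr.
interface Expr {}
class Multiply implements Expr {
    final Expr left, right;
    Multiply(Expr left, Expr right) { this.left = left; this.right = right; }
}
class NameRef implements Expr {
    final String name;
    NameRef(String name) { this.name = name; }
}

public class AstShapeDemo {
    public static void main(String[] args) {
        Object fromA = new BinaryExpr("*", new Ident("b"), new Ident("c"));
        Expr fromB = new Multiply(new NameRef("b"), new NameRef("c"));
        // Mapping A's trees onto B's takes a case per operator string,
        // and nothing certifies that the mapping preserves meaning.
    }
}
```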
We have a similar problem with the Java front end my company produces (see bio if you want to know more). What we settle for is testing that the answer is self-consistent, plus a lot of long-term experiential testing on big pieces of code.
Our solution is to:
- Build a parser using the strongest parsing technology we can get (GLR). This means we can recognize certain constructs not easily recognized by other parsing technologies (LL, LR, ...), and thus produce AST nodes that other parsers would have a hard time producing; the first sketch after this list shows an example where this matters. Even so, we produce AST nodes in a way that completely avoids the hand-coded AST construction demanded by most other parsing technologies; hand-coding tends to produce somewhat different ASTs.
- Parse a LOT of Java code (producing ASTs) to make sure we don't have parsing errors. [The JDK is a pretty good-sized example, and it is easy to get.]
- Our tools can take an AST and regenerate (prettyprint) source code, complete with comments but perhaps with a somewhat different layout. We verify that parsed-then-prettyprinted code also parses. We then re-prettyprint the parsed prettyprinted version; this should be identical to the first prettyprinted version, since we always produce the same layout (the second sketch after this list shows this check). This test is a good indication that our AST design and implementation doesn't lose anything about the source code.
- Build symbol tables, resolve the meaning of names, and verify that legal Java programs type-check according to our front end. That doesn't tell you anything about the nature of the AST except that it is good enough (which, in fact, is good enough!). Because the type-checking task is very complex (go check your local Java standard), it is also pretty fragile: if you don't have everything right, type checking will likely fail somewhere when applied across a large body of code. Again, the JDK is a pretty good test of this; the third sketch after this list shows the shape of such a corpus check. Note: a Java parser without name and type resolution isn't very useful in practice.
- Produce JavaDoc-like cross references, including hyperlinked source code, from the above results. This makes it easy to manually check a bit of code to see that name resolution (and therefore AST construction) is sane.
- Live with the results, applying the front end to various program analysis and transformation tasks. We find the occasional problem and fix it.
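For the first bullet, here is one well-known construct, chosen here purely as an illustration, that weaker parsing technologies must special-case: nested generic type arguments end in `>>`, which a conventional Java lexer hands to the parser as a single right-shift token, so the parser never sees the two closing angle brackets it needs:

```java
// Legal Java in which the same two characters ">>" mean different things.
import java.util.List;
import java.util.Map;

public class ShiftVsGenerics {
    // Here ">>" closes two nested type-argument lists...
    Map<String, List<Integer>> table;

    int shift(int x) {
        // ...and here ">>" is the arithmetic right-shift operator.
        return x >> 2;
    }
}
```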
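Here is the prettyprint round-trip check from the third bullet, sketched minimally; `Parser` and `PrettyPrinter` are hypothetical stand-ins for whatever front end is under test:

```java
// Round-trip test: parse -> print -> reparse -> reprint, then compare.
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class RoundTripCheck {
    interface Parser { Object parse(String source); }      // source text -> AST
    interface PrettyPrinter { String print(Object ast); }  // AST -> source text

    static void check(Parser parser, PrettyPrinter printer, Path file) throws Exception {
        String original = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);

        // The prettyprinted form of the original must itself parse...
        String printed = printer.print(parser.parse(original));
        Object reparsed = parser.parse(printed);

        // ...and reprinting it must give back the identical text, since the
        // printer always produces the same layout for the same AST.
        String reprinted = printer.print(reparsed);
        if (!printed.equals(reprinted)) {
            throw new AssertionError("prettyprint is not stable for " + file);
        }
    }
}
```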
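And here is the shape of the corpus-wide type-check test from the fourth bullet; `FrontEnd` is again a hypothetical stand-in, and the JDK sources are one convenient corpus:

```java
// Walk a source tree, run parse + name/type resolution on every .java file,
// and count the failures; each failure is a lead on a front-end bug.
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class CorpusCheck {
    interface FrontEnd {
        Object parse(Path javaFile) throws Exception;            // -> AST
        void resolveNamesAndTypes(Object ast) throws Exception;  // throws on inconsistency
    }

    static long countFailures(FrontEnd frontEnd, Path corpusRoot) throws Exception {
        try (Stream<Path> files = Files.walk(corpusRoot)) {
            return files.filter(p -> p.toString().endsWith(".java"))
                        .filter(p -> {
                            try {
                                frontEnd.resolveNamesAndTypes(frontEnd.parse(p));
                                return false;  // type-checked cleanly
                            } catch (Exception e) {
                                return true;   // parse or resolution failure
                            }
                        })
                        .count();
        }
    }
}
```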
It is tough to get this right; you have to get close and keep the testing pressure on continuously, especially since the Java language keeps moving. (We're at Java 8, and Java 9 is being threatened.) Bottom line: it is a lot of work to build such a parser and check its sanity.
We'd love to have an independent set of tests, but we haven't seen one in the wild. And I would expect that such tests, if they exist (I assume Oracle and IBM have them), don't really test parsing and name resolution directly, but rather test that some bit of code compiles and runs, producing a known result. Since we aren't building a compiler, we couldn't run such tests even if we had them. We could, however, run the name-resolution and type-consistency checks, and that would be helpful.
[We actually do this for a number of language front ends. If you think Java is hard, try this with C++.]