36

Can anyone recommend a decent JavaScript parser for Java? I believe Rhino can be used, but it seems like overkill for just parsing. Or is it the only decent solution? Any suggestions would be greatly appreciated. Thanks.

quarks
  • You want to parse but not evaluate the javascript? – jball Jun 28 '11 at 18:51
  • 7
    What is your ultimate goal? Validate a script? Create an abstract syntax tree from a script? Something else? – Bart Kiers Jun 28 '11 at 18:51
  • @Bart kybrex wants to :) I often don't know my goal until it's accomplished :P – Lime Jun 28 '11 at 19:33
  • @jball & @Bart, I need to parse it, and perhaps modify its contents. – quarks Jun 29 '11 at 00:46
  • @Bart, example: xa.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.x-a.com/xa.js'; in this script I need to get 'https://ssl' and 'http://www' texts and modify it. I know there are lots of logic involved to achieve this, however I think first thing to do is to parse the script first. – quarks Jun 29 '11 at 05:29
  • @Bart, this is done without executing the javascript,because the base objective is to 'transform' the javascript script text. – quarks Jun 29 '11 at 05:37
  • @jball if you're evaluating say, a free-form text field and are trying to allow javascript syntax characters but want to reject anything that would be executable javascript code, then a parser is about the only way to go. – avgvstvs Jan 14 '15 at 21:48
  • 4
    This is one of my favorite useful questions that is closed. +1 – mtyson Apr 11 '16 at 17:40
  • I would recommend https://github.com/graalvm/graaljs/tree/master/graal-js/src/com.oracle.js.parser as it is compatible with the most recent ECMAScript specs. – Julian Apr 10 '19 at 11:25

6 Answers

14

When using Java 1.8, there is a trick you can use to parse with the Nashorn implementation that comes out of the box. By looking at the unit tests in the OpenJDK source code, you can see how to use the parser on its own, without doing all the extra compilation etc.

Options options = new Options("nashorn");
options.set("anon.functions", true);
options.set("parse.only", true);
options.set("scripting", true);

ErrorManager errors = new ErrorManager();
Context context = new Context(options, errors, Thread.currentThread().getContextClassLoader());
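// The Source constructor became private between Java 8u31 and 8u45 (see the comments);
// on older builds, use new Source("test", "...") instead of the static factory.
Source source   = Source.sourceFor("test", "var a = 10; var b = a + 1;" +
            "function someFunction() { return b + 1; }  ");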
Parser parser = new Parser(context.getEnv(), source, errors);
FunctionNode functionNode = parser.parse();
Block block = functionNode.getBody();
List<Statement> statements = block.getStatements();

Once this code runs, you will have the Abstract Syntax Tree (AST) for the 3 statements in the 'statements' list.

This can then be interpreted or manipulated to your needs.

The previous example works with following imports:

import jdk.nashorn.internal.ir.Block;
import jdk.nashorn.internal.ir.FunctionNode;
import jdk.nashorn.internal.ir.Statement;
import jdk.nashorn.internal.parser.Parser;
import jdk.nashorn.internal.runtime.Context;
import jdk.nashorn.internal.runtime.ErrorManager;
import jdk.nashorn.internal.runtime.Source;
import jdk.nashorn.internal.runtime.options.Options;

You might need to add an access rule to make jdk/nashorn/internal/** accessible.


In my context, I am using JavaScript as an expression language for my own Domain Specific Language (DSL), which I then compile to Java classes at runtime and use. The AST lets me generate appropriate Java code that captures the intent of the JavaScript expressions.


Nashorn is available with Java SE 8.

The link to information about getting the Nashorn source code is here: https://wiki.openjdk.java.net/display/Nashorn/Building+Nashorn

Jmini
Luke Machowski
  • 1
    the Source constructor has a private access. You should use static methods like : Source.sourceFor() – herau Jun 10 '15 at 08:39
  • 1
    Thank you for that comment. The private access constructor was a change made between Java 8u_31 and Java 8u_45. Newer code should use that instead. I have also seen that they are planning on releasing this library officially in JDK 9 (or so). – Luke Machowski Jun 10 '15 at 17:47
  • @LukeMachowski: I'd be very curious to know how you handle static code analysis, e.g. to prevent infinite loops? –  Oct 25 '15 at 17:25
  • @Copernic : I don't. For my use case I rely on Garbage In-Garbage Out. If the rules are written poorly then the compiled result will be generated poorly. Maybe others have some ideas. – Luke Machowski Oct 26 '15 at 19:32
  • The Nashorn source has moved to http://hg.openjdk.java.net/jdk8/jdk8/nashorn/ – Sisyphus Oct 08 '16 at 15:57
  • I need to move away from Nashorn due to its termination at Java 15. I have to use Apache JEXL with customized Arithmetic objects. – Blessed Geek Nov 06 '20 at 21:53
14

From https://github.com/google/caja/blob/master/src/com/google/caja/parser/js/Parser.java

The grammar below is a context-free representation of the grammar this parser parses. It disagrees with EcmaScript 262 Edition 3 (ES3) where implementations disagree with ES3. The rules for semicolon insertion and the possible backtracking in expressions needed to properly handle backtracking are commented thoroughly in code, since semicolon insertion requires information from both the lexer and parser and is not determinable with finite lookahead.

Noteworthy features

  1. Reports warnings on a queue where an error doesn't prevent any further errors, so that we can report multiple errors in a single compile pass instead of forcing developers to play whack-a-mole.
  2. Does not parse Firefox style catch (<Identifier> if <Expression>) since those don't work on IE and many other interpreters.
  3. Recognizes const since many interpreters do (not IE) but warns.
  4. Allows, but warns, on trailing commas in Array and Object constructors.
  5. Allows keywords as identifier names but warns since different interpreters have different keyword sets. This allows us to use an expansive keyword set.

To parse strict code, pass in a PedanticWarningMessageQueue that converts MessageLevel#WARNING and above to MessageLevel#FATAL_ERROR.


CajaTestCase.js shows how to set up a parser, and [fromResource] and [fromString] in the same class show how to get an input of the right kind.

Mike Samuel
6

A previous answer describes a way to get under the covers of JDK 8 to parse JavaScript. They are now mainlining it in Java 9. Nice!

This means you don't need to include any libraries; instead you can rely on an official implementation from the Java team. Parsing JavaScript programmatically becomes much easier without stepping into taboo areas of Java code.

Applications of this might be where you want to use JavaScript for a rules engine which gets parsed and compiled into some other language at runtime. The AST lets you 'understand' the logic as written in the concise JavaScript language and then generate less pretty logic in some other language or framework for execution or evaluation.

http://openjdk.java.net/jeps/236

Summary from the link above:

Define a supported API for Nashorn's ECMAScript abstract syntax tree.

Goals

  • Provide interface classes to represent Nashorn syntax-tree nodes.
  • Provide a factory to create a configured parser instance, with configuration done by passing Nashorn command-line options via an API.
  • Provide a visitor-pattern API to visit AST nodes.
  • Provide sample/test programs to use the API.

Non-Goals

  • The AST nodes will represent notions in the ECMAScript specification insofar as possible, but they will not be exactly the same. Wherever possible the javac tree API's interfaces will be adopted for ECMAScript.
  • No external parser/tree standard or API will be used.
  • There will be no script-level parser API. This is a Java API, although scripts can call into Java and therefore make use of this API.
Luke Machowski
3

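Here are two ANTLR grammars for EcmaScript that are more or less working or complete (see the comments on this post):

  • https://web.archive.org/web/20120622111320/https://www.antlr.org/grammar/1206736738015/JavaScript.g
  • https://web.archive.org/web/20120614221738/https://www.antlr.org/grammar/1153976512034/ecmascriptA3.g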

From ANTLR 5 minute intro:

ANTLR reads a language description file called a grammar and generates a number of source code files and other auxiliary files. Most uses of ANTLR generate at least one (and quite often both) of these tools:

  • A Lexer: This reads an input character or byte stream (i.e. characters, binary data, etc.), divides it into tokens using patterns you specify, and generates a token stream as output. It can also flag some tokens such as whitespace and comments as hidden using a protocol that ANTLR parsers automatically understand and respect.

  • A Parser: This reads a token stream (normally generated by a lexer), and matches phrases in your language via the rules (patterns) you specify, and typically performs some semantic action for each phrase (or sub-phrase) matched. Each match could invoke a custom action, write some text via StringTemplate, or generate an Abstract Syntax Tree for additional processing.
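To make the lexer stage described above concrete, here is a minimal hand-written sketch in plain Java. It does not use ANTLR; `MiniLexer`, `Token`, and `Kind` are hypothetical names made up for this illustration. It splits the script from the question into tokens, flags whitespace as hidden instead of dropping it, and rewrites a single string literal by re-joining the token texts. It assumes input without comments, regex literals, or any need for semicolon insertion, which the comments on this post point out are the hard parts of a real JavaScript parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: none of these names come from ANTLR or any library.
class MiniLexer {
    enum Kind { STRING, IDENTIFIER, PUNCTUATION, WHITESPACE }

    static final class Token {
        final Kind kind;
        final String text;    // exact source text, quotes included for strings
        final boolean hidden; // whitespace is kept in the stream but flagged hidden

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
            this.hidden = (kind == Kind.WHITESPACE);
        }
    }

    // Divides the input into tokens. Digits and operators are emitted as
    // single-character PUNCTUATION tokens, which is enough for this sketch.
    static List<Token> tokenize(String src) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            int start = i;
            if (Character.isWhitespace(c)) {
                while (i < src.length() && Character.isWhitespace(src.charAt(i))) i++;
                tokens.add(new Token(Kind.WHITESPACE, src.substring(start, i)));
            } else if (c == '\'' || c == '"') {
                i++; // skip the opening quote
                while (i < src.length() && src.charAt(i) != c) {
                    if (src.charAt(i) == '\\') i++; // skip the escaped character
                    i++;
                }
                i = Math.min(i + 1, src.length()); // include the closing quote if present
                tokens.add(new Token(Kind.STRING, src.substring(start, i)));
            } else if (Character.isJavaIdentifierStart(c)) {
                while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) i++;
                tokens.add(new Token(Kind.IDENTIFIER, src.substring(start, i)));
            } else {
                i++;
                tokens.add(new Token(Kind.PUNCTUATION, src.substring(start, i)));
            }
        }
        return tokens;
    }

    // Returns the source text of every string literal, in order of appearance.
    static List<String> stringLiterals(String src) {
        List<String> literals = new ArrayList<>();
        for (Token t : tokenize(src)) {
            if (t.kind == Kind.STRING) literals.add(t.text);
        }
        return literals;
    }

    // Rebuilds the source from its tokens, swapping one string literal for another.
    // Hidden tokens are re-emitted, so everything else is preserved byte for byte.
    static String replaceLiteral(String src, String oldLiteral, String newLiteral) {
        StringBuilder out = new StringBuilder();
        for (Token t : tokenize(src)) {
            boolean match = t.kind == Kind.STRING && t.text.equals(oldLiteral);
            out.append(match ? newLiteral : t.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String src = "xa.src = ('https:' == document.location.protocol ? "
                + "'https://ssl' : 'http://www') + '.x-a.com/xa.js';";
        System.out.println(stringLiterals(src));
        System.out.println(replaceLiteral(src, "'http://www'", "'http://cdn'"));
    }
}
```

Because strings are recognized as whole tokens, the `//` inside 'https://ssl' is never mistaken for a comment, which is the kind of mistake a plain text search-and-replace over script source can make. A real parser stage would then consume the non-hidden tokens to build a tree.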

miku
  • 1
    That grammar is full of errors, try generating a lexer and parser from it: you won't succeed. – Bart Kiers Jun 28 '11 at 18:55
  • 1
    Automatically-constructed parsers for JavaScript are pretty hard to do, thanks to the regex syntax ambiguity. – Pointy Jun 28 '11 at 18:56
  • @Bart, thanks for your input. I've used ANTLR before, but not the linked grammar. Added another which might work better (according to the comments on the ANTLR page) and added a warning message to the other. – miku Jun 28 '11 at 18:59
  • 1
    @miku, yeah, that [first one](http://www.antlr.org/grammar/1206736738015/JavaScript.g) generates a lexer and parser properly, but I don't see the regex-literal, `/ ... /` defined anywhere in it... :) – Bart Kiers Jun 28 '11 at 19:03
  • 1
    The first one doesn't support regular expression literals or semicolon insertion properly (before end of input or curly bracket or `/*...*/` comment containing a line terminator). The second doesn't support semicolon insertion either. – Mike Samuel Jun 28 '11 at 19:30
  • 2
    Automatic semicolon insertion is hard, because in essence it has to be done only if not doing it would lead to a syntax error :-{ For JavaScript, not doing semicolon insertion means the parser is effectively useless. – Ira Baxter Jul 02 '11 at 07:51
  • You can find archived copies of the links here. First link: https://web.archive.org/web/20120622111320/https://www.antlr.org/grammar/1206736738015/JavaScript.g Second link: https://web.archive.org/web/20120614221738/https://www.antlr.org/grammar/1153976512034/ecmascriptA3.g – Melvin Guerrero Mar 29 '22 at 22:58
0

An EcmaScript 5 parser for Java: https://github.com/DigiArea/es5-model

-1

For me, the best solution is running Acorn (https://github.com/marijnh/acorn) under Rhino.

I just don't think Caja is getting attention anymore.

Matthew Kime
  • 1
    In my opinion, this does not answer the question. OP asked for JavaScript parsers written (or at least utilized) in Java. Acorn is meant for usage in JavaScript. – Xiddoc Aug 03 '21 at 22:52