lightweight javascript to javascript parser

Question

How would I go about writing a lightweight javascript to javascript parser. Something simple that can convert some snippets of code.

I would like to basically make the internal scope objects in functions public.

So something like this

var outer = 42;
window.addEventListener('load', function() {
   var inner = 42;
   function magic() {
       var in_magic = inner + outer;
       console.log(in_magic);
   }
   magic();
}, false);

Would compile to

__Scope__.set('outer', 42);
__Scope__.set('console', console);
window.addEventListener('load', constructScopeWrapper(__Scope__, function(__Scope__) {
    __Scope__.set('inner', 42);
    __Scope__.set('magic',constructScopeWrapper(__Scope__, function _magic(__Scope__) {
        __Scope__.set('in_magic', __Scope__.get('inner') + __Scope__.get('outer'));
        __Scope__.get('console').log(__Scope__.get('in_magic'));
    }));
    __Scope__.get('magic')();
}), false);

Demonstation Example

Motivation behind this is to serialize the state of functions and closures and keep them synchronized across different machines (client, server, multiple servers). For this I would need a representation of [[Scope]]

Questions:

Can I do this kind of compiler without writing a full JavaScript -> (slightly different) JavaScript compiler?
How would I go about writing such a compiler?
Can I re-use existing js -> js compilers?

I mentioned the motivation. I need programatic access to closure scope so I can serialize and synchronize closures across different physical machines. — Raynos, Jul 27 '11 at 22:15
Hmmm, but wouldn't you end up serialising almost everything? Lexical scoping is very "parental" as opposed to, say, block scope. You trying to write a decent node module to implement horizontal scalability? — davin, Jul 27 '11 at 22:29
@davin Yes. I would indeed serialize everything. The aim is to be able to have backwards compatibility. — Raynos, Jul 27 '11 at 22:33
You might want to refactor how the compiled code comes out. There is currently no way to distinguish between `myvariable='string'` and `var myvariable='mystring'`. — Lime, Jul 28 '11 at 02:44
@Lime I just wrote a quick demo. I know there are plenty of edge cases I have not handled. — Raynos, Jul 28 '11 at 07:31
@Raynos: I came across this [Parser API from Mozilla](https://developer.mozilla.org/en/SpiderMonkey/Parser_API) today. Haven't looked at it in depth, but perhaps is something like what you're looking for. Found it at [dherman at mozilla](http://blog.mozilla.com/dherman/2011/06/28/the-js-parser-api-has-landed/) blog. — user113716, Jul 28 '11 at 22:22
@patrick_dw that means the compiler will only run on spidermonkey. So no node.js support :( — Raynos, Jul 28 '11 at 22:25
I wish I could add more bounty, because I am incredibly interested in an easish way to actually do this... — Lime, Aug 02 '11 at 03:32
@Raynos: Wow. I looked this up in Wikipedia... which defines it as a compiler that maps source code to source code. I guess that's OK as definition, but I'm astonished that I've never heard this term and I've been building them (I call them "source to source translators") for almost 40 years. (See my bio). But yes, according to this definition, you have to write a "transpiler". I don't see how knowing that helps much; its still a lot of work, and you still need a full JavaScript parser as a minimum. — Ira Baxter, Aug 03 '11 at 09:07
@IraBaxter once I get round to it [I'll write one](https://github.com/kaisellgren/ES-Transpiler), I also have a different set of [motivations for writing one](http://stackoverflow.com/questions/6506519/ecmascriptharmony-es6-to-javascript-compiler). — Raynos, Aug 03 '11 at 09:23

Ira Baxter · Answer 1 · 2011-08-01T06:44:51.810

I don't think your task is easy or short given that you want to access and restore all the program state. One of the issues is that you might have to capture the program state at any moment during a computation, right? That means the example as shown isn't quite right; that captures state sort of before execution of that code (except that you've precomputed the sum that initializes magic, and that won't happen before the code runs for the original JavaScript). I assume you might want to capture the state at any instant during execution.

The way you've stated your problem, is you want a JavaScript parser in JavaScript. I assume you are imagining that your existing JavaScript code J, includes such a JavaScript parser and whatever else is necessary to generate your resulting code G, and that when J starts up it feeds copies of itself to G, manufacturing the serialization code S and somehow loading that up. (I think G is pretty big and hoary if it can handle all of Javascript) So your JavaScript image contains J, big G, S and does an expensive operation (feed J to G) when it starts up.

What I think might serve you better is a tool G that processes your original JavaScript code J offline, and generates program state/closure serialization code S (to save and restore that state) that can be added to/replace J for execution. J+S are sent to the client, who never sees G or its execution. This decouples the generation of S from the runtime execution of J, saving on client execution time and space.

In this case, you want a tool that will make generation of such code S easiest. A pure JavaScript parser is a start but isn't likely enough; you'll need symbol table support to know which function code is connected a function call F(...), and which variable definition in which scope corresponds to assignments or accesses to a variable V. You may need to actually modify your original code J to insert points of access where the program state can be captured. You may need flow analysis to find out where some values went. Insisting all of this in JavaScript narrows your range of solutions.

For these tasks, you will likely find a program transformation tool useful. Such tools contain parsers for the langauge of interest, build ASTs representing the program, enable the construction of identifier-to-definition maps ("symbol tables"), can carry out modifications to the ASTs representing insertion of access points, or synthesis of ASTs representing your demonstration example, and then regenerate valid JavaScript code containing the modified J and the additions S. Of all the program transformation systems that I know about (which includes all the ones at the Wikipedia site), none are implemented in JavaScript.

Our DMS Software Reengineering Toolkit is such a program transformation system offering all the features I just described. (Yes, its big and hoary; it has to be to handle the complexities of real computer languages). It has a JavaScript front end that contains a complete JavaScript parser to ASTs, and the machinery to regenerate JavaScript code from modified or synthesized ASTs. (Also big and hoary; good thing that hoary + hoary is still just hoary). Should it be useful, DMS also provides support for building control and dataflow analysis.

Your DMS page doesn't have a useful "here is where you download / install / try / buy it" page. Not one I could find anyhow. — Raynos, Aug 01 '11 at 06:46
It isn't a tool that you can "download and play with in an afternoon". Contact me offline for a longer discussion; see bio. — Ira Baxter, Aug 01 '11 at 06:47

score 3 · Answer 2 · answered Jul 27 '11 at 22:21

3

I'd try to look for an existing parser to modify. Perhaps you could adapt JSLint/JSHint?

answered Jul 27 '11 at 22:21

hugomg

68,213
24
160
246

I was thinking that, but hacking around with those without understanding the code is a sure-fire way to create bugs. I was hoping there was some parsing technique for not having to write a full compiler – Raynos Jul 27 '11 at 22:24
1

@Raynos: If you want to handle all the issues that JavaScript raises, you should expect to have what amounts to a full JavaScript front end (parser, symbol table construction) [but not a full compiler]. I think the term "lightweight" isn't workable for this. – Ira Baxter Jul 28 '11 at 05:49
@Raynos: so exactly what are you hoping for? You seem to want something that doesn't handle the required complexity, and you don't want to deal with a beast that does. The JSHint/JSLint proposal doesn't require you to write a compiler; it requires you to use a JavaScript front end that already exists. What's the issue? – Ira Baxter Jul 31 '11 at 18:38
@IraBaxter I was curious whether there were any other solutions then take an existing js parser and hack it to meet your needs. – Raynos Jul 31 '11 at 18:42
@Raynos: Given that you have a custom problem, I'd doubt if your solution is just likely to fall on you. So, you have to customize something. Given that you need full JavaScript parsing capability to process Javascript effectively (well, there's a faint chance you might not but that seems like a bad bet and you wouldn't want to be surprised most of way through the process) customizing an existing parser seems like far the best bet. (@missingno: +1) – Ira Baxter Jul 31 '11 at 18:52

score 3 · Accepted Answer · answered Aug 01 '11 at 16:33

If you want something with a simple interface, you could try node-burrito: https://github.com/substack/node-burrito

It generates an AST using the uglify-js parser and then recursively walks the nodes. All you have to do is give a single callback which tests each node. You can alter the ones you need to change, and it outputs the resulting code.

score 3 · Answer 4 · answered Aug 03 '11 at 02:46

3

There is a problem with the rewriting above, you're not hoisting the initialization of magic to the top of the scope.

There's a number of projects out there that parse JavaScript.

Crock's Pratt parser which works well on JavaScript that fits within "The good parts" and less well on other JS.
The es-lab parser based on ometa which handles the full grammar including a lot of corner cases that Crock's parser misses. It may not perform as well as Crock's.
narcissus parser and evaluator. I don't have much experience with this.

There are also a number of high-quality lexers for JavaScript that let you manipulate JS at the token level. This can be tougher than it sounds though since JavaScript is not lexically regular, and predicting semicolon insertion is difficult without a full parse.

My es5-lexer is a carefully constructed and efficient lexer for EcmaScript 5 that provides the ability to tokenize JavaScript. It is heuristic where JavaScript's grammar is not lexically regular but the heuristic is very good and it provides a means to transform a token stream so that an interpreter is guaranteed to interpret it the way the lexer interpreted the tokens so if you don't trust your input, you can still be sure that the interpretation underlying the security transformations is sound even if not correct according to the spec for some bizarre inputs.

answered Aug 03 '11 at 02:46

Mike Samuel

118,113
30
216
245

So your message is, "use a bad (Pratt) parser", or use *just* a lexer (you can't do any serious JavaScript manipulation with this)? I simply don't understand why you think this is an answer. – Ira Baxter Aug 03 '11 at 03:16
@Ira, if the input is pretty standard non-obfuscated JS, then he can use the Pratt parser. If the input is messy JS and performance is not that important, then the es-lab parser is fine. If the transformations falls into the set of things that can be easily handled at a lexical level, then do lexical transformation. You're wrong about not being able to do serious JavaScript manipulation at a lexical level; often it's easier to do it with a parse tree, but sometimes performance tradeoffs justify the engineering effort. – Mike Samuel Aug 03 '11 at 03:45
OP wants to take arbitrary code and manipulate it, if I understand his request. If you have to do any interesting manipulation of a complex language, it is damn hard to do on lexemes. That's the reason people invented parsers. In this particular case, OP needs to know the value of all state variables --> the value of symbols, local or otherwise. I don't see how he can do that without understanding nesting, scopes, ... And you'd be crazy to do that without parsing. What specific complex task have you done with just lexemes? – Ira Baxter Aug 03 '11 at 04:46
@Ira, with lexical analysis only, I have identified all function bodies to assign readable names to anonymous functions and keep a timed log of all function entrances and exits for profiling and dynamic call graph generation. I have also done tracking of catch and finally blocks to allow enforcement of uncatchable exceptions. The value of state variables cannot be determined with any level of static analysis as it's undecidable. You can however associate all symbols with scope by matching curly braces and identifying var declarations and formal parameters all of which can be done lexically. – Mike Samuel Aug 03 '11 at 05:40
if you take lexical tokens and do all that nested brace counting and testing for things that look like function prototypes, you are in essence just building a bad parser. For tasks like the ones you've described, perhaps a bad parser works well enough. As you bring more and more of the language structure into the picture, you get more and more realistic parsing. If you want to do sophisticated transformation (try lifting a statement out of a function) you pretty much need the full parser. I don't think OP will be able to do his task with a parser of the style you suggest. YMMV. – Ira Baxter Aug 03 '11 at 09:18
@Ira, true but with the overhead of a stack, not the overhead of building, traversing, and collecting a tree structure in JavaScript. I agree that inlining requires significant work as does alpha-renaming which is why I used es-lab for [alpha renaming](http://code.google.com/p/es-lab/source/browse/trunk/src/util/alpharename.js) but there's a surprising amount you can do with lexing if you assume a valid program and don't care about associativity and precedence within expressions. I believe what the OP wants to do can be done lexically by adding an object & `with` per scope & hoisting decls. – Mike Samuel Aug 03 '11 at 16:47

score 0 · Answer 5 · edited May 23 '17 at 11:44

0

Your problem seams to be in same family of problems as what is solved with the JS Opfuscators and JS Compressors -- they as well as you need to be able to parse and reformat the JS to an equivalent script;

There was a good discussion on obfuscators here and the possible solution to your problem could be to leverage the parse and generator part from one of the FOSS versions.

One callout, your example code does not take into account the scopes of the variables you want to set/get and that will eventually become a problem that you will have to solve.

Addition

Given the scope problem for closure defined functions, you are probably unlikely to be able to solve this problem as a static parsing problem, as the scope variables outside the closure will have to be imported/exported to resolve/save and re-instantiate scope. Hence you may need to dig into the evaluation engine itself, and perhaps get the V8 engine and make a hack to the interpreter itself -- that is assuming that you do not need this to be generic cross all script engines and that you can tie it down to a single implementation which you control.

edited May 23 '17 at 11:44

Community

1
1

answered Aug 05 '11 at 22:33

Soren

14,402
4
41
67

`Magic` is hard with closure definitions..so you will have to create some artificial ID for each instance of each closure, and then also figure out a way of how to import/export enviorinment variables which are available to each closure instance. – Soren Aug 06 '11 at 01:21
@Scren closures aren't that difficult. You just keep a solid reference to how the scope chain is set up. The real problem is making sure `eval` and `with` don't break. That's going to be a right pain that I might just ignore. – Raynos Aug 06 '11 at 08:29
I think you are right -- however that also means that you can only solve this by complete parsing and keeping tracking of variable-names etc, so that you can create context blocks for each access point. Hence, you need to leverage one of the parsers either from one of the obfuscator/compressors, or by taking relevant parts from V8 – Soren Aug 06 '11 at 13:53

lightweight javascript to javascript parser

5 Answers5

Linked