16

Hey folks, thanks for reading

I am currently attempting to build a Google-style calculator: you input a string, it determines whether it can be calculated, and returns the result.

I began slowly with the basics: + - / * and parenthesis handling.

I intend to improve the calculator over time, and having learned a bit about lexical analysis a while ago, I built a list of tokens and associated regular expression patterns.

This kind of work is easily done with tools such as Lex and Yacc, except that I am developing a Javascript-only application.

I tried to translate the idea into Javascript, but I can't figure out how to handle everything in a clean and elegant way, especially nested parentheses.


Analysis

Let's define what a calculator query is:

// NON TERMINAL EXPRESSIONS //
query     -> statement
query     -> ε // means end of query

statement -> statement operator statement
statement -> ( statement )
statement -> prefix statement
statement -> number

number    -> integer
number    -> float

// TERMINAL EXPRESSIONS //
operator  -> [+*/%^-]

prefix    -> -

integer   -> [0-9]+

float     -> [0-9]+[.,][0-9]+

Javascript

Lexical analysis consists of verifying that the query contains nothing that doesn't look like one of the terminal expressions: operators, prefixes, integers and floats. This can be shortened to one regular expression:

(I added spaces to make it more readable; note that the float alternative comes before the integer one, so a float is consumed whole rather than split at the decimal separator)

var calcPat = 
/^ (\s*
    ( ([0-9]+[.,][0-9]+) | ([0-9]+) | ([+/*%^-]) | (\() | (\)) )
)+ \s* $/;
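
For illustration, the same alternatives can drive a small tokenizer instead of a single yes/no test. This is only a sketch of mine (the `tokenize` function and the token type names are my own, not part of the original code); the float alternative comes before the integer one so that 3.14 is matched as one token:

```javascript
// One capture group per token type; float must precede integer.
var tokenPat = /\s*(?:([0-9]+[.,][0-9]+)|([0-9]+)|([+/*%^-])|(\()|(\)))/g;

function tokenize(query) {
  var tokens = [], m, pos = 0;
  tokenPat.lastIndex = 0;
  while ((m = tokenPat.exec(query)) !== null) {
    // A gap between matches means a character no alternative accepts.
    if (m.index !== pos) return null;
    pos = tokenPat.lastIndex;
    if (m[1] !== undefined)      tokens.push({ type: 'float',    text: m[1] });
    else if (m[2] !== undefined) tokens.push({ type: 'integer',  text: m[2] });
    else if (m[3] !== undefined) tokens.push({ type: 'operator', text: m[3] });
    else if (m[4] !== undefined) tokens.push({ type: 'lparen',   text: m[4] });
    else                         tokens.push({ type: 'rparen',   text: m[5] });
  }
  // Anything left over besides whitespace is a lexical error.
  if (pos < query.length && query.slice(pos).trim() !== '') return null;
  return tokens;
}
```

A grammar checker can then work on the token list instead of the raw string.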

If this test passes, the query is lexically correct and needs to be grammar-checked to determine whether it can be calculated. This is the tricky part.

I am not going to paste my code because it is neither clean nor easily understandable, but I will explain the process I followed and why I'm stuck:

I created a method called isStatement(string) that is supposed to call itself recursively. The main idea is to split the string into 'potential' statements and check whether they really are statements and together form one.
The process is the following:

- If the first two tokens are a number followed by an operator:
  - If the remainder is just one token and it is a number:
    - Then this is a statement.
  - Else, check whether the remaining tokens form a statement (recursive call).
- Else, if the first token is an opening parenthesis:
  - Find the matching closing parenthesis and check whether what's inside is a statement (recursion).
  - Also check whether there is something after the closing parenthesis, and whether it forms a statement when combined with the parenthesized structure.
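
The steps above can be sketched roughly like this, assuming the query has already been cut into token objects such as { type: 'integer' } (the token representation and the depth-counting loop are my own assumptions, not the actual code):

```javascript
// Rough sketch of the recursive check described above.
function isStatement(tokens) {
  if (tokens.length === 0) return false;
  var t0 = tokens[0];
  if (t0.type === 'integer' || t0.type === 'float') {
    if (tokens.length === 1) return true;              // a lone number
    return tokens[1].type === 'operator' &&
           isStatement(tokens.slice(2));               // number op statement
  }
  if (t0.type === 'lparen') {
    // Find the MATCHING ')' by counting nesting depth.
    var depth = 0;
    for (var i = 0; i < tokens.length; i++) {
      if (tokens[i].type === 'lparen') depth++;
      else if (tokens[i].type === 'rparen' && --depth === 0) {
        if (!isStatement(tokens.slice(1, i))) return false;
        if (i === tokens.length - 1) return true;      // "( S )"
        return tokens[i + 1].type === 'operator' &&    // "( S ) op S"
               isStatement(tokens.slice(i + 2));
      }
    }
    return false;                                      // unbalanced parens
  }
  return false;
}
```
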


What's the problem?

My problem is that I cannot find the matching parenthesis when there are nested structures. How can I do that? Also, as you can see, this is not a particularly generic and clean grammar-checking algorithm. Do you have any ideas for improving this pattern?

Thank you so much for taking the time to read everything. Gaël

(PS: As you probably noticed, I am not a native English speaker! Sorry for any mistakes!)

Toon Krijthe
Gabriel S.
    You might want to give this tool a try: http://pegjs.majda.cz/online – Bart Kiers Jan 18 '11 at 16:50
  • That's pretty cool, but (judging from what that's called) it produces PEG parsers (packrat parsers I guess), which are really a whole 'nuther thing. The classic "lexer + parser" approach is for building LL or LR parsers for context-free grammars (or almost context-free), and PEG describes a different class of language than CFGs. – Pointy Jan 18 '11 at 17:34
  • Oh, and of particular note is the fact that with a packrat parser for a PEG language, you don't need a lexical analyzer at all :-) – Pointy Jan 18 '11 at 17:36
  • @bart-kiers: great tool, thanks. I tried to build a parser myself a couple years ago and gave up (thankfully, it was just a hobby, not any job requirement). Maybe I'll have to pull that project back off the shelf. – keithjgrant Jan 18 '11 at 17:48
  • @bart-kiers: Thanks for the link, I managed to build a working parser out of it, but since I don't know how it works and other legal stuff, I can't just start making money on their back like that ! – Gabriel S. Jan 18 '11 at 18:11

1 Answer

10

You've got the right idea about what lexical analysis is, but you seem to have gotten confused about the distinction between the token grammar and the language grammar. Those are two different things.

  • The token grammar is the set of patterns (usually regular expressions) that describe the tokens for the language to be parsed. The regular expressions are expressions over a character set.

  • The language grammar (or target grammar, I suppose) is the grammar for the language you want to parse. This grammar is expressed in terms of tokens.

You cannot write a regular expression to parse algebraic notation. You just can't. You can write a grammar for it, but it's not a regular grammar. What you want to do is recognize separate tokens, which in your case could be done with a regular expression somewhat like what you've got. The trick is that you're not really applying that expression to the overall sentence to be parsed. Instead, you want to match a token at the current point in the sentence.

Now, because you've got Javascript regular expressions to work with, you could come up with a regular expression designed to match a string of tokens. The trick with that will be coming up with a way to identify which token was matched out of the list of possibilities. The Javascript regex engine can give you back arrays of groups, so maybe you could build something on top of that.
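
A sketch of that idea, under my own assumptions (one capture group per token type, in a fixed order): the index of the group that comes back defined tells you which kind of token matched at the current position.

```javascript
// The group order here mirrors the order of the alternatives below.
var tokenTypes = ['float', 'integer', 'operator', 'lparen', 'rparen'];
var combined = /\s*(?:([0-9]+[.,][0-9]+)|([0-9]+)|([+/*%^-])|(\()|(\)))/g;

// Match one token starting at position `from`; null on a lexical error.
function nextToken(input, from) {
  combined.lastIndex = from;
  var m = combined.exec(input);
  // A match that starts later than `from` means `from` points at garbage.
  if (m === null || m.index !== from) return null;
  for (var g = 1; g < m.length; g++) {
    if (m[g] !== undefined) {
      return { type: tokenTypes[g - 1], text: m[g], end: combined.lastIndex };
    }
  }
}
```

Calling nextToken repeatedly, feeding each result's `end` back in as the next `from`, walks the whole sentence token by token.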

edit — I'm trying to work out how you could put together a (somewhat) general-purpose tokenizer builder, starting from a list of separate regular expressions (one for each token). It's possibly not very complicated, and it'd be pretty fun to have around.
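
Such a builder might look roughly like this; the { type, pattern } rule shape is just one possible choice of mine, not an established API:

```javascript
// Build a tokenizer from a list of { type, pattern } rules.
// Patterns are anchored RegExps tried in order, so put longer
// matches (float) before their prefixes (integer).
function makeLexer(rules) {
  return function (input) {
    var tokens = [], pos = 0;
    while (pos < input.length) {
      var ws = /^\s+/.exec(input.slice(pos));
      if (ws) { pos += ws[0].length; continue; }   // skip whitespace
      var matched = false;
      for (var i = 0; i < rules.length; i++) {
        var m = rules[i].pattern.exec(input.slice(pos));
        if (m) {
          tokens.push({ type: rules[i].type, text: m[0] });
          pos += m[0].length;
          matched = true;
          break;
        }
      }
      if (!matched) return null;                   // lexical error
    }
    return tokens;
  };
}

var lex = makeLexer([
  { type: 'float',    pattern: /^[0-9]+[.,][0-9]+/ },
  { type: 'integer',  pattern: /^[0-9]+/ },
  { type: 'operator', pattern: /^[+/*%^-]/ },
  { type: 'lparen',   pattern: /^\(/ },
  { type: 'rparen',   pattern: /^\)/ }
]);
```
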

Pointy
  • Indeed, otherwise you'll fall into the trap of the classic answer to [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Marcel Korpel Jan 18 '11 at 16:51
  • @Pointy Thank you for your time. Thinking and surfing about parsers and tokens, I came across this article from D. Crockford which I remembered reading a long time ago without understanding anything at all. Now it makes much more sense. Do you think it is worth delving deeper into this and actually trying to build my own parser? http://javascript.crockford.com/tdop/tdop.html – Gabriel S. Jan 18 '11 at 18:05
  • @Marcel Korpel: Thanks for that piece of advice ! It kinda makes me want to ask how to parse XHTML just to see people's reactions though :D – Gabriel S. Jan 18 '11 at 18:07
  • @Gaël building parsers is very educational and a lot of fun; it's a good way to exercise program design skills. As to that particular parsing technique, well, I don't know that much about it. I suppose it's as worthwhile as anything else. – Pointy Jan 18 '11 at 18:10
  • @Pointy I finally managed to build my own parser using the link I posted above. To think that I can use this to parse Javascript itself is quite stunning and was indeed a lot of fun ! Thanks ! – Gabriel S. Jan 19 '11 at 18:43
  • Cool, I'm glad it worked out. I spent an hour or so yesterday putting together a simple function to generate a lexical analyzer object from a list of token regular expressions. It was a fun project, though I don't know if I'll ever use the code :-) – Pointy Jan 19 '11 at 18:45
  • @Gaël - You might be interested in this project: https://github.com/aaditmshah/lexer It generates a dynamic lexer from a set of rules (patterns + actions). It also supports start conditions, multiple return values and global patterns. – Aadit M Shah Jan 27 '13 at 02:27