
I'm attempting to build my first C-like programming language, likely as an interpreter, and I've just finished the first step: the lexer.

I've thought about taking the lazy route of simply lexing the entire source code stream all at once and then having the parser process the resulting data.

I've noticed that many other compilers and interpreters only lex during parsing, when the parser module asks for another token.

Is it faster, in terms of program performance, to lex the entire source code at once and then parse the resulting tokens, or to lex and parse one token at a time?

Nergal
  • It is faster to lex on demand. Otherwise you are adding a huge latency: no parsing can even start until lexing is complete. – user207421 Jan 02 '17 at 06:16
  • Why exactly is that faster, though? I'm trying to understand the theory. – Nergal Jan 02 '17 at 06:22
  • Overall I think this is a good question, but it is a bad fit for StackOverflow. StackOverflow expects specific questions, this is a very broad question trying to cover all of the ways lexing can occur. As such I am voting to close. – Guy Coder Jan 02 '17 at 13:28
  • To give somewhat of an answer, it depends. Some languages/grammars are designed so that you do not have to lex/parse the entire input to get some output. For example, one reason RTF was designed the way it was is that memory was scarce: text files of hundreds or thousands of pages could be read by lexing/parsing only the header and then scanning ahead to certain tags that delimited pages. Once at a page tag, a full lex/parse of that page could be done. – Guy Coder Jan 02 '17 at 13:38

1 Answer


"faster" is a bit of a fuzzy word. There are different kinds of speed (latency, absolute start-to-finish duration, compile speed, execution speed), and depending on how you implement your language's front-end and backend, either approach can be faster.

Also, faster is not always better. If your parser is technically faster but uses too much memory, it could crash, or at the least end up swapping, which would slow it down again. If your parser is lightning-fast but produces inefficient code, your users will pay for your faster development speed. You'll have to write actual code, run it in a profiler, and decide which criteria matter to you before you can tell what is really better.

Tokenizing/Lexing everything at once at the start means you might be able to optimize memory allocation and thus take less time resizing your token list etc., but it also means the entire file has to be lexed before it can even be partially parsed.

OTOH if you parse as-needed, you may have to append to your arrays in small steps more often, so you'll pay a memory penalty, but in case of e.g. an interpreted language like JavaScript, you may only have to parse the parts that are actually used for this run-through.

So a lot of it depends on the details of your language, and the hardware you expect to be running on. On embedded systems with little memory and no swap, you may have no choice but to lex progressively, as the whole program source code might not fit in memory. If your language's syntax needs a lot of lookahead, you may not see any benefit from progressively lexing, because you're reading it all anyway...

uliwitness
  • Oh, your question might be slightly related to http://stackoverflow.com/questions/24306893/what-is-the-difference-between-compilation-and-interpretation/24308305#24308305 – uliwitness Jan 06 '17 at 18:46
  • I should say that my target hardware could possibly include both embedded and large systems. My language isn't exactly going to be a "one-size-fits-all" language, but I'd definitely prefer that it have low-level features and enough speed that you could write OS kernels with it. – Nergal Jan 06 '17 at 20:01
  • Note that there is a difference between what your dev environment will be and what your deployment platform will be. It's fine if your compiler only cross-compiles to an embedded system in most cases, but e.g. if your deployment platform is a web server, and your language is intended for implementing CGI as interpreted scripts, it has to run on whatever your server hardware is. Still, even then, few people try to run servers like that on actual embedded hardware. I'd save optimizing your interpreter for that for version 2.0, unless you have an acute need right now. – uliwitness Jan 08 '17 at 07:34