How do C/C++ parsers work?

Question

I've spent much time researching how PHP's parser works:

it translates PHP code to finally c code.

But how's c code translated to executables?

BTW, how to judge whether language A can be converted to language B somehow from mathematical aspect?

c/c++ don't use parsers, they're called compilers. There is a big difference. PHP code is not translated into c code. A program possibly written in C interprets the PHP code and produces the output. — squillman, Jan 21 '11 at 02:56
@EboMike And there's PHP to Java: http://www.caucho.com/resin-3.0/quercus/ — chrisaycock, Jan 21 '11 at 03:05
@squillman: C compilers most definitively use parsers. [GNU Bison](http://en.wikipedia.org/wiki/GNU_bison) is a common choice for a parser generator, which reads a language spec and outputs a parser. — André Paramés, Jan 21 '11 at 03:06
@squilman- parsing is one step in compilation. your statement is most definitely incorrect. — Dhruv Gairola, Jan 21 '11 at 03:12
@yoyo A parser parses the text. A compiler parses the text, maps the registers, optimizes the IR, generates the object code, etc. As @Dhruv says, parsing is the first step. — chrisaycock, Jan 21 '11 at 03:17
@André @Dhruv : yes, but I was going more high-level which is what I read into this question. You are correct, of course. — squillman, Jan 21 '11 at 03:18
@squillman: yes, but by that logic, PHP wouldn't either: it uses an interpreter. Of course, both the interpreter and the compiler actually have/use a parser. — André Paramés, Jan 21 '11 at 03:20

score 10 · Answer 1 · answered Jan 21 '11 at 03:10

This is a really great and really deep question that draws on a lot of parts of computer science.

Ultimately, all programs on a computer execute by issuing instructions to the processor in machine code. There is no one "machine code," and every processor has its own set of instructions that can be executed. These are usually low-level operations like "load a value into memory" or "add two values together." In theory, every program can be written in machine code, but this is rarely the case. Machine code is essentially series of zeros and ones that are decoded in particular ways by the processor, and it would be all but impossible to build any complex system directly this way.

One step above machine code is assembly language, a very low-level macro-esque language that usually has a one-to-one mapping with the machine code. For example, you might have commands like "add" that do addition, "sub" for subtraction, or "call" for function calls. Ultimately, the code is converted to machine code using an assembler, a program that translates the assembly to machine code. It's possible to build big complex systems in assembly, but it's very difficult.

Many programming languages like C and C++ are compiled, which means that a special program called the compiler translates the source code down into assembly language, which can then be converted directly to machine code. In this way, you can program code that works at a high level - it can have variables, functions, objects, templates, exceptions, etc. - but which can operate directly on machine hardware. Other programming languages are interpreted, which means that a special program called an interpreter parses the source code, builds up some in-memory representation of it, and then translates it to assembly either indirectly (by using the program to control what instructions to execute) or directly (by generating assembly on an as-needed basis).

The theory of how to convert from one language into another has been extensively studied. There are many challenges, ranging from "how do you even look at the source code of the program and understand what you're looking at?" to "what is the most efficient way to convert this program into some other language?" The former involves lexing, parsing, and semantic analysis; the latter involves optimization and code generation.

Typically, a program in any language can be converted into an equivalent program in another language, though there can be a notable loss in efficiency. Some programming languages have special functions that access underlying hardware and thus can't be written in languages that don't have access to that hardware, but this is rarely the case. One typical measure of whether a program can be rewritten in another language is to ask whether the two languages are Turing-complete, a mathematical term indicating whether the programming language is expressive enough to encode particular classes of functions.

Hope this helps!

score 7 · Answer 2 · answered Jan 21 '11 at 03:14

PHP doesn't really 'translate' to C code; the PHP interpreter interprets PHP while running executable code, and the PHP interpreter is a state machine that knows how to execute all the PHP in the process. No intermediate C is needed or desired. Because it is interpreted, every time the PHP executable interprets the PHP program, it is re-evaluated.

The PHP interpreter is written in C, but it could have been in C++ or Assembly or Pascal or Erlang or bash or Java or whatever else you might wish. (I think it started out in Perl, but my memory is getting fuzzy.)

C is compiled with a compiler, which is run once before the program may be run thousands of times. Most C compilers make several 'passes': lexing input into tokens, parsing the tokens into a tree, then modifying the Abstract Syntax Tree to generate symbol tables for each of the scopes of execution. After the abstract syntax tree has been subjected to various optimizations such as dead code removal and static single assignment, the tree is passed to a code generator which will generate the required object file for the input that can run on the target architecture in question. The object file is linked with the linker to objects (functions and variables in C) not defined in that specific translation unit so that the program can be loaded by the linker/loader at run time.

The dragon book is the usual best source for learning about compilers, but I recommend Pragprog's Language Implementation Patterns instead.

+1 for the dragon book, was going to mention that. I highly recommend it. — Cam, Jan 21 '11 at 03:18
Can you elaborate how code generator is done,IMO,the most significant difference between parser/code generator is that: code generator maps language A to language B,while a parser works within the same language syntax. — yoyo, Jan 21 '11 at 03:33
@yoyo: No. If you would have a 'compiler' that translates language A to language A, then the parser and the code generator perform exactly opposite tasks. The parser converts a text in language A to a set of data structures (the internal representation, IR). The code generator takes such a set of data structures and converts that to a "text" in language B. (Text in quotes, because language B can use a binary representation, such a machine code) — Bart van Ingen Schenau, Jan 21 '11 at 09:42

Dhruv Gairola · Answer 3 · 2014-02-11T16:53:34.250

From my understanding, your main problem seems to be about the compilation process. As you mentioned in your comments, you're confused about a parser and compiler. Let me help you a bit:

alt text

Parsing is just one of the steps in the compilation process. To better understand your question, you must first understand how compilers work. Generally, the above few steps are usually employed by compilers. To understand the above will take a bit of work. If you want to delve further, read the lectures from this link. The source and target code depend on the context. Usually the source code is a high level language, and the target code is machine code.

score 2 · Answer 4 · answered Jan 21 '11 at 03:03

2

how to judge whether language A can be converted to language B somehow from mathematical aspect?

If both languages are Turing complete, then one can be translated into another.

As for your PHP to C assumption, there are "source to source" compilers like HipHop, but this isn't the common case. Most dynamically typed languages are compiled to a byte code and run on a virtual machine.

As for C, the compiler translates it essentially to assembly language for the target processor.

If you want to know more, you can read-up about compiler design, abstract syntax trees, and language semantics. It's a lot to take-in at once if you're new to it though, so Stack Overflow really isn't the best place to get started with a topic this big.

answered Jan 21 '11 at 03:03

chrisaycock

36,470
14
88
125

Not true. The only languages which run on a virtual machine are languages on the JVM or CLR. Most languages are in fact interpreted. – Billy ONeal Jan 21 '11 at 03:07
3

@Billy Whaaa??? That's news to these guys: [Parrot VM](http://www.parrot.org/) (Perl 6), [YARV](http://en.wikipedia.org/wiki/YARV) (Ruby), [Python VM](http://www.troeger.eu/teaching/pythonvm08.pdf). Now shell languages are definitely interpreted, but JVM and CLR are hardly the only VMs out there. – chrisaycock Jan 21 '11 at 03:11
1

@Billy ONeal, Erlang, Dalvik, LLVM, Lua, Forth, Flash, Smalltalk, SQLite, Squeek, Valgrind, Zend too.. Most Computer Science students have written both language interpreters and virtual machines, but for toy languages their professors designed to make the task approachable. :) – sarnold Jan 21 '11 at 03:21
1

@Billy, to complement crisaycock's answer, LLVM is also a virtual machine, even though it's typically used to compile C/C++ code ahead of time. – André Paramés Jan 21 '11 at 03:22
Smalltalk, Lua, Zend, and SQLite are not VMs. Smalltalk can be run on something like a VM, but this is not apparent from the language itself (and in Smalltalk-80's heyday, it certainly didn't use anything like a VM). Lua is interpreted. Zend is an opcode cache -- it merely caches some of the parsing of PHP; it is not a VM. SQLite is completely interpreted. Not sure about the others. My point was not so much specifically about VMs and more that the unqualified "most ... languages" is rubbish. – Billy ONeal Jan 21 '11 at 03:47
Moreover, the concept of a VM as a whole is transparent to the user of a language -- the language behaves as if it were interpreted. A VM is merely a performance optimization, and when answering a question about how parsing and compilers operate, it's simply opening up a huge can of worms that is irrelevant to the question of how code is parsed. – Billy ONeal Jan 21 '11 at 03:48
@Billy Lua [ [1](http://en.wikipedia.org/wiki/Lua_(programming_language)#Internals)] and Smalltalk [ [2](http://en.wikipedia.org/wiki/Smalltalk#Just-in-time_compilation)] use VMs. You are correct about SQLite. – chrisaycock Jan 21 '11 at 04:12
@chrisaycock: I stand corrected on Lua. W.r.t. Smalltalk, that might be true of *current* smalltalk compilers, but certainly was not true in 1980 when Smalltalk was standardized. – Billy ONeal Jan 21 '11 at 04:48
and Ershov told how to build arbitrary compilers from A to B. First, write an interpreter I_A(B) for B in A. (You can do this because Turing-equivalence means one can simulate the other). Then for any program b in B, you can compile it by computing P(I_A(b)), where P is *partial evaluation*. The margin isn't big enough to explain partial evaluation is well, but it in essence P(X) symbolically simplifies X whereever it find an oppportunity to do so. E.g., The interpreter may have "if x then y"; P will change that to "y" if x is true in X. See http://en.wikipedia.org/wiki/Partial_evaluation – Ira Baxter Jan 22 '14 at 11:29

How do C/C++ parsers work?

4 Answers4

Linked