Python interpretation model in comparison to direct and virtual machine compilation

Question

I have been compiling diagrams (pun intended) in hope of understanding the different implementations of common programming languages. I understand whether code is compiled or interpreted depends on the implementation of the code, and is not an aspect of the programming language itself.

I am interested in comparing Python interpretation with direct compilation (e.g. C++)

and the virtual machine model (e.g. Java or C#).

In light of the two diagrams above, could you please help me develop a similar flowchart of how a .py file is converted to .pyc, uses the standard libraries (I gather they are called modules) and is then actually run? Many programmers on SO indicate that Python, as a scripting language, is not executed by the CPU but rather by the interpreter, but that sounds quite impossible because ultimately hardware must be doing the computation.

Your diagrams are not detailed enough to show the difference between Python and Java. Just replace .java with .py and .class with .pyc. — cababunga, Jun 29 '12 at 18:56
@cababunga I see, so there is a Python compiler and a Python virtual machine? — jII, Jun 29 '12 at 18:58

score 3 · Accepted Answer · 2012-06-29T21:10:23.103

First off, this is an implementation detail. I am limiting my answer to CPython and PyPy because I am familiar with them. Answers for Jython, IronPython, and other implementations will differ - probably radically.

Python is closer to the "virtual machine model". Python code is, contrary to the statements of some too-loud-for-their-level-of-knowledge people and despite everyone (including me) conflating it in casual discussion, never interpreted directly from source. It is always compiled to bytecode (again, on CPython and PyPy) when it is loaded. If the code was loaded from a .py file because a module was imported, a .pyc file may be created to cache the compilation output. This step is not mandatory; you can turn it off via various means, and program execution is not affected the tiniest bit (except that the next process to load the module has to compile it again). However, the compilation to bytecode is not avoidable: the bytecode is generated in memory if it is not loaded from disk.
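To illustrate the "bytecode is generated in memory" point, here is a minimal sketch using the builtin compile() and exec(). The source string and the file name "<in-memory>" are made up for the example; no .py or .pyc file is touched.

```python
import types

source = "answer = 6 * 7\n"

# compile() turns source text into a code object (bytecode plus metadata)
# entirely in memory; the second argument is only a label for tracebacks.
code = compile(source, "<in-memory>", "exec")

# Run the compiled code the way a module body is run: top to bottom,
# with a dict acting as the module's global namespace.
namespace = {}
exec(code, namespace)

print(isinstance(code, types.CodeType))  # a real code object
print(len(code.co_code) > 0)             # raw bytecode bytes are present
print(namespace["answer"])
```

As for the "various means" of switching off .pyc writing, two that I am sure exist on CPython are the -B command line flag and setting sys.dont_write_bytecode = True; neither changes what the program computes.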

This bytecode (the exact details of which are an implementation detail and differ between versions) is then executed, at module level, which entails building function objects, class objects, and the like. These objects simply reuse (hold a pointer to) the bytecode which is already in memory. This is unlike C++ and Java, where code and classes are set in stone during/after compilation. During execution, import statements may be encountered. I lack the space, time and understanding to describe the import machinery, but the short story is:
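A small sketch of the "function objects reuse the bytecode" point, assuming CPython (make_adder and add are names invented for the example). The inner def is executed each time make_adder() runs, so you get a new function object every time, but each of them points at the same code object that was compiled once.

```python
def make_adder(n):
    # This def statement runs every time make_adder() is called and
    # builds a fresh function object around already-compiled bytecode.
    def add(x):
        return x + n
    return add

add_one = make_adder(1)
add_ten = make_adder(10)

# Two distinct function objects created at runtime...
print(add_one is add_ten)
# ...that hold a reference to one and the same code object.
print(add_one.__code__ is add_ten.__code__)
print(add_one(5), add_ten(5))
```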

- If it was already imported once, you get that module object (another runtime construct for a thing static languages only have at compile time). A couple of builtin modules (well, all of them in PyPy, for reasons beyond the scope of this question) are already imported before any Python code runs, simply because they are so tightly integrated with the core of the interpreter and so fundamental. sys is such a module. Some Python code may also run beforehand, especially when you start the interactive interpreter (look up site.py).
- Otherwise, the module is located. The rules for this are not our concern. In the end, these rules arrive at either a Python file or a dynamically-linked piece of machine code (.DLL on Windows, though Python modules specifically use the extension .pyd but that's just a name; on unix the equivalent .so is used).
  - The module is first loaded into memory (loaded dynamically, or parsed and compiled to bytecode).
  - Then, the module is initialized. Extension modules have a special function for that which is called. Python modules are simply run, from top to bottom. In well-behaved modules this just sets up global data, defines functions and classes, and imports dependencies. Of course, anything else can also happen. The resulting module object is cached (remember step one) and returned.
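The caching described in the steps above can be observed through sys.modules, which is where CPython keeps already-imported module objects. A minimal sketch (json is just an arbitrary standard library module picked for the example):

```python
import importlib
import sys

# sys is compiled into the interpreter itself rather than found on disk.
print("sys" in sys.builtin_module_names)

import json                    # located, loaded and initialized if not cached yet
first = sys.modules["json"]    # the cached module object

import json as json_again      # a second import is served from the cache
print(json_again is first)

# importlib.import_module goes through the same cache.
print(importlib.import_module("json") is first)
```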

All of this applies to standard library modules as well as third party modules. That's also why you can get a confusing error message if you give a script of yours the same name as a standard library module which you import in that script (it imports itself, albeit without crashing due to caching - one of many things I glossed over).

How the bytecode is executed (the last part of your question) differs. CPython simply interprets it, but as you correctly note, that doesn't mean it magically doesn't use the CPU. Instead, there is a large ugly loop which detects what bytecode instruction shall be executed next, and then jumps to some native code which carries out the semantics of that instruction. PyPy is more interesting; it starts off interpreting but records some stats along the way. When it decides it's worth doing so, it starts recording what the interpreter does in detail, and generates some highly optimized native code. The interpreter is still used for other parts of the Python code. Note that it's the same with many JVMs and possibly .NET, but the diagram you cite glosses over that.
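To give a feel for the "large ugly loop", here is a toy stack machine. It is only a sketch of the general technique: the opcodes PUSH, ADD and MUL are invented for the example and are not CPython's instruction set, and the real loop is written in C.

```python
def run(program):
    """Execute a list of (opcode, argument) pairs on a tiny stack machine."""
    stack = []
    pc = 0  # index of the next instruction to execute
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        # Dispatch: decide which instruction this is, then carry out its
        # semantics using ordinary (ultimately native) code.
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "MUL":
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack.pop() if stack else None

# (2 + 3) * 4 expressed as stack machine instructions
program = [("PUSH", 2), ("PUSH", 3), ("ADD", None), ("PUSH", 4), ("MUL", None)]
print(run(program))  # 20
```

The CPU is busy the whole time; it is just executing the loop and the handlers rather than machine code generated from your program.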

Phil Cooper · Answer 2 · 2012-06-29T20:41:40.977

1

For the reference implementation of python:

(.py) -> python (checks for .pyc) -> (.pyc) -> python (execution dynamically loads modules)
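A rough way to see the (.py) -> (.pyc) step for yourself, assuming Python 3 (where the cache normally lives in a __pycache__ directory; Python 2 wrote the .pyc next to the source). The file name hello.py is made up for the example.

```python
import importlib.util
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source_path = os.path.join(tmp, "hello.py")
    with open(source_path, "w") as f:
        f.write("print('hello')\n")

    # Where the import system would look for (and store) the cached bytecode.
    expected_pyc = importlib.util.cache_from_source(source_path)

    # Compile explicitly; a normal import does the equivalent on demand.
    pyc_path = py_compile.compile(source_path, doraise=True)

    pyc_exists = os.path.exists(pyc_path)
    pyc_matches = os.path.abspath(pyc_path) == os.path.abspath(expected_pyc)
    print(pyc_path)
    print(pyc_exists, pyc_matches)
```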

There are other implementations. Most notable are:

- jython which compiles (.py) to (.class) and follows the java pattern from there
- pypy which employs a JIT as it compiles (.py). the chain from there could vary (pypy could be run in cpython, jython or .net environments)

edited Jun 29 '12 at 20:41

answered Jun 29 '12 at 18:58

Phil Cooper


How does PyPy "compile .py to more .py"? It does almost exactly the same thing as CPython, as far as .py and .pyc files are concerned. – Jun 29 '12 at 19:04
@delnan the link explains but it uses a JIT. Runs faster than CPython (http://speed.pypy.org/). Uses a lot of dark magic. check out http://stackoverflow.com/questions/2591879/pypy-how-can-it-possibly-beat-cpython – Phil Cooper Jun 29 '12 at 19:11

I know the JIT, and the translation toolchain which produces it, about as well as one can without dabbling with the source code. However, the JIT does not compile .py files to .py files, and does not even start tracing at the stage you discuss. – Jun 29 '12 at 19:38
(takes one step backward) you are correct, it does not create another (.py). I'm not a user of it as it is a little too young for me. just wanted to highlight that other implementations in use have different answers to your question. – Phil Cooper Jun 29 '12 at 19:46
(1) It's not my question. In fact, I answered myself. (2) I don't use it either for serious uses, albeit only because of compatibility (Python 3 and possibly some libraries). (3) Outlining it's an implementation detail is good, but writing nonsense to that effect isn't. Why don't you remove that part? – Jun 29 '12 at 19:51
@delnan re: "... but writing nonsense to that effect isn't. Why don't you remove that part?" I will happily remove whatever you consider "nonsense". I think you refer to my 2nd comment in this chain (perhaps I should remove JIT from my first comment, but I can't edit comments and the two links there may be informative to others). You may also be referring to the (.py)->(.py) in my original answer, and perhaps I should correct that. I will subsequently delete this comment, its predecessor and anything else you think made no sense, to clean up the comment thread. BTW +1 for your detailed answer – Phil Cooper Jun 29 '12 at 20:24
I'm just talking about the .py -> .py part, the rest isn't nonsense. The mention of the JIT is useful (you could move it to your answer so you got at least a bit of an answer on that part of the question). We don't need to remove any comments if you ask me (documentation), but that's your decision. I'll also remove my downvote once you edited (I'd do it now but I can't). – Jun 29 '12 at 20:30

score 0 · Answer 3 · answered Jun 29 '12 at 18:58

Python is technically a scripting language, but it is also compiled. The source is read from its file and fed into the interpreter, which compiles it to bytecode. That bytecode is either kept in memory and thrown away afterwards, or saved to disk as a .pyc file.

Yes, Python runs on a virtual machine that sits on top of the actual hardware. Python bytecode is just a series of instructions for the PVM (Python virtual machine), much like assembler is for the actual CPU.
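If you want to look at those instructions, the standard library dis module will list them. A minimal sketch (the function add is made up for the example; the exact opcode names differ between CPython versions, so no expected output is shown):

```python
import dis

def add(a, b):
    return a + b

# Each instruction has an opcode name and, optionally, an argument,
# loosely comparable to a line of assembler.
for instr in dis.get_instructions(add):
    print(instr.opname, instr.argrepr)

opnames = [instr.opname for instr in dis.get_instructions(add)]
```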

With regard to your last point, does the pvm output native machine code for the cpu much like the jvm? — jII, Jun 29 '12 at 19:11
