What is a VM and why do dynamic languages need one?

Question

So, for example, Python and Java have a VM, C and Haskell do not. (Correct me if I'm wrong)

Thinking about what languages on both sides of the line have, I can't find the reason. Java is static in a lot of ways, while Haskell provides a lot of dynamic features.

Haskell is statically typed. (http://www.haskell.org/haskellwiki/Introduction) - what sort of "dynamic features" were you thinking of? — Jon Skeet, Jan 09 '11 at 18:09
What do you mean by "Haskell provides a lot of dynamic features"? It's possibly the most strongly-typed language in the world. — finnw, Jan 10 '11 at 23:44
@Jon Skeet, @finnw: "Generics" are perceived as "dynamic features", because common statically typed languages such as Java and C++ have bad support of generics (awkward syntax, unhelpful error messages, etc), when comparing to common dynamic languages like Python and Ruby which have good support of generics. — yairchu, Jan 14 '11 at 18:13
@yairchu: I think we must have a *very* different idea of what constitutes "support of generics" - my concept of generics doesn't even make sense in a dynamic-only language. I regard the "dynamic" nature of generics, such as it is, to be entirely different from the "dynamic" nature of Python and Ruby. — Jon Skeet, Jan 14 '11 at 18:31
@Jon Skeet: My idea of generics is no different from yours, I was just describing another legitimate point of view. — yairchu, Jan 15 '11 at 12:33
@yairchu: So in what way do Python and Ruby have "good support for generics"? They don't allow you to *express* generics in the way that I understand generics... From the Wikipedia article on generic programming: "Dynamic typing (such as is featured in Objective-C) and judicious use of protocols circumvent the need for use of generic programming techniques, since there exists a general type to contain any object." — Jon Skeet, Jan 15 '11 at 12:43

Oliver Charlesworth · Answer 1 · 2011-01-09T18:41:12.183

29

It's nothing to do with static vs. dynamic.

Rather, it's about becoming independent from the underlying hardware platform ("build once, run everywhere" - in theory...)

Actually, it's nothing to do with the language, either. One could write a C compiler that generates bytecode for the JVM. One could write a Java compiler that generates x86 machine code.

edited Jan 09 '11 at 18:41

answered Jan 09 '11 at 18:08

Oliver Charlesworth

267,707
33
569
680

3

A real world example: the "regular" C# compiler generates bytecode for the CLR, while the MonoTouch compiler generates native iOS code. – R. Martinho Fernandes Jan 09 '11 at 18:15
14

Someone *could*? Someone *has*. C to Java bytecode: [NestedVM](http://nestedvm.ibex.org/); Java to machine code: [gcj](http://gcc.gnu.org/java/). – ephemient Jan 09 '11 at 18:20
I'm not so sure about the C to JVM thing. JVM bytecodes are pretty high level and much of Java "leaks" through. Perhaps if you generate one very large function. – Axel Gneiting Jan 09 '11 at 18:20
4

Further example: the clang C compiler generates bytecode for the LLVM virtual machine. Mind you, LLVM is not what Java or .NET programmers would think of as a full-featured VM, since it doesn't provide an OS environment abstraction, just a CPU abstraction. – Steve Jessop Jan 09 '11 at 18:21
@Oli Charlesworth: becoming independent from the underlying hardware platform isn't only done for WORA. Becoming independent from the underlying hardware also has, for example, obvious security benefits (think about non-existent buffer overflow/overrun in Java, besides in C-written libs and in faulty [non compliant] VM implementations). Sandboxing is a very important aspect of VMs. Actually I do care more about the sandboxing/security apsect then about the portability aspect. – SyntaxT3rr0r Jan 09 '11 at 18:28
3

@SpoonBender: You don't need a VM to prevent buffer overflow. This could be done at the language level, with the compiler silently generating run-time checks as required. Think e.g. Ada. I agree, however, that sandboxing is an intrinsic benefit of VMs. – Oliver Charlesworth Jan 09 '11 at 18:33
14

You don't need a VM for "write once, run everywhere". Haskell code doesn't need to be rewritten to run on different platforms. Neither does ANSI C code for that matter. You just need a VM for "compile once, run everywhere". – sepp2k Jan 09 '11 at 18:39
1

@SpoonBender: and if you're talking security benefits, there may be no difference between emitting code for a VM, and emitting machine code. I run virtual x86 machines (well, actually my webhost does) - in the context of system administration that's a "VM", although in the context of compiler technology it probably isn't. Conversely, you *could* have a high-level, abstract VM with slim to no worthwhile sandboxing at all, it's just in practice you don't because people who want to design a VM for their language typically want to sandbox their language too. – Steve Jessop Jan 09 '11 at 19:08
Pascal was commonly implemented on a VM. The most famous one is USCD Pascal. – AProgrammer Jan 09 '11 at 19:14
For example the Valgrind is a VM for C to debug memory issues. – maksadbek Jan 30 '20 at 20:35

yairchu · Accepted Answer · 2018-10-23T09:56:51.063

Let's forget about VMs for a sec (we'll get back to those below, I promise), and start with this important fact:

C doesn't have garbage collection.

For a language to provide garbage collection, there has to be some sort of "runtime"/runtime-environment/thing that will perform it.

That's why Python, Java, and Haskell require a "runtime", and C, which does not, can just straight-forwardly compile to native code.

Note that psyco was a Python optimizer that compiled Python code to machine code, however, a lot of that machine code consisted of calls to C-Python's runtime's functions, such as PyImport_AddModule, PyImport_GetModuleDict, etc.

Haskell/GHC is in a similar boat to psyco-compiled Python. Ints are added as simple machine instructions, but more complicated stuff which allocate objects etc, invoke the runtime.

What else?

C doesn't have "exceptions"

If we were to add exceptions to C, our generated machine code would need to do some stuff for every function and for every function call.

If we then add "closures" as well, there would be more stuff added.

Now, instead of having this boilerplate machine code repeated in every function, we could make it instead call a subprocedure to do the necessary stuff, something like PyErr_Occurred.

So now, basically every original source line maps to some calls to some functions and a smaller unique part.

But as long as we're doing so much stuff per original source code line, why even bother with machine code?

Here's an idea (btw let's call this idea a "Virtual Machine").

Let's represent your Python code, which is for example:

def has_no_letters(text):
  return text.upper() == text.lower()

As an in-memory data-structure, for example:

{ 'func_name': 'has_no_letters',
  'num_args': 1,
  'kwargs': [],
  'codez': [
    ('get_attr', 'tmp_a', 'arg_0', 'upper'),  # tmp_a = arg_0.upper
    ('func_call', 'tmp_b', 'tmp_a', []),  # tmp_b = tmp_a() # tmp_b = arg_0.upper()
    ('get_attr', 'tmp_c', 'arg_0', 'lower'),
    ('func_call', 'tmp_d', 'tmp_c', []),
    ('get_global', 'tmp_e', '=='),
    ('func_call', 'tmp_f', 'tmp_e', ['tmp_b', 'tmp_d']),
    ('return', 'tmp_f'),
  ]
}

Now, let's write an interpreter that executes this in-memory data structure.

Let's discuss the benefits of this over direct-from-text-interpreters, and then the benefits over compiling to machine code.

The benefits of VMs over direct-from-text-interpreters

The VM system gives you all the syntax errors before executing the code.
When evaluating a loop, a VM system doesn't parse the source code each time it runs.
- Making the VM faster than the direct-from-text-interpreter.
- So the direct interpreter runs slower with long variable name, and faster with short variable names. This encourages people to write crappy mathematician-style code such as wt(f, d(o, e), s) <= th(i, s) + cr(a, p * d + o)

The benefits of VMs over compiling to machine code

The in-memory data structure describing the program, or the "VM code", will probably be much more compact than boilerplate-full machine code which does the same stuff again and again for every original line of code. This will make the VM system run faster because less "instructions" will need to be fetched from memory.
Creating a VM is much simpler than creating a compiler to machine code. You can probably do this now without even knowing any assembly/machine-code.

score 4 · Answer 3 · answered Jan 09 '11 at 18:16

A VM (Virtual Machine) is actually a tool for a language designer to avoid some complexity in writing the implementation of a language.

Basically is a specification of a virtual computer and how each piece of said computer will interact with the other. You can code some assumptions in this specification that can be used by the actual language or not.

In this specification usually you define how the processor/processors work, how the memory works, what read/write barrier are possible etc, and a simpler assembly language to interact with it.

The final language is usually translated (compiled) from the text files you are writing into a representation written for that machine.

This has some advantages:

you decouple the language from a specific hardware architecture
usually allows you to control what happens
different people can port to a different architecture
you have more information to let optimize the code
etc.

There is also the coolness factor: Look Ma i made a virtual machine :).

Rene Saarsoo · Answer 4 · 2011-01-09T20:12:42.517

A virtual machine is basically an interpreter that interprets a language closer to machine code. When real machine interprets real machine code, Virtual Machine interprets a made-up machine code. Some VM-s interpret machine code of an actual computer - these are called emulators.

It's easier to write an interpreter for a simple assembly-like language, then for the full high-level language. Besides, a lot of high-level code-constructs are often just syntactic sugar over some basic principles. So it's easier to just write a compiler that translates all those complex concepts to simple VM-language, so we don't have to write a complex interpreter but can get away with simple one (a VM). And then you have more time for optimizing the VM.

That's basically how most languages these days (that don't compile down to real machine code) are implemented.

The interpreter (VM) and compiler can either be separate programs (like java and javac), or they can be just one program (like with Ruby or Python).

Joel · Answer 5 · 2011-01-09T18:17:38.150

From the wikipedia entry on Virtual Machines:

"A virtual machine (VM) is a software implementation of a machine (i.e. a computer) that executes programs like a physical machine."

The greatest asset of Virtual Machines is, in theory, code portability - "write once, run anywhere"

Probably the best known example of a Virtual Machine is the JVM, originally designed to run Java code, but now also increasingly used for langauges such as Clojure and Scala.

There's nothing specific to dynamic languages that means they need a VM. They do however need an interpreter, which could be built on a VM.

Why would they need an interpreter? – AProgrammer Jan 09 '11 at 19:51 — AProgrammer, Jan 09 '11 at 19:51

score 2 · Answer 6 · answered Jan 09 '11 at 18:10

There's no "need", any of these languages provide compilers that directly emit the machine code to implement the semantics of their language in a given architecture.

The idea of a virtual machine is to abstract away the architectural differences between all the different hardware and software manufacturers so that developers have a single machine to write to.

score 2 · Answer 7 · answered Jan 09 '11 at 18:10

Java and python can be compiled in a way that maintains platform independance. This holds even for C#. Advantages are that VMs are able to convert this mostly strongly typed bytecode into very good platform specific code with relativ low overhead. Since Java is intended to "build once - run anywhere", the JVM has been created.

score 2 · Answer 8 · answered Jan 09 '11 at 21:58

Imagine you created a programming language: you figured out the language semantics and developed a nice syntax.

However, a textual representation isn't enough: Having to parse text again and again when executing a program is inefficient, so it's natural to add an in-memory binary representation. Couple that with a custom memory manager, and you've basically got a VM.

Now, for extra points, either develop a bytecode format for serialization of your in-memory representation and a runtime loader, or, if you want to go the way of scripting languages, an eval() function.

For the grand finale, add a JIT.