
I am contemplating designing a programming language, and I'd like it to start up at about the same speed as CPython or Perl. In order to make the right design choices in my language to achieve this requirement, I'm looking at existing dynamic languages to see how their design choices impact their startup time. Many JVM- or CLR-based language implementations have a much longer startup time than CPython or Perl. This suggests that a design choice was made in the design of the JVM and/or CLR which causes this. What was that choice, and why was it made that way?

This is a three-part question:

  1. Is slow startup of dynamic JVM/CLR language implementations a fundamental design issue at all, or just a minor problem that could be corrected by improving language implementations?
  2. If it's a design issue, then which design choices of the JVM, and which design choices of these languages, cause these languages to have longer startup latency than CPython and Perl?
  3. What is being gained in exchange for being slow to start? That is, what benefits do JVM/CLR dynamic languages have that CPython and Perl lack, due to the design choices described by (2)?

Note that other SO questions already deal with "Why is the JVM slow to start?" and why various JVM languages are slow to boot. This question is distinct from those because it is about the design tradeoff: what is being gained in exchange for that long startup time?

Other SO questions ask how various JVM languages can be sped up by the user (and the answer is often to have some sort of daemon that pre-loads a JVM), but that is not what I'm asking here; I'm asking how you design a language (and/or virtual machine) that permits fast startup (without preloading), and what you lose in exchange for this.

Background research

Speed of various language implementations

I benchmarked CPython and Perl in informal Hello World tests on my GNU/Linux machine, and found they start in less than 0.05 seconds. In the rest of this post, I will say 'fast' to mean "startup time that is not significantly longer than CPython's or Perl's", and 'slow' to mean otherwise.
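For reference, an informal measurement like this can be reproduced with a short script that times a full interpreter start, Hello World, and exit in a subprocess (a single-run sketch; real benchmarking tools average many runs, and the numbers will vary by machine):

```python
import subprocess
import sys
import time

# Time one complete interpreter startup + "Hello World" + shutdown
start = time.perf_counter()
result = subprocess.run(
    [sys.executable, "-c", "print('Hello World')"],
    check=True,
    capture_output=True,
)
elapsed = time.perf_counter() - start
print(f"startup + Hello World took {elapsed:.3f}s")
```

Note that a second run is typically faster than the first, because the interpreter binary and its startup modules are then in the OS file cache (the 'warm start' effect discussed below).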

It is easy to find opinions that the JVM itself and/or Java are slow to start (3, 4, 5, 6), as well as concrete numbers on the order of 1 second or more (7, 27) and benchmarks (8). However two Hello World JVM benchmarks started in only 0.04 seconds (on GNU/Linux) (9, 10).

Clojure had a startup time of around 0.6-1 second (1, 2); this is about 20x slower than my target of 0.05 seconds. ClojureCLR is even slower (1). Clojure startup time benchmarks and discussion may be found in the blog posts Clojure bootstrapping (Kariniemi), Why is Clojure slow (Trojer).

One startup time benchmarker said that Clojure and JRuby were "significantly slower than everything else" (25); these were also the only two JVM-based dynamic languages tested. Another (very old) benchmark shows that Jython is also very slow to start (26). We are focusing on dynamic languages in this question; however, it may be relevant that Scala is not incredibly fast either (1). There is a JVM Scheme called Kawa (23). Kawa's startup time was reported to be about 0.4 seconds (24), which is faster than Clojure but still an order of magnitude above my target.

What are the implementations doing during startup?

Both (1, 2) conclude that Clojure spends its startup time loading classes and initializing the clojure.core namespace. An answer to the SO question "Clojure application startup performance" seems to be saying that the difference between Java startup time and Clojure startup time arises because Java lazily loads its standard library, whereas Clojure loads its eagerly. Answers to the SO question "Can any Clojure implementation start fast?" include "it's just an implementation issue that could be corrected, not a fundamental design choice" (paraphrased), and "one limitation of the JVM is that objects must be copied on initialization: 'You can't embed any composite constants in byte code. Not even arrays.'"

One blog post states that ClojureCLR's startup time is mostly spent JITing, and that pre-JITing brings the time down dramatically (although it may still be slow compared to CPython and Perl).

One explanation provided for why some JVM or Java programs are slow to start is the I/O of loading in many classfiles from a standard library (11). This hypothesis is supported by benchmarks that show a drastic improvement in JVM startup time for 'warm starts' where presumably the contents of standard library class files have already been loaded into the operating system's cache. Some say that much of the startup time is due to I/O reading in class files, but not because of the sheer volume of data, rather because of suboptimal organization of that data on disk (15, 16).

JVM's bytecode verifier is probably not a significant contributor to startup time, because 40% speedup of the verifier only translated to a 5% speedup of large program startup time (14).

What design choices (do not) lead to slow startup?

In (22), Kariniemi comes to the conclusion that Clojure is inherently slow to start due to the design choice of including dynamic features. However, I question this conclusion, because CPython and Perl achieve much faster startup while still providing dynamism.

The use of bytecode cannot be the cause, because CPython also uses bytecode.

Because the I/O of loading classfiles seems to be at fault, one might suspect that the underlying design choice is the provision of a large standard library. However, this cannot be the cause, because CPython also provides a large standard library and is not slow to start. Also, although the slowness of Java is under dispute, it is worth noting that Java must load rt.jar upon startup, yet Hello World runs fast in Java according to some benchmarks.

bshanks
    The only thing I can add is this: Python has a large standard library but only 40 modules (CPython 3.4 REPL on Windows) of it are loaded upon startup (the minimum to set up everything and start running the user program), and a number of those are built-in, i.e. C code compiled into the main binary, which saves on file I/O. In addition, startup time has long been a concern of several core developers, and consequently has been optimized quite a bit. –  Jan 23 '15 at 22:49
  • This question is being used as an audit, and audit system thinks it is not "too broad". It could be a good question by itself, but it is still too broad. – Oleg Estekhin Jan 28 '15 at 08:14
  • BTW, Perl 5 compiles to the ASG; it does not create bytecode or machine code. This makes compiling a little faster. It then executes the ASG with hand-optimized software, so it is as fast as any VM. – shawnhcorey Feb 01 '15 at 14:48

4 Answers


Startup time is determined by the amount of work the runtime must do before it can actually start executing any 'user code'. Let me compare what exactly happens under some of these choices.

Native Binary (C++ or so)

The operating system maps the main executable file into memory. Even if this file is very large (a few GB), mapping is still pretty fast, and it is very fast for a typical file size of 10-50 MB. Then an executable header is read, which provides a list of dynamic modules. Those modules are searched for by the OS and mapped in the same way. Then, possibly, some relocations take place. After this your code is ready to execute (although control at this point is probably given to your language runtime, not to the code itself).

Scripting Languages

After everything described in the previous section happens to the interpreter executable, it starts to read and execute the provided script. Let's assume that no parsing/compiling to bytecode takes place (we already have everything in .pyc or a similar bytecode format). For every module the interpreter needs to load, it just allocates a large enough chunk of memory and copies the module's content into it, then transfers control to this chunk of bytecode. Some work does have to be done at this stage, but usually not very much. For example, this is essentially what happens to the bytecode of Python's dis module when you run import dis.
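CPython makes this concrete: a .pyc file is essentially a marshalled code object, and "loading" it amounts to deserializing a blob and executing it. A minimal sketch (the module source here is a stand-in for illustration):

```python
import dis
import marshal

# Compile a tiny "module body" to a code object, as CPython does for a .pyc
code = compile("x = 40 + 2\n", "<example>", "exec")

# marshal is the serialization format used inside .pyc files:
# loading a cached module is little more than loads() + exec()
blob = marshal.dumps(code)
restored = marshal.loads(blob)

namespace = {}
exec(restored, namespace)   # roughly what import does with the module body
print(namespace["x"])       # 42

dis.dis(restored)           # the module body is itself just bytecode
```

This is why the per-module work at load time is so small: deserialize, execute a short top-level bytecode sequence that fills the module's namespace, done.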

Going to JVM

For the JVM, it is not so easy. First, the runtime can't just 'map' a .class file into memory, or read its content into memory and tell the interpreter: "Hey! Here is your bytecode." The file needs to be verified and resolved.

The purpose of verification is to make sure that the interpreter can execute the code without any further runtime checks (for out-of-function branches, stack overflow or underflow, type errors). Even assuming an O(number of instructions) time bound for verification, that is still a lot of work, as every single instruction in the module must be checked. Remember, for a scripting language we do only a small amount of work on loading, usually just filling an 'export' dictionary with new functions and classes.

Resolution is a kind of optimization (and a language design choice). Consider this Java code:

System.out.println("Hello, world!");

For this code the Java compiler puts into the .class file symbolic information about the call: out is a static field of class java.lang.System with type java.io.PrintStream, and println is a method of java.io.PrintStream with signature (Ljava/lang/String;)V. When a class containing the above line is loaded, the JVM must look up java.lang.System (possibly loading it in the process), find the out field, find the println method with this signature, and store a pointer to that method somewhere so it can be found later, when this line happens to execute.

This procedure must be performed for every unique method invocation in every loaded class, and likewise for every referenced class, field, interface and so on. So loading a big .class file is not just about "copying its content to memory" and "performing some environment adjustment".

With a large enough standard library, those operations alone can already lead to a long startup time.

Compilation (at least optimizing compilation) is slow; remember how long it can take to compile a decent-sized C++ project. So various tricks are used to make this operation faster. In the JVM (at least in some implementations), interpretation and JIT compilation can be performed in parallel: code is interpreted by default and JIT-compiled if it is determined to be 'hot' (executed often). But interpretation is also slow, so this is not a magic bullet, just a tradeoff between "doing things slowly" and "not doing them at all and hoping the JIT will finish its work soon". Python, without JIT support, just "does it slowly", but some performance-critical parts of its standard library (like dictionaries) are written in C (or Java, or C#, depending on the implementation), not in Python itself. The Java standard library is written in Java; that is why it, too, must be JIT-compiled, or it will run slowly.
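The point about CPython's performance-critical pieces can be checked directly from the interpreter: built-in modules are compiled into the interpreter binary itself, so importing them costs no file I/O, and core types like dict have no Python-level source at all:

```python
import sys

# Modules listed here are compiled into the CPython executable itself,
# so importing them involves no file I/O (the exact set varies by build)
print("sys" in sys.builtin_module_names)   # True
print(len(sys.builtin_module_names))

# dict is implemented in C and lives in the 'builtins' namespace, not a .py file
print(dict.__module__)                     # builtins
```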

Summary

  1. These slow startup times are a design issue. They are the price of being 'almost as fast as C' and highly dynamic at the same time.

  2. The design choices leading to this slowdown are: bytecode verification at load time instead of execution time, load-time linking, and JIT compilation.

  3. As I said, this allows the JVM to generate code with the JIT which is almost as fast as (and on some tests even faster than) code in a native language.

Conclusion

If you want low startup time, design your language in such a way that the runtime does not need to do a lot of work to load a module: at best, no more work than a copy plus an environment update.

JIT overhead can be beaten with a JIT cache or ahead-of-time compilation. That is not possible if your language is completely dynamic (for example, if you can override the 'array.length' property in some module, the standard library must also respect this change).

gvlasov
  • Your notes on verification contradict information in the question (that a 40% speedup of verification only gave a 5% speedup for launch time). –  Jan 26 '15 at 23:13
  • So verification takes around 12% of total startup time (as a lower bound). Out of Java's 0.5-1 s startup time, that is 0.05-0.1 seconds, which is already "too slow" for your target. So I see no contradiction here; you would need to get rid of this 'verification' to achieve your goal. –  Jan 26 '15 at 23:23
  • Re: verification; a minor correction: Given a method `void infiniteRecursiveLoop() { infiniteRecursiveLoop(); }`), which will produce a stack overflow when called, it should pass bytecode verification just fine. Bytecode verification cannot prevent (all) stack overflows from happening. – stakx - no longer contributing Feb 01 '15 at 08:37
  • @stakx, yes, of course. I meant unbalanced push/pop opcodes within a single function. –  Feb 01 '15 at 10:04

First of all, most languages can in principle either be compiled to a binary or run in a runtime (or something in between, like JIT). So what you're actually asking about is not so much language design as compiler and runtime design (see Compiled vs. Interpreted Languages).

Now there are, broadly speaking, three stages that you have to consider. Basically you'll have to choose which of those stages you want to spend more time in and which less; it's always going to be a trade-off.

  1. Compile time
  2. Initialization
  3. Run time

Languages that are usually compiled, like C, spend a lot of time in (1) to optimize the binary being generated. A compiler can do one of two things: either it tries to make the binary very small, so that when the program is started it is quickly loaded from the hard disk into RAM, thereby minimizing initialization time (2); or, more commonly, it does not bother much about binary size and instead tries to optimize runtime performance (3). Function inlining is a classic example of optimizing runtime performance at the expense of binary size.

At the other end, code in languages that are usually interpreted, like JavaScript or even Bash scripts, has virtually no compilation step (so (1) is reduced to zero time). These languages then also have to decide between optimizing (2) and (3). Traditionally, interpreted languages only loaded the runtime and then started interpreting application code, but the advent of JIT (just-in-time) compilation has changed that somewhat, trading potentially longer initialization (2) for better run-time performance (3).

To specifically answer your question: What is being gained in exchange for being slow to start? Primarily runtime performance. This used to be more of an issue when computers were a lot slower. In the case of JVM/CLR, portability is gained as well (if you have a JVM, you can run the same code on x86 and ARM architectures).


Functional languages like Clojure and Haskell have somewhat different issues to solve than the imperative languages mentioned above. For example, Haskell embeds its own smallish runtime into every binary the Haskell compiler generates. This enables, among other things, Haskell's lazy-evaluation semantics. So contrary to what I said before, in functional languages the language design itself (and not only the compiler/runtime design) may very well have an impact on the necessary program initialization time.

However, if you're going to write a compiler or interpreter, I'd recommend starting with an imperative language, since compiling functional code is conceptually much harder. If you like books, the Dragon Book is held in high regard by many.

mb21

The meta-analysis of other people's studies that you are attempting is not a very good idea, in my opinion. You have no knowledge of the variations between these people's setups, so you may be comparing apples with oranges, and you have no guarantees about the validity of the methodology each one of them follows, so the apples may be rotten, the oranges may in fact be clementines, and there may be an occasional pear thrown in: you would not know.

The answer to your questions is that most of the slow startup time is due to fundamental design choices that we really want to have, and only a small part is due to non-fundamental but simply unfortunate design choices that we are stuck with for compatibility reasons and that could probably be avoided by better design. Luckily, the performance penalty of the fundamental design choices that we really want to have can probably be reduced by after-the-fact optimizations, but the people working on these things tend to be pretty smart, so most things that could be improved probably have been improved already.

An example of a fundamental design choice is the use of an intermediate language such as bytecode or MSIL, which means that during startup you will be paying the penalty of running an optimizing compiler.

An example of a not fundamental but unfortunate design choice is Java's one-class-per-file choice, conceived back in the day when Java was meant to run inside web pages on the client side: it significantly reduces the startup time of a single class, but it introduces a huge additional performance penalty every time you do something that requires more of your classes to be loaded. By the time you have loaded enough classes to complete your startup, you have generally paid a much greater total performance penalty than if you had loaded all those classes at once.

Here is a list of issues that I can think of, which affect startup time. Most of them are fairly obviously important, so it would be pointless to explain what is gained by each one of them.

  1. The use of an intermediate language. You are essentially passing your entire application and parts of the runtime that it references through an optimizing compiler every time you start it. This can be optimized by caching the binaries on installation, or on first run, and reusing them on subsequent runs. I would guess that some runtimes out there are already doing it, or else it is not so much of a problem.

  2. The use of a garbage collector. This is a fairly complex piece of software, so I would not be surprised if it has a somewhat costly initialization. For example, the garbage collector usually runs on a thread of its own, which means that application startup is burdened with launching at least one thread. (Which must have its own stack, etc.) This could be optimized by launching the extra garbage collector thread only if it is needed, so applications that terminate before allocating much memory would never need it. Also, many GUI applications could theoretically do all of their garbage collection during idle time, so they do not really need an extra thread for this, though the startup time of GUI applications is usually not of much concern.

  3. The use of an isolated VM. This means that immediately upon launching, the VM must pre-allocate a huge chunk of memory to work with. Massive memory allocation generally incurs a significant performance penalty, but it saves us from additional post-startup small-chunk allocations, which can add up to a much greater overall performance penalty in the long run. This is usually an operating system / hardware architecture issue, and I would not be surprised if Linux does a much better job at this than Windows. One way to optimize it is to establish some way of determining the end of your startup phase, take note of how much memory it required, and then on subsequent launches tell the VM to preallocate exactly that much memory: no less, so as not to incur the penalty of additional allocations, and no more, so as not to waste time allocating memory that is not going to be needed. Of course, you need to be smart about this, since the memory your application needs to allocate at startup often depends on the command-line arguments, which are likely to differ from run to run. As usual, no silver bullet.

  4. Security considerations. This is an area in which I have little experience, so I cannot really provide much insight. I know that when my Java program runs, every single byte of my bytecode gets examined by some verifier, and I know that every time I try to do something as innocent as loading a class, some security manager is invoked to opine on whether I should be allowed to do so. For the life of me, I do not know why the JVM bothers with all this instead of just letting the operating system handle a process which crashes or which lacks sufficient privileges to do its job. I do sometimes entertain the notion that it is nothing but a marketing plot to add the "security" buzzword to the selling points of the language, but of course only within a <joking></joking> tag. I strongly suspect that you could do away with all this "security", but be sure to also consult others on this topic.

  5. The organization of class files. Java uses JAR files, which are essentially ZIP files, so the first time a class is needed from a JAR file the entire file has to be parsed, and possibly all of it decompressed and indexed, even if no other class will ever be needed from it. I am sure JVMs optimize this as much as they can, but still, I would be willing to bet that a different choice could yield better results. The benefit is, of course, that you can open your JAR files with your ZIP utility. The CLR uses DLL assemblies, the structure of which I do not know, but I would be willing to bet that they perform (or can potentially perform) better.

  6. The fundamental decision to provide a rich runtime environment, and the programming culture of reusing existing features to solve problems instead of providing possibly optimal but ad-hoc solutions. In C, your "Hello, World!" program would pass the printf() function a pointer to a static array of characters. Nothing can perform better than that. In Java, a String object will be instantiated, additional memory will be allocated for an array of characters, the characters will be copied into that memory, and then a pointer to the string object will be passed to the System.out.println() method. The String class is quite complex, and a big part of its complexity has to do with post-startup performance. (See string interning.) Then, the println() method will probably be converting the Unicode characters to ANSI, which means that various text-converting classes will be loaded, quite possibly along with Unicode conversion tables, which are of non-negligible size. These text-converting classes will most probably make use of the standard collections, where the HashMap alone is around 50 KB of bytecode. You see where this goes.
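Regarding point 5, CPython has an analogous mechanism (importing modules from ZIP archives via zipimport), which makes it easy to sketch what "library in an archive" loading looks like. A toy example, using a hypothetical module name:

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny ZIP "library" (the moral equivalent of a JAR) with one module
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "lib.zip")
with zipfile.ZipFile(archive, "w") as z:
    z.writestr("zipdemo_mod.py", "def hello():\n    return 'hi'\n")

# Putting the archive on sys.path lets zipimport load modules out of it;
# the archive's central directory is read once and indexed, after which
# individual modules can be extracted on demand
sys.path.insert(0, archive)
import zipdemo_mod
print(zipdemo_mod.hello())  # hi
```

The startup-cost question is exactly the one raised above: how much of the archive must be read and indexed before the first module can be served.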

Mike Nakis

Here is a summary of some of what other answers have said.

The potential design tradeoffs are:

  • latency vs. throughput
  • pre-compiling some library modules vs. std library only in bytecode
  • lazy loading module imports
  • safety vs. efficiency
  • efficiency of on-disk organization of modules

It seems unlikely that these are all equally important, but without experimentation, it's difficult to know which of these contribute the most to the startup latency of applications running on managed runtimes.

In more detail:

latency vs. throughput

If you are willing for initialization to take a long time, you can go faster later. Some things you can opt to do during initialization to improve later speed are:

  • load time compilation (or partial compilation)
  • load time linking
  • eagerly load standard library
  • possibly, do extra work while initializing the GC

The JVM does load-time linking (deniss). Many JVM implementations do partial compilation during load time or initialization (deniss, mb21). CPython provides a large standard library but only pre-loads those modules essential at startup time (delnan); (14) implies that Java lazily loads its standard library but Clojure eagerly loads its. Mike Nakis suggests that the JVM may spend a lot of time initializing the GC.

pre-compiling some library modules vs. std library only in bytecode, only interpreted, and separate from the main executable

There are various ways that a set of library modules could be pre-compiled:

  • their bytecode could be compiled at load time
  • the runtime could load a set of libraries once, dump an image of its memory, and then next time just load in this image
  • they could be compiled into the VM's main executable
  • they could be written in a different language and compiled to binary libraries which the VM dynamically loads at runtime

Aside from increased implementation complexity, I can't think of any design tradeoffs against the first three of these. The last one does involve a tradeoff, because eschewing native code and writing the standard library in the language itself has other advantages: (a) portability; (b) developers don't have to know how to write native code in order to contribute to the standard libraries; (c) use of the standard library as sample code; (d) the standard libraries don't require an FFI, nor do they require other special treatment (as they would if they were native but other libraries could not be).

CPython ships with some standard library modules compiled into its main executable (delnan). Some JVM implementations provide 'class data sharing', which appears to be a system for caching an image of those parts of the Java standard library which are always loaded upon startup. However, the JVM's 'class data sharing' apparently can't easily be used to precompile user libraries such as the standard libraries of Clojure, meaning that other languages on the JVM besides Java are out of luck. Some other languages, such as SBCL, provide a facility to dump an image of the interpreter's current state (20).
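The first option, caching compiled bytecode on first run, is exactly what CPython does with .pyc files, and the mechanism can be driven by hand (module name and paths here are temporary, for illustration):

```python
import importlib.util
import os
import py_compile
import tempfile

# Write a tiny source module, then precompile it the way CPython caches imports
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "mymod.py")
with open(src, "w") as f:
    f.write("VALUE = 42\n")

pyc = py_compile.compile(src)   # writes __pycache__/mymod.cpython-XY.pyc

# A later run can load the cached bytecode directly, skipping parse/compile
spec = importlib.util.spec_from_file_location("mymod", pyc)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
print(mod.VALUE)  # 42
```

On a normal import CPython does all of this transparently: it compiles on first import, writes the .pyc next to the source, and reuses it on subsequent runs as long as the source is unchanged.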

lazy loading module imports

An answer to the SO question "Clojure application startup performance" seems to be saying that the difference between Java startup time and Clojure startup time arises because Java lazily loads library modules, whereas Clojure loads them eagerly. Except for increased complexity (which is important), I can't think of a design reason why any language should not support (at least optional) lazy loading of modules.
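CPython in fact ships optional lazy loading in its import machinery; the recipe from the importlib documentation defers executing a module's body until the first attribute access:

```python
import importlib.util
import sys

def lazy_import(name):
    """Return a module whose body is not executed until first attribute access."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)   # despite the name, this defers the real load
    return module

json = lazy_import("json")       # cheap: json's module body has not run yet
print(json.dumps([1, 2]))        # first attribute access triggers the real import
```

The cost is the added complexity noted above: import-time side effects (and import errors) are postponed to an arbitrary later point in the program.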

safety vs. efficiency

A language may wish to provide 'sandboxing', in which untrusted code can be executed with limited capabilities. This requires either verifying the code before running it, or dynamically inserting checks before potentially restricted actions (Mike Nakis). In the latter case, if one of these actions is to load a module, this dynamic checking might be done many times upon startup, adding to startup time.

Similarly, a language may wish to guarantee that even bytecode written by an adversary cannot 'crash' in an uncontrolled fashion that returns control to the OS, but at worst can only cause an exception, transferring control to the in-language exception handler. This requires either a bytecode language which cannot syntactically express crash-causing situations such as illegal memory accesses, or dynamic checks at runtime before each potentially dangerous action, or verifying the bytecode at load time to prove that it cannot cause a crash (deniss). The last of these choices increases the load time of each module, adding to startup time.

The JVM provides sandboxing through dynamic checks and bytecode verification through a load-time check; both of these choices slow down the loading of library modules at startup to some degree, although perhaps a very small one.

efficiency of on-disk organization of modules

JVM libraries are spread across many files because they are one-class-per-file. This increases loading time compared to putting many classes into one file (5, Mike Nakis). JVM libraries are also encoded into .jar files, which uses the ZIP file format. Another file format might have been more efficient for loading, especially in those cases where only a few classes are needed from a large .jar with many classes (7, Mike Nakis).

In addition, JVM bytecode does not support composite constants, meaning that composite constants must be built at runtime from a sequence of interpreted instructions run at initialization (however, I think CPython is the same way). Allowing composite constants in bytecode would require the VM spec to include a format for serializing composite data structures, a significant increase in specification complexity.
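The hedge about CPython can be checked with dis: CPython does embed immutable composite constants (tuples of constants) directly in code objects, while mutable composites like lists must still be assembled at runtime:

```python
import dis

def f():
    return (1, 2, 3)   # immutable composite: foldable into a constant

def g():
    return [1, 2, 3]   # mutable composite: must be built anew on each call

# The tuple is stored directly in the code object's constant pool...
print((1, 2, 3) in f.__code__.co_consts)                        # True
# ...but no list ever appears there; g's bytecode constructs one at runtime
print(any(isinstance(c, list) for c in g.__code__.co_consts))   # False

dis.dis(g)   # shows the list-building instructions
```

So CPython sits partway between the two designs: it has a serialization format (marshal) for immutable constants, but still pays runtime construction costs for mutable ones.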

bshanks