66

Confused by java compilation process

OK i know this: We write java source code, the compiler which is platform independent translates it into bytecode, then the jvm which is platform dependent translates it into machine code.

So from the start, we write Java source code. The compiler javac.exe is a .exe file. What exactly is this .exe file? Isn't the Java compiler written in Java? Then how come there is a .exe file which executes it? If the compiler code is written in Java, then how come the compiler code is executed at the compilation stage, since it's the job of the JVM to execute Java code? How can a language compile its own code? It all seems like a chicken-and-egg problem to me.

Now what exactly does the .class file contain? Is it an abstract syntax tree in text form, is it tabular information, what is it?

Can anybody explain, clearly and in detail, how my Java source code gets converted into machine code?

nash
  • A language can easily compile its own language code. C/C++ compilers are often written in C or C++, the cobra language compiler is written in cobra, and there are many examples of http://en.wikipedia.org/wiki/Self-hosting compilers. – jcao219 Aug 04 '10 at 15:20
  • 5
    The compiler doesn't have to be platform independent, it just has to conform to specifications which only specify input and output. You could write a compiler in perl for all the resulting bytecode would care. – Mark Peters Aug 04 '10 at 15:23
  • related stackoverflow question: http://stackoverflow.com/questions/1220914/in-which-language-java-compiler-jvm-and-java-is-written – jvdneste Aug 04 '10 at 15:26
  • Not entirely relevant but it's good to mention that Sun's JVM is written in C, and Oracle's JVM (Hotspot) is written in C++. – gEdringer Feb 28 '17 at 16:43

9 Answers

63

OK i know this: We write java source code, the compiler which is platform independent translates it into bytecode,

Actually, the compiler itself is launched as a native executable (hence javac.exe). And true, it transforms source files into bytecode. The bytecode is platform independent, because it targets the Java Virtual Machine.

then the jvm which is platform dependent translates it into machine code.

Not always. Sun's JVM comes in two variants: client and server. Both can compile to native code, but neither necessarily has to.

So from start, we write java source code. The compiler javac.exe is a .exe file. What exactly is this .exe file? Isn't the java compiler written in java, then how come there is .exe file which executes it?

This exe file is a wrapper around Java bytecode. It's there for convenience, to avoid complicated batch scripts. It starts a JVM and executes the compiler.

If the compiler code is written in Java, then how come the compiler code is executed at the compilation stage, since it's the job of the JVM to execute Java code?

That's exactly what wrapping code does.
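To make the "compiler is just Java code running in a JVM" point concrete, here is a minimal sketch that runs the compiler inside the current JVM through the standard `javax.tools` API, with no javac.exe involved. The class name `InProcessCompile`, the temp directory and the tiny `Hello` source are made up for illustration, and it assumes you run it on a JDK (a bare runtime without the compiler returns `null` from `getSystemJavaCompiler()`).

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class InProcessCompile {

    /** Compiles one source file inside the current JVM; true means the compiler reported success. */
    public static boolean compile(Path sourceFile) {
        // The compiler is an ordinary Java object; it is null on a runtime that ships without it.
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler available; run this on a JDK");
        }
        // Passing null for in/out/err falls back to System.in, System.out and System.err.
        int exitCode = compiler.run(null, null, null, sourceFile.toString());
        return exitCode == 0;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical demo input: a throwaway Hello.java in a temp directory.
        Path dir = Files.createTempDirectory("javac-demo");
        Path source = dir.resolve("Hello.java");
        String code = "public class Hello { public static void main(String[] a) { System.out.println(\"hi\"); } }";
        Files.write(source, code.getBytes(StandardCharsets.UTF_8));

        System.out.println("compiled: " + compile(source));
        // Without a -d option the class file is written next to the source file.
        System.out.println("Hello.class written: " + Files.exists(dir.resolve("Hello.class")));
    }
}
```

Roughly speaking, the native launcher does the same thing from the outside: it starts a JVM and hands the command-line arguments to the compiler's Java entry point.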

How can a language itself compile its own language code? It all seems like chicken and egg problem to me.

True, it's confusing at first glance. It's not unique to Java, though. The Ada compiler is also written in Ada itself. It may look like a "chicken and egg problem", but in truth it's only a bootstrapping problem.

Now what exactly does the .class file contain? Is it an abstract syntax tree in text form, is it tabular information, what is it?

It's not an abstract syntax tree. An AST is only used by the parser and compiler at compile time to represent code in memory. A .class file is like assembly, but for the JVM. The JVM, in turn, is an abstract machine which runs a specialized machine language, targeted only at the virtual machine. At its simplest, a .class file has a structure very similar to a normal assembled object file: first a short header, then the constant pool (a table of constants and symbolic references to classes, fields and method signatures), then the field declarations, and lastly the methods with their bytecode.
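To see that a .class file is a binary table-like format rather than text, here is a small sketch that reads only the fixed-size start of a class file: the magic number `0xCAFEBABE`, the minor and major version, and the constant pool entry count. The class name `ClassFileHeader` and the `readHeader` helper are made up for this example; the field order comes from the class file format in the JVM specification.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ClassFileHeader {

    /** Returns { magic, minorVersion, majorVersion, constantPoolCount } from the start of a class file. */
    public static int[] readHeader(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in); // class files are big-endian, like DataInputStream
        int magic = data.readInt();                     // always 0xCAFEBABE
        int minor = data.readUnsignedShort();
        int major = data.readUnsignedShort();           // e.g. 52 for Java 8
        int constantPoolCount = data.readUnsignedShort();
        if (magic != 0xCAFEBABE) {
            throw new IOException("Not a class file");
        }
        return new int[] { magic, minor, major, constantPoolCount };
    }

    public static void main(String[] args) throws IOException {
        // Reads this class's own .class file from the classpath.
        try (InputStream in = ClassFileHeader.class.getResourceAsStream("ClassFileHeader.class")) {
            if (in == null) {
                System.out.println("Could not locate ClassFileHeader.class on the classpath");
                return;
            }
            int[] h = readHeader(in);
            System.out.printf("magic=%08X major=%d minor=%d constantPoolCount=%d%n", h[0], h[2], h[1], h[3]);
        }
    }
}
```

Everything after these ten bytes (constant pool entries, access flags, fields, methods, attributes) follows the same style: length-prefixed binary records, which is what javap decodes for you.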

If You are really curious You can dig into classfile using "javap" utility. Here is sample (obfuscated) output of invoking javap -c Main:

0:   new #2; //class SomeObject
3:   dup
4:   invokespecial   #3; //Method SomeObject."<init>":()V
7:   astore_1
8:   aload_1
9:   invokevirtual   #4; //Method SomeObject.doSomething:()V
12:  return

So You should have an idea already what it really is.
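For orientation, here is one plausible source that would compile to a listing like the one above. This is a guess, not the actual source behind the obfuscated output: `SomeObject`, `doSomething` and the `calls` counter are placeholders, and the only hard hint is `astore_1`, which means local slot 0 is already taken (by `args` in a static `main`, for instance).

```java
// Placeholder class standing in for whatever SomeObject really was.
class SomeObject {
    int calls = 0;

    void doSomething() {
        calls++;
    }
}

public class Main {
    public static void main(String[] args) {   // args occupies local slot 0
        SomeObject obj = new SomeObject();      // new, dup, invokespecial <init>, astore_1
        obj.doSomething();                      // aload_1, invokevirtual doSomething
    }                                           // return
}
```

Compiling this and running javap -c Main lets you compare the result with the listing; the constant pool indexes (#2, #3, #4) may differ between compiler versions.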

can anybody tell me clear and detailed way about how my java source code gets converted in machine code.

I think it should be clearer by now, but here's a short summary:

  • You invoke javac pointing to your source code file. The lexer and parser of javac read your file and build an actual AST out of it. All syntax errors come from this stage.

  • The javac hasn't finished its job yet. Once it has the AST, the true compilation can begin. It uses the visitor pattern to traverse the AST and resolves external dependencies to add meaning (semantics) to the code. The finished product is saved as a .class file containing bytecode.

  • Now it's time to run the thing. You invoke java with the name of the class (without the .class extension). Now the JVM starts again, but this time to interpret Your code. The JVM may or may not compile Your abstract bytecode into native machine code. Sun's HotSpot JVM does so with Just In Time compilation if needed. The running code is constantly being profiled by the JVM and recompiled to native code if certain rules are met. Most commonly, the hot code is the first to be compiled natively.
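The first two steps of the summary above can be observed from the outside with the standard `javax.tools` API: the sketch below feeds javac one source that is syntactically broken and one that parses fine but fails the later semantic checks, and prints the reported errors. The names `SyntaxErrorDemo`, `StringSource` and `countErrors` are made up for this example, and it assumes a JDK is present.

```java
import javax.tools.Diagnostic;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SyntaxErrorDemo {

    /** Source code held in a String instead of a file on disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Compiles the given source and returns how many ERROR diagnostics the compiler reported. */
    public static long countErrors(String className, String code) throws IOException {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler available; run this on a JDK");
        }
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
        // Send any generated class files to a temp directory instead of the working directory.
        List<String> options = Arrays.asList("-d", Files.createTempDirectory("classes").toString());
        compiler.getTask(null, null, diagnostics, options, null,
                Collections.singletonList(new StringSource(className, code))).call();

        long errors = 0;
        for (Diagnostic<? extends JavaFileObject> d : diagnostics.getDiagnostics()) {
            if (d.getKind() == Diagnostic.Kind.ERROR) {
                System.out.println("line " + d.getLineNumber() + ": " + d.getMessage(null));
                errors++;
            }
        }
        return errors;
    }

    public static void main(String[] args) throws IOException {
        // Rejected while parsing: there is no expression after '='.
        countErrors("Broken", "public class Broken { void m() { int x = ; } }");
        // Parses fine, but fails later when types are checked.
        countErrors("Mistyped", "public class Mistyped { void m() { int x = \"text\"; } }");
    }
}
```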

Edit: Without the javac launcher, one would have to invoke the compiler using something similar to this:

%JDK_HOME%/bin/java.exe -cp myclasspath com.sun.tools.javac.Main fileToCompile

As you can see, it's calling Sun's private API, so it's bound to Sun's JDK implementation. Build systems would become dependent on it. If one switched to any other JDK (the wiki lists 5 other than Sun's), the above command would have to be updated to reflect the change (since it's unlikely the compiler would reside in the com.sun.tools.javac package). Other compilers could even be written in native code.

So the standard way is to ship a javac wrapper with the JDK.

Jayesh
Rekin
  • 2
    So here's the scenario from what you described: we write source code. javac.exe, the executable, primarily exists to collect parameters and to start the JVM. The JVM then executes the compiler (a collection of a large number of already compiled .class files). The compiler then executes our written program. Also, when the compiler was being designed, some other language (C in our case) was used to create the .class files for the compiler. Am I right with what I said? The .exe file is wrapped Java bytecode? It isn't exactly wrapping code. What kind of complicated batch scripts are you talking about? – nash Aug 04 '10 at 18:10
  • 3
    Correct, `javac` is only a convenient wrapper. The real compiler code is in `com.sun.tools.javac` package of Sun's JDK. You can go there and take a look at the source code - it can be interesting to see the internals. You can also invoke `javac` compiler straight from Java, without calling any external processes. If You take that into account, it makes sense that `javac` is only a launcher of certain classfile buried inside JDK. I can only suspect, but I believe it does as little as starting JVM and passing the arguments. – Rekin Aug 04 '10 at 18:33
  • And about the batch script, take a look at the last paragraph of the answer. I edited it to answer Your comment. – Rekin Aug 04 '10 at 18:45
  • Ok thanks a lot, appreciate your efforts for the clear reply. – nash Aug 04 '10 at 18:57
  • A tokenizer tokenizes. It doesn't produce ASTs or syntax errors. You don't know whether `javac` uses the Visitor pattern or not. – user207421 Feb 15 '17 at 07:27
  • 1
    @EJP afair, those are the notes from reading the openjdk's javac source code – Rekin Feb 19 '17 at 09:24
16

Isn't the java compiler written in java, then how come there is .exe file which executes it?

Where do you get this information from? The javac executable could be written in any programming language, it is irrelevant, all that is important is that it is an executable which turns .java files into .class files.

For details on the binary specification of a .class file you might find these chapters in the Java Language Specification useful (although possibly a bit technical):

You can also take a look at the Virtual Machine Specification which covers:

Matthew Flaschen
matt b
  • Will check out the links. Plus the scenario i added as a comment to rekin's answer. Am i right with it? – nash Aug 04 '10 at 18:21
11

The compiler javac.exe is a .exe file. What exactly is this .exe file? Isn't the java compiler written in java, then how come there is .exe file which executes it?

The Java compiler (at least the one that comes with the Sun/Oracle JDK) is indeed written in Java. javac.exe is just a launcher that processes the command line arguments, some of which are passed on to the JVM that runs the compiler, and others to the compiler itself.

If the compiler code is written in Java, then how come the compiler code is executed at the compilation stage, since it's the job of the JVM to execute Java code? How can a language compile its own code? It all seems like a chicken-and-egg problem to me.

Many (if not most) compilers are written in the language they compile. Obviously, at some early stage the compiler itself had to be compiled by something else, but after that "bootstrapping", any new version of the compiler can be compiled by an older version.

Now what exactly does the .class file contain? Is it an abstract syntax tree in text form, is it tabular information, what is it?

The details of the class file format are described in the Java Virtual Machine specification.

Michael Borgwardt
  • "The Java compiler (at least the one that comes with the Sun/Oracle JDK) is indeed written in Java" I did not know this! Thank you, sir. – ZoFreX Aug 04 '10 at 16:23
6

Well, javac and the JVM are typically native binaries. They're written in C or whatever. It's certainly possible to write them in Java; you just need a native version first. This is called "bootstrapping".

Fun fact: Most compilers that compile to native code are written in their own language. However, they all had to have a native version written in another language first (usually C). The first C compiler, by comparison, was written in Assembler. I presume that the first assembler was written in machine code. (Or, using butterflies ;)

.class files are bytecode generated by javac. They're not textual; they're binary code similar to machine code (but with a different instruction set and architecture).

The JVM, at run time, has two options: it can either interpret the bytecode (pretending to be a CPU itself), or it can JIT (just-in-time) compile it into native machine code. The latter is faster, of course, but more complex.
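A rough way to watch the two modes is to time the same small method over many rounds. On a HotSpot JVM the later rounds are usually faster once the method has been JIT-compiled, though the exact numbers vary by machine and JVM, so no output is shown here. `JitWarmup` and `sumOfSquares` are names made up for this sketch.

```java
public class JitWarmup {

    /** Small arithmetic loop that becomes "hot" when it is called often. */
    public static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    public static void main(String[] args) {
        long sink = 0; // accumulate results so the calls cannot be optimized away entirely
        for (int round = 1; round <= 10; round++) {
            long start = System.nanoTime();
            for (int call = 0; call < 20_000; call++) {
                sink += sumOfSquares(1_000);
            }
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("round " + round + ": " + micros + " us");
        }
        System.out.println("checksum: " + sink);
    }
}
```

On HotSpot you can compare a normal run with `java -Xint JitWarmup` (interpreter only), or add `-XX:+PrintCompilation` to see which methods get compiled and when.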

Mike Caron
  • Thanks for the XKCD .. hadn't seen that one :) – David J. Liszewski Aug 04 '10 at 15:23
  • Technically, the first compiler could be created by an *interpreter* written in some other language, though the lines between interpretation and JIT compilation are a little blurry as ultimately they both produce native machine code. – Andrzej Doyle Aug 04 '10 at 15:30
  • True. In fact, there are even crazier options. The first C++ compiler was simply a translator that produced C code for use with a C compiler. – Mike Caron Aug 04 '10 at 15:38
  • Huh? Why did someone vote this down? Is there a problem with my answer? – Mike Caron Aug 04 '10 at 15:44
  • 1
    @Mike: Well, javac is not written in C, it's written in Java, so your first sentence is not entirely correct. Otherwise a fine answer :-). – sleske Aug 04 '10 at 18:49
  • @sleske Is javac written in Java? I actually didn't know that. It makes sense, I guess. But, I bet javac 1.0 wasn't written in Java ;) – Mike Caron Aug 04 '10 at 20:00
3

The .class file contains bytecode which is sort of like very high-level Assembly. The compiler could very well be written in Java, but the JVM would have to be compiled to native code to avoid the chicken/egg problem. I believe it is written in C, as are the lower levels of the standard libraries. When the JVM runs, it performs just-in-time compilation to turn that bytecode into native instructions.

ZoFreX
3

Short Explanation

Write code on a text editor, save it in a format that compiler understands - ".java" file extension, javac (java compiler) converts this to ".class" format file (byte code - class file). JVM executes the .class file on the operating system that it sits on.

Long Explanation

Always remember that Java is not a language the operating system understands natively. Java code is interpreted for the operating system by a translator called the Java Virtual Machine (JVM). The JVM can't understand the code that you write in an editor; it needs compiled code. This is where a compiler comes into the picture.

Every computer process involves memory manipulation. We can't just write code in a text editor and compile it. We need to store it first, i.e. save it to a file before compiling.

How will javac (the Java compiler) recognize the saved text as the one to be compiled? - We have a separate file format that the compiler recognizes, i.e. .java. Save the file with the .java extension and the compiler will recognize it and compile it when asked.

What happens while compiling? - The compiler is a second translator (not a technical term) involved in the process; it translates the language the user understands (Java) into the language the JVM understands (bytecode, in .class format).

What happens after compiling? - The compiler produces a .class file that the JVM understands. The program is then executed, i.e. the .class file is executed by the JVM on the operating system.

Facts you should know

1) Java is not multi-platform; it is platform independent.

2) The JVM is developed using C/C++. This is one of the reasons why people call Java a slower language than C/C++.

3) Java bytecode (.class) is a kind of "assembly language" for the JVM, the only language the JVM understands. Any code that compiles to a .class file, or any generated bytecode, can be run on the JVM.

1

Windows doesn't know how to invoke Java programs before installing a Java runtime, and Sun chose to have native commands which collect arguments and then invoke the JVM instead of binding the jar-suffix to the Java engine.

Thorbjørn Ravn Andersen
  • Not in CMD.EXE as far as I know. You cannot just say "foobar.jar" on the command line and have it executed. This might be a Windows 95/98/ME limitation in COMMAND.EXE that resulted in that decision. These days it would be nice though. – Thorbjørn Ravn Andersen Aug 04 '10 at 19:37
-1

The compiler was originally written in C with bits of C++ and I assume that it still is (why do you think the compiler is written in Java as well?). javac.exe is just the C/C++ code that is the compiler.

As a side point, you could write the compiler in Java, but you're right, you have to avoid the chicken and egg problem. To do this you would typically write one or more bootstrapping tools in something like C to be able to compile the compiler.

The .class file contains the bytecodes, the output of the javac compilation process, and these are the instructions that tell the JVM what to do. At runtime these bytecodes are translated to native CPU instructions (machine code) so they can execute on the specific hardware under the JVM.

To complicate this a little, the JVM also optimises and caches machine code produced from the bytecodes to avoid repeatedly translating them. This is known as JIT compilation and occurs as the program is running and bytecodes are being interpreted.

Paolo
  • 2
    It's widely known that the (Sun/Oracle) Java compiler is written in Java. Since Java 6, there's even an official API to it, but long before that, people called it via the unofficial API in the sun.tools packages. – Michael Borgwardt Aug 04 '10 at 15:37
-4
  1. .java file
  2. compiler(JAVA BUILD)
  3. .class(bytecode)
  4. JVM(system software usually build with 'C')
  5. OPERATING PLATFORM
  6. PROCESSOR
Machavity