I have a confusion when differentiating between Source code, Object code, Assembly code, and Machine code

Question

I read everywhere that we write source code (in a high-level language) and the compiler converts it into machine code (a low-level language). Then I read that there is an assembler, which converts assembly code into machine code. Then, when differentiating compilers from interpreters, I read that a compiler first converts the whole program into object code, while an interpreter converts directly into machine code, skipping object code. Now I am confused and have the following questions:

1. Where does the assembly code come from, if the compiler converts source code directly into machine code?
2. What is the difference between object code and machine code?
3. Who converts source code to assembly code?
4. What are high-level and low-level languages, and how do I tell them apart?
5. Are assembly code and object code high level or low level?

Related: https://stackoverflow.com/questions/466790/assembly-code-vs-machine-code-vs-object-code — Arnav Borborah, Feb 28 '18 at 13:48
1. Nowhere. That setup does not involve assembly. 3. The compiler. 5. Low level. — Jester, Feb 28 '18 at 14:00
1. Many compilers do have an option (*e.g.*, `-S` for `gcc`) that will output the assembly code corresponding to the source code it is compiling. But without such an option stipulated, it may or may not go to an intermediate "assembly" code form. That probably depends upon the compiler. So, 3. the compiler if you ask it nicely. 4. can be a matter of opinion to some degree, but a higher level language is further abstracted from the hardware behavior, while lower level language has less abstraction. Assembly language is decidedly lower level since it directly corresponds to machine instructions. — lurker, Feb 28 '18 at 15:00
2. Machine code consists of binary data directly interpretable by the machine. It's ready to load somewhere into memory and run. Object code has machine codes in it, but other information about memory locations for data and code, and tables that define symbols that help the loader to relocate the code to a location in memory chosen at a later point in time (oversimplifying a bit). — lurker, Feb 28 '18 at 15:06
As a bonus.. there are languages with abstract virtual "machine code", like Java, which is compiled into java opcodes (".class" files). And those are executed on virtual java machines, which are sort-of interpreters/compilers interpreting or compiling the java opcodes into target native machine code depending on which real machine the java virtual machine is being run. So from Java language point of view the "class" files are final binary (like machine code), from the CPU point of view those are just another data file to be processed by interpreter/compiler. — Ped7g, Feb 28 '18 at 15:18

score 3 · Answer 1 · answered Feb 28 '18 at 14:35

There is no simple answer to most of your questions as it can vary from compiler to compiler. Some compilers emit other high level languages such as C.

Generally for compilers that use an assembler the backend will emit a temporary asm file which the assembler converts to object code. If you have access to GCC you can see the chain of commands that it uses with the -v option. For instance, for the C source

int main(){ return 1; }

the command

gcc -v -o test test.c

outputs (and I've filtered a lot)

cc1 test.c -o /tmp/cc9Otd7R.s
as -v --64 -o /tmp/cc5KhWEM.o /tmp/cc9Otd7R.s
collect2 --eh-frame-hdr -m elf_x86_64 -o test /tmp/cc5KhWEM.o

For me, object code is the binary code emitted in the format required for the machine and OS architecture. For instance, this may be in the ELF format, arranged in sections. Machine code is just the binary encoding of the assembly instructions. For instance, take this bit of disassembly

48 83 ec 10 sub rsp,0x10

The first four hex values are the 4 bytes of machine code, followed by the assembly instruction they encode.
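
To make the link between those bytes and the text concrete, here is a rough Python sketch that decodes just this one x86-64 encoding form (REX prefix, opcode 0x83, ModRM byte, 8-bit immediate). The function and table names are made up for this example, and it is nowhere near a real disassembler; it only shows that the four bytes carry exactly the information in `sub rsp,0x10`.

```python
# Decode one narrow x86-64 encoding form: REX.W + 0x83 /digit + imm8,
# with a register-direct operand. Names here are illustrative only.

GROUP1_OPS = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]  # 0x83 /0 .. /7
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_rex_83(code: bytes) -> str:
    rex, opcode, modrm, imm8 = code          # exactly four bytes expected
    if rex & 0xF0 != 0x40 or not rex & 0x08:
        raise ValueError("expected a REX prefix with the W bit set")
    if rex & 0x01:
        raise ValueError("registers r8..r15 are not handled in this sketch")
    if opcode != 0x83:
        raise ValueError("only opcode 0x83 is handled in this sketch")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 3:
        raise ValueError("only register-direct operands are handled")
    imm = imm8 - 0x100 if imm8 & 0x80 else imm8   # imm8 is sign-extended
    return f"{GROUP1_OPS[reg]} {REGS64[rm]},{imm:#x}"


print(decode_rex_83(bytes.fromhex("4883ec10")))   # sub rsp,0x10
```

The 0x48 byte selects 64-bit operand size, 0x83 selects the "operation with 8-bit immediate" group, 0xec picks `sub` and `rsp`, and 0x10 is the immediate.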

As for who converts source code to assembly: as described above, this is the compiler backend.
As for high level versus low level (your last two questions): this is somewhat subjective, but assembly is low level. You don't normally modify object code by hand (I have on occasion done so with a hex editor, but such changes are generally very small).

old_timer · Answer 2 · 2018-03-01T03:39:12.497

An assembler takes assembly language, processor instructions that are easier for humans to read and write, and turns those into machine code, or binary versions of those instructions.

assembly language vectors.s

.thumb

.globl _start
_start:
.word 0x20001000
.word reset
.word foo
.word foo
.word foo
.word foo
.word foo
.word foo

.thumb_func
reset:
    bl fun
.thumb_func
foo:
    b foo

.globl dummy
dummy:
    bx lr

assemble then disassemble

arm-none-eabi-as vectors.s -o vectors.o
arm-none-eabi-objdump -D vectors.o > vectors.list

related portion of the disassembly

Disassembly of section .text:

00000000 <_start>:
   0:   20001000
    ...

00000020 <reset>:
  20:   f7ff fffe   bl  0 <fun>

00000024 <foo>:
  24:   e7fe        b.n 24 <foo>

00000026 <dummy>:
  26:   4770        bx  lr

The .word lines are not instructions; they are a way to put data into the output binary. In this case I am generating a vector table. The disassembler is not showing everything yet; we will see the rest shortly. The assembler has left placeholders for the linker to fill in, which we will also see shortly. So this is what an object looks like: the assembly has been turned into machine code. For example, the assembly bx lr became the machine code 0x4770.
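
As a small illustration of what the assembler does for that one instruction, here is a sketch in Python (my own, not part of any toolchain). It assumes the 16-bit Thumb `bx Rm` layout as I understand it from the ARM documentation: a fixed pattern 0x4700 with the register number placed in bits 3 to 6. The function name is made up for the example.

```python
LR = 14  # lr is register r14


def encode_thumb_bx(rm: int) -> int:
    """Return the 16-bit Thumb machine code for `bx Rm`, with Rm in 0..15."""
    if not 0 <= rm <= 15:
        raise ValueError("register number out of range")
    return 0x4700 | (rm << 3)


print(hex(encode_thumb_bx(LR)))  # 0x4770, matches `bx lr` in the listing
print(hex(encode_thumb_bx(0)))   # 0x4700, which is `bx r0`
```

An assembler is essentially a large table of rules like this one, plus bookkeeping for labels and data.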

There are exceptions to the rule, generally for specific reasons, but it generally does not make sense to have a compiler compile directly to machine code. You have to have an assembler for a target anyway, so it is already there; use it. It is also far easier for a compiler writer to debug assembly code than to debug machine code. Among the exceptions there is the "just because I want to" reason, like climbing the mountain instead of going around it "because it was there". Then there is the just-in-time (JIT) reason, and some others: a JIT needs to get to machine code sooner and/or with a single tool/library/driver, so you may see those skip the assembly step, even though that is harder to develop. You can often test whether your compiler uses an assembler by renaming the assembler and seeing whether the build breaks. You have to hit the right binary, though; the one you run on the command line may be a front for the real one. In the case of gcc, I think the gcc program we use is just a front for cc1, perhaps another program or two, and the assembler and linker, all spawned from gcc unless you tell it not to.

so we take our simple entry program

#define FIVE 5
unsigned int more_fun ( unsigned int );
void fun ( void )
{
    more_fun(FIVE);
}

compile

arm-none-eabi-gcc -mthumb -save-temps -O2 -c fun.c -o fun.o
arm-none-eabi-objdump -D fun.o > fun.list

The first temp file is the preprocessor's output: the preprocessor takes the #defines and #includes and basically gets rid of them, producing the file that will be sent to the compiler.

# 1 "fun.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "fun.c"


unsigned int more_fun ( unsigned int );
void fun ( void )
{
    more_fun(5);
}
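
The substitution step above can be imitated in a few lines of Python. This is only a toy: it handles simple object-like `#define NAME value` lines, and ignores function-like macros, `#include`, string literals, comments and the `# 1 "fun.c"` line markers that the real preprocessor emits. The function name is invented for the sketch.

```python
import re


def expand_object_macros(source: str) -> str:
    """Replace simple `#define NAME value` macros in `source`; toy model only."""
    macros = {}
    output = []
    for line in source.splitlines():
        match = re.match(r"\s*#\s*define\s+(\w+)\s+(.*)", line)
        if match:
            macros[match.group(1)] = match.group(2).strip()
            output.append("")            # the directive itself disappears
            continue
        # swap every identifier that names a macro for its replacement text
        output.append(re.sub(r"\b\w+\b",
                             lambda tok: macros.get(tok.group(0), tok.group(0)),
                             line))
    return "\n".join(output)


fun_c = """#define FIVE 5
unsigned int more_fun ( unsigned int );
void fun ( void )
{
    more_fun(FIVE);
}
"""
print(expand_object_macros(fun_c))
```

The printed text has `more_fun(5);` where the source had `more_fun(FIVE);`, which is the same change visible in the temp file.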

Then the compiler itself is called and that compiles to assembly language

    .cpu arm7tdmi
    .fpu softvfp
    .eabi_attribute 20, 1
    .eabi_attribute 21, 1
    .eabi_attribute 23, 3
    .eabi_attribute 24, 1
    .eabi_attribute 25, 1
    .eabi_attribute 26, 1
    .eabi_attribute 30, 2
    .eabi_attribute 34, 0
    .eabi_attribute 18, 4
    .code   16
    .file   "fun.c"
    .text
    .align  2
    .global fun
    .code   16
    .thumb_func
    .type   fun, %function
fun:
    push    {r3, lr}
    mov r0, #5
    bl  more_fun
    @ sp needed
    pop {r3}
    pop {r0}
    bx  r0
    .size   fun, .-fun
    .ident  "GCC: (15:4.9.3+svn231177-1) 4.9.3 20150529 (prerelease)"

Then the assembler is called to turn that into an object. The disassembly of the object shows what was produced:

Disassembly of section .text:

00000000 <fun>:
   0:   b508        push    {r3, lr}
   2:   2005        movs    r0, #5
   4:   f7ff fffe   bl  0 <more_fun>
   8:   bc08        pop {r3}
   a:   bc01        pop {r0}
   c:   4700        bx  r0
   e:   46c0        nop         ; (mov r8, r8)

Now the bl 0 is not yet real. more_fun is an external label, so the linker will have to come in and fix this, as we will see shortly.

more_fun.c same story

source code

#define ONE 1
unsigned int more_fun ( unsigned int x )
{
    return(x+ONE);
}

compiler input

# 1 "more_fun.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "more_fun.c"


unsigned int more_fun ( unsigned int x )
{
    return(x+1);
}

compiler output (assembler input)

    .cpu arm7tdmi
    .fpu softvfp
    .eabi_attribute 20, 1
    .eabi_attribute 21, 1
    .eabi_attribute 23, 3
    .eabi_attribute 24, 1
    .eabi_attribute 25, 1
    .eabi_attribute 26, 1
    .eabi_attribute 30, 2
    .eabi_attribute 34, 0
    .eabi_attribute 18, 4
    .code   16
    .file   "more_fun.c"
    .text
    .align  2
    .global more_fun
    .code   16
    .thumb_func
    .type   more_fun, %function
more_fun:
    add r0, r0, #1
    @ sp needed
    bx  lr
    .size   more_fun, .-more_fun
    .ident  "GCC: (15:4.9.3+svn231177-1) 4.9.3 20150529 (prerelease)"

disassembly of the object (assembler output)

Disassembly of section .text:

00000000 <more_fun>:
   0:   3001        adds    r0, #1
   2:   4770        bx  lr

Now we link all of these together. (There is a reason it is called a toolchain: compile, assemble, and link are a series of tools chained together, where the output of one feeds the input of the next.)

arm-none-eabi-ld -Ttext=0x2000 vectors.o fun.o more_fun.o -o run.elf
arm-none-eabi-objdump -D run.elf > run.list
arm-none-eabi-objcopy -O srec run.elf run.srec


Disassembly of section .text:

00002000 <_start>:
    2000:   20001000 
    2004:   00002021 
    2008:   00002025 
    200c:   00002025 
    2010:   00002025 
    2014:   00002025 
    2018:   00002025 
    201c:   00002025 

00002020 <reset>:
    2020:   f000 f802   bl  2028 <fun>

00002024 <foo>:
    2024:   e7fe        b.n 2024 <foo>

00002026 <dummy>:
    2026:   4770        bx  lr

00002028 <fun>:
    2028:   b508        push    {r3, lr}
    202a:   2005        movs    r0, #5
    202c:   f000 f804   bl  2038 <more_fun>
    2030:   bc08        pop {r3}
    2032:   bc01        pop {r0}
    2034:   4700        bx  r0
    2036:   46c0        nop         ; (mov r8, r8)

00002038 <more_fun>:
    2038:   3001        adds    r0, #1
    203a:   4770        bx  lr
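
The layout in this listing can be reproduced with a toy model of the linker's two jobs: deciding where each object goes, then resolving the external labels. The sketch below is mine and heavily simplified. The sizes and offsets are read off the three object disassemblies above, `ToyObject` and `toy_link` are invented names, and a real linker also deals with sections, alignment and the actual instruction encodings.

```python
from dataclasses import dataclass, field


@dataclass
class ToyObject:
    name: str
    size: int                                   # bytes of .text in this object
    defines: dict                               # label -> offset inside the object
    needs: list = field(default_factory=list)   # (placeholder offset, external label)


def toy_link(objects, base):
    """Place objects back to back starting at `base`, then resolve externals."""
    start, symbols = {}, {}
    address = base
    for obj in objects:                         # pass 1: decide where things go
        start[obj.name] = address
        for label, offset in obj.defines.items():
            symbols[label] = address + offset
        address += obj.size
    fixups = {}
    for obj in objects:                         # pass 2: resolve the placeholders
        for offset, label in obj.needs:
            fixups[start[obj.name] + offset] = symbols[label]
    return symbols, fixups


objects = [
    ToyObject("vectors.o", 0x28,
              {"_start": 0x00, "reset": 0x20, "foo": 0x24, "dummy": 0x26},
              [(0x20, "fun")]),
    ToyObject("fun.o", 0x10, {"fun": 0x00}, [(0x04, "more_fun")]),
    ToyObject("more_fun.o", 0x04, {"more_fun": 0x00}),
]

symbols, fixups = toy_link(objects, 0x2000)
for label, address in symbols.items():
    print(f"{address:08x} <{label}>")
for site, target in fixups.items():
    print(f"placeholder at {site:#x} must reach {target:#x}")
```

The addresses it prints for fun and more_fun (0x2028 and 0x2038) and the two placeholder sites (0x2020 and 0x202c) line up with the linked disassembly.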

the linker has adjusted the external label, in this case by modifying the instruction for the correct offset.

   4:   f7ff fffe   bl  0 <more_fun>
202c:   f000 f804   bl  2038 <more_fun>
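
To see what "modifying the instruction for the correct offset" means, the two halfwords of a Thumb bl can be decoded by hand. The following Python sketch assumes the classic two-halfword Thumb bl layout (11 offset bits in each halfword, relative to the instruction address plus 4), which is my reading of the ARM documentation for this core; the helper names are made up.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def thumb_bl_target(address: int, hw1: int, hw2: int) -> int:
    """Destination of a Thumb `bl` located at `address` with halfwords hw1, hw2."""
    if hw1 >> 11 != 0b11110 or hw2 >> 11 != 0b11111:
        raise ValueError("not a classic Thumb bl pair")
    high = sign_extend(hw1 & 0x7FF, 11) << 12   # upper part of the offset
    low = (hw2 & 0x7FF) << 1                    # lower part, in halfword units
    return address + 4 + high + low             # pc reads as address + 4


print(hex(thumb_bl_target(0x202C, 0xF000, 0xF804)))  # linked: 0x2038
print(hex(thumb_bl_target(0x0004, 0xF7FF, 0xFFFE)))  # placeholder: 0x4
```

Taken purely as bits, the unlinked f7ff fffe encodes a branch back to the instruction's own address. The real destination lives in a relocation entry in the object file, which the linker uses to rewrite the halfwords to f000 f804.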

The ELF file format is one type of "binary" file. It is binary in that if you open it with a text editor you see some text but mostly garbage. There are other "binary" file formats like the Motorola S-record, which in this case only includes the real stuff (the machine code and any data), whereas the ELF has debug information like the strings "fun", "more_fun", etc., which the disassembler happened to use to make the output a little prettier. Motorola S-record and Intel Hex are ASCII file formats; the S-record for this program looks like this:

S00B000072756E2E73726563C4
S113200000100020212000002520000025200000D1
S113201025200000252000002520000025200000A8
S113202000F002F8FEE7704708B5052000F004F858
S10F203008BC01BC0047C04601307047EA
S9032000DC
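
These lines are easy to pick apart. Here is a Python sketch of a reader for one record, assuming the usual S-record layout (type digit, byte count, big-endian address, data, then a checksum that is the ones' complement of the low byte of the sum of the count, address and data bytes). The function names are mine.

```python
ADDRESS_BYTES = {"0": 2, "1": 2, "2": 3, "3": 4, "5": 2, "7": 4, "8": 3, "9": 2}


def parse_srecord(line: str):
    """Return (record_type, address, data) for one line, verifying the checksum."""
    line = line.strip()
    if not line.startswith("S"):
        raise ValueError("not an S-record")
    record_type = line[1]
    raw = bytes.fromhex(line[2:])
    count, body, checksum = raw[0], raw[1:-1], raw[-1]
    if count != len(raw) - 1:
        raise ValueError("byte count mismatch")
    if ~(count + sum(body)) & 0xFF != checksum:
        raise ValueError("bad checksum")
    n = ADDRESS_BYTES[record_type]
    return record_type, int.from_bytes(body[:n], "big"), body[n:]


def halfwords_le(data: bytes):
    """Split data into 16-bit little-endian values, as Thumb code is stored."""
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


kind, address, data = parse_srecord("S10F203008BC01BC0047C04601307047EA")
print(kind, hex(address), [f"{h:04x}" for h in halfwords_le(data)])
print(parse_srecord("S00B000072756E2E73726563C4")[2])   # b'run.srec'
```

The S1 record at 0x2030 decodes to bc08 bc01 4700 46c0 3001 4770, the same machine code the disassembly shows from 0x2030 to 0x203a, and the S0 header just carries the file name.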

These are not used as much anymore, but they are not completely useless. You used to need this format to program a ROM, and it is the personal preference of the tool makers as to what file formats they support. How does the binary get burned into the flash of a microcontroller? Some tool takes those bits from the host/development machine and, through some interface and some software, moves them to the target. What binary file formats does that tool support? It is up to whoever wrote the tool to choose one or more formats.

Back when compilers were not affordable in various ways (the cost to buy them and/or the storage space to hold the program on your computer, plus intermediate data, etc.), assemblers could be used to make a whole program. You see directives like .org 100h for that purpose. Within a "toolchain" the assembler might still have that feature, but as part of the chain its job is to get from assembly language to an object format, doing most of the conversion to machine code and other data. It is certainly possible for a compiler to do all the work and output a finished binary, but as part of a toolchain the sane method is to go from the source code to assembly language. The compiler tools we are used to (gcc, msvc, clang, etc.) will, unless told not to, spawn the assembler and the linker for us as well as the compiler, making it seem like the compiler went from source to final binary in one magical step. The linker takes the individual objects, some of which have unresolved external labels, decides where in the memory image they will go, and resolves externals as needed. How much the linker does is very much part of the system design for these tools. The design can be such that the linker does not modify individual instructions and only places addresses in agreed-upon places. An example of this:

vectors.s

.globl _start
_start:
    bl fun
    b .
.global hello
hello: .word 0

fun.c

#define FIVE 5
extern unsigned int hello;
void fun ( void )
{
    hello+=FIVE;
}

fun.o disassembly

Disassembly of section .text:

00000000 <fun>:
   0:   e59f200c    ldr r2, [pc, #12]   ; 14 <fun+0x14>
   4:   e5923000    ldr r3, [r2]
   8:   e2833005    add r3, r3, #5
   c:   e5823000    str r3, [r2]
  10:   e12fff1e    bx  lr
  14:   00000000    andeq   r0, r0, r0

So we can see that it loads a number from offset/address 0x14 into r2, and then that number is used as an address to read hello. The value that was read has 5 added to it, and then the address in r2 is used to save hello back to memory. So what is at 0x14 is a placeholder left by the compiler so the linker can place the address of hello there, which we see once linked:

Disassembly of section .text:

00002000 <_start>:
    2000:   eb000001    bl  200c <fun>
    2004:   eafffffe    b   2004 <_start+0x4>

00002008 <hello>:
    2008:   00000000    andeq   r0, r0, r0

0000200c <fun>:
    200c:   e59f200c    ldr r2, [pc, #12]   ; 2020 <fun+0x14>
    2010:   e5923000    ldr r3, [r2]
    2014:   e2833005    add r3, r3, #5
    2018:   e5823000    str r3, [r2]
    201c:   e12fff1e    bx  lr
    2020:   00002008    andeq   r2, r0, r8

0x2020 now holds the address of hello. The compiler built the program such that this address could easily be filled in by the linker, and the linker filled it in. It is certainly possible to do this with branch/jump addresses too; different toolchains, or different targets from the same tools, will produce different solutions, and it usually has to do with the instruction set. Say you have an instruction set with a near call (relative) and a far call (absolute). Do you compile externals with a far call so it always works? Or do you build for a near call and take the risk that the linker has to put a trampoline in?
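
The address arithmetic in the linked fun can be checked with a short Python sketch. It assumes ARM-state rules as I understand them (reading pc gives the instruction address plus 8, and `ldr Rd, [pc, #imm]` has a 12-bit immediate with an add/subtract bit), and it models memory as a plain dictionary of words. The function names are invented for the example.

```python
def arm_ldr_literal_address(address: int, instruction: int) -> int:
    """Address read by an ARM-state `ldr Rd, [pc, #imm12]` located at `address`."""
    if (instruction >> 26) & 0b11 != 0b01 or (instruction >> 25) & 1:
        raise ValueError("not an immediate-offset load/store")
    if (instruction >> 16) & 0xF != 15:
        raise ValueError("base register is not pc")
    offset = instruction & 0xFFF
    if not (instruction >> 23) & 1:             # U bit clear means subtract
        offset = -offset
    return address + 8 + offset                 # pc reads as address + 8


def run_fun(memory: dict, fun_address: int, ldr_instruction: int) -> None:
    """Replay the four data-moving steps of fun() on a word-addressed dict."""
    literal = arm_ldr_literal_address(fun_address, ldr_instruction)
    r2 = memory[literal]                        # ldr r2, [pc, #12]
    r3 = memory[r2]                             # ldr r3, [r2]
    r3 = (r3 + 5) & 0xFFFFFFFF                  # add r3, r3, #5
    memory[r2] = r3                             # str r3, [r2]


memory = {0x2008: 0, 0x2020: 0x00002008}        # hello, and the word the linker filled in
run_fun(memory, 0x200C, 0xE59F200C)
print(hex(arm_ldr_literal_address(0x200C, 0xE59F200C)), memory[0x2008])
```

The instruction bytes e59f200c are identical before and after linking; only the data word they point at changed, from 0 to 0x2008.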

Not that exact thing but I can make gcc do this for thumb/arm fairly easily.

.thumb
.globl _start
_start:
    bl fun
    b .
.global hello
hello: .word 0


#define FIVE 5
extern unsigned int hello;
void fun ( void )
{
    hello+=FIVE;
}

disassembly of linked binary

00002000 <_start>:
    2000:   f000 f812   bl  2028 <__fun_from_thumb>
    2004:   e7fe        b.n 2004 <_start+0x4>

00002006 <hello>:
    2006:   00000000    andeq   r0, r0, r0
    ...

0000200c <fun>:
    200c:   e59f200c    ldr r2, [pc, #12]   ; 2020 <fun+0x14>
    2010:   e5923000    ldr r3, [r2]
    2014:   e2833005    add r3, r3, #5
    2018:   e5823000    str r3, [r2]
    201c:   e12fff1e    bx  lr
    2020:   00002006    andeq   r2, r0, r6
    2024:   00000000    andeq   r0, r0, r0

00002028 <__fun_from_thumb>:
    2028:   4778        bx  pc
    202a:   46c0        nop         ; (mov r8, r8)
    202c:   eafffff6    b   200c <fun>

Because of the way this specific instruction set works, you can't get from thumb code to arm code using the bl (basically call) instruction. You have to use bx, which is just a branch (jump), not a call. So the linker placed a trampoline for us: some code used to bounce from one thing to another.
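
The path through the trampoline can be followed with a little arithmetic. This Python sketch assumes that in thumb state `bx pc` reads pc as the instruction address plus 4 (with bit 0 clear, so execution continues in arm state), and that an arm `b` holds a signed 24-bit word offset relative to the instruction address plus 8. The helper names are mine.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def thumb_bx_pc_target(address: int) -> int:
    """Where a word-aligned thumb `bx pc` lands (in arm state)."""
    if address % 4:
        raise ValueError("this sketch only handles a word-aligned bx pc")
    return address + 4


def arm_branch_target(address: int, instruction: int) -> int:
    """Destination of an arm-state `b`/`bl` located at `address`."""
    if (instruction >> 25) & 0b111 != 0b101:
        raise ValueError("not an arm branch")
    return address + 8 + (sign_extend(instruction & 0xFFFFFF, 24) << 2)


arm_entry = thumb_bx_pc_target(0x2028)                  # bx pc in __fun_from_thumb
print(hex(arm_entry))                                   # 0x202c
print(hex(arm_branch_target(arm_entry, 0xEAFFFFF6)))    # 0x200c, which is fun
```

So the call goes 0x2000 (thumb bl) to 0x2028 (bx pc), switches mode at 0x202c, and branches to fun at 0x200c. The nop at 0x202a is just padding so the arm instruction sits on a word boundary.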

Not all instruction sets are easy to disassemble, and/or the toolchain doesn't include a disassembler; it is not a required part of a toolchain. But you can and should repeat this using gnu and other tools, for this or other targets. As you can see, I don't need special hardware, and I don't have to write more than a dozen lines of code to see these tools at work.

Assembly is low level. Ideally it has a one-to-one relationship with machine code, but as you can see it does not really. Machine code is low level. C, Python, C++, etc. are high-level languages. Does it need a compiler? High level. An assembler? Low level. But there are exceptions, and there are folks that will argue this topic. High and low are relative terms and as such are subject to the opinion of the viewer. — old_timer, Mar 01 '18 at 03:32

Adi219 · Answer 3 · 2018-02-28T14:13:31.967

-1

All apart from source code are low-level languages.

I believe object and machine code refer to the same thing.

There is no direct conversion from source to assembly code as source code is generally converted directly to machine code. An assembler can be used to convert assembly code to machine code (assembly language has a 1:1 correspondence with machine code). A compiler is used to convert source code directly into machine code.

Assemblers are used because, as machine code is different for each type of computer, assembly languages are also specific to each type of computer.

A high-level language is one where we abstract low-level languages into easy-to-read-and-understand code. It is an abstraction to help us be more productive whilst coding.

A low-level language is one where there is little or no abstraction from a computer's instruction set.

edited Feb 28 '18 at 14:13

answered Feb 28 '18 at 13:53


Your answer is satisfying but not complete. – Feb 28 '18 at 14:00
1. If a source code is generally converted into a machine code, then why there is an assembler? – Feb 28 '18 at 14:03
2. Who write assembly language? – Feb 28 '18 at 14:04
"A high-level language is one where we use words and numbers" In C and C++ we use words and numbers; but they're both low level languages. I don't think you described this very well; since it's actually relative. "high and low" are abstract ideas, not absolutes. – UKMonkey Feb 28 '18 at 14:09
If your compiler directly produces machine code, then you don't need an assembler. – Jester Feb 28 '18 at 14:19
Who writes assembler? Some examples: for very small embedded devices; for features not supported by high level languages, like runtime exception handling; for features not yet supported by your compiler, like the latest vector extensions. – Paul Floyd Feb 28 '18 at 14:44
Assembly file is for me still "source-file", because that's what I'm editing by hand, and what is at the start of the chain producing the final binary/executable/whatever. I mean in the case, that I'm programming in assembly; when the assembly file is like result of compiling C file, then it is "intermediate-build-file", not source-file. – Ped7g Feb 28 '18 at 15:13
