0

I work with high level languages, and I'm trying to understand the hierarchy of low level languages.

I know that different micro-processors talk in different languages(please correct me if I'm wrong in this assumption), but are we saying that they speak different binary or assembly commands?

Is all binary code the same? Meaning will a set of binary instructions execute the same command on every single CPU or micro-processor?

Thanks everybody. I've been researching this, and I can't find a clear answer anywhere.

alexr101
  • 2
    The answers seem pretty clear to me, for example if you look at MIPS and x86, their machine code is completely different. – harold May 07 '17 at 01:03
  • Thanks harold, I'm just not clear most of this stuff, and still not sure if languages like MIPS or IA-32 are in fact machine code or assembly. I'm not studying this in depth, just trying to grasp basic concepts. – alexr101 May 07 '17 at 01:26
  • 1
    you can find the answer if you look up the instruction sets or machine code for the various processors you know the names of. – old_timer May 07 '17 at 01:49
  • "Assembly" is the human-readable form of binary. Every processor architecture has its own binary code, as you know, created by its designers. This is known as "machine language". Since binary (machine language) is hard to program in, the designers also create mnemonics for those binary operation codes (opcodes), and that's known as "assembly language". They are otherwise the same. An assembler just translates the assembly-language mnemonics directly into binary bits, corresponding to the exact same instructions. So no, not all architectures are the same. – Cody Gray - on strike May 07 '17 at 07:50
  • I understand that this can be kind of confusing to someone new to the field, but I'm not convinced that you did very much of your own research before asking the question. For example, [Wikipedia](https://en.wikipedia.org/wiki/Assembly_language) seems to have a pretty clear answer to this. I don't see how you could have read that article and still been confused. And Wikipedia barely even counts as "research". – Cody Gray - on strike May 07 '17 at 07:51

3 Answers

3

No. An ARM processor executes ARM instructions, a MIPS processor executes MIPS instructions, and so on. There are many different, mutually incompatible instruction sets. The term to use is machine code or machine language: the binary, the bits that make the processor run. Assembly language is ideally a one-to-one set of human-readable mnemonics, a text language that is easier to write and read than machine code. An assembler takes the assembly language and turns it into machine code.

So take this simple function:

unsigned char fun ( unsigned char a, unsigned char b )
{
    return(a+b+3);
}

An ARM implementation might be

00000000 <fun>:
   0:   e2811003    add r1, r1, #3
   4:   e0800001    add r0, r0, r1
   8:   e20000ff    and r0, r0, #255    ; 0xff
   c:   e12fff1e    bx  lr

The machine code is the 0xe2811003 part, and the assembly language that has a one-to-one relationship with that instruction is add r1, r1, #3. This processor has registers named r0, r1, r2, and so on. This compiler conforms to a calling convention that passes the first parameter in r0 and the second in r1, so a is in r0 and b is in r1, and the result has to be returned in r0. So we add 3 to r1, then add r1 (which is now b+3) to r0 (which is a) and save the result in r0, so r0 now holds a+b+3. Since the return type is unsigned char, 8 bits, we need to AND with 0xFF to keep the result an unsigned char, and then return.

I say "might be" because with the same code and the same compiler I can change the compiler options and get

00000000 <fun>:
   0:   e52db004    push    {r11}       ; (str r11, [sp, #-4]!)
   4:   e28db000    add r11, sp, #0
   8:   e24dd00c    sub sp, sp, #12
   c:   e1a03000    mov r3, r0
  10:   e1a02001    mov r2, r1
  14:   e54b3005    strb    r3, [r11, #-5]
  18:   e1a03002    mov r3, r2
  1c:   e54b3006    strb    r3, [r11, #-6]
  20:   e55b2005    ldrb    r2, [r11, #-5]
  24:   e55b3006    ldrb    r3, [r11, #-6]
  28:   e0823003    add r3, r2, r3
  2c:   e20330ff    and r3, r3, #255    ; 0xff
  30:   e2833003    add r3, r3, #3
  34:   e20330ff    and r3, r3, #255    ; 0xff
  38:   e1a00003    mov r0, r3
  3c:   e28bd000    add sp, r11, #0
  40:   e49db004    pop {r11}       ; (ldr r11, [sp], #4)
  44:   e12fff1e    bx  lr

which is an unoptimized version of the same function. It also implements the C code we asked for, it just is not optimized. That is the difference between -O2 and -O0 on the command line.

An x86 version of our simple function:

0000000000000000 <fun>:
   0:   8d 44 3e 03             lea    0x3(%rsi,%rdi,1),%eax
   4:   c3                      retq   

one I like to throw in to see if folks know what it is

00000000 <_fun>:
   0:   1166            mov r5, -(sp)
   2:   1185            mov sp, r5
   4:   9d40 0006       movb    6(r5), r0
   8:   65c0 0003       add $3, r0
   c:   9d41 0004       movb    4(r5), r1
  10:   6040            add r1, r0
  12:   1585            mov (sp)+, r5
  14:   0087            rts pc

And back to ARM: ARM has a 16-bit instruction set called Thumb.

00000000 <fun>:
   0:   3103        adds    r1, #3
   2:   1840        adds    r0, r0, r1
   4:   0600        lsls    r0, r0, #24
   6:   0e00        lsrs    r0, r0, #24
   8:   4770        bx  lr

So hopefully it is very clear that machine code is in no way universal, and in fact neither is compiler output: there is more than one way to compile the same high-level code to assembly language, even for the same target with the same compiler.

Note that I say compile to assembly language. That is a very common thing to do: you already have an assembler and linker, and machine code is hard to read and so hard to debug for the compiler authors, so there is no reason to emit it directly when the assembler can do that job. When you run gcc -o hello hello.c, many programs are run. The gcc compiler itself is a few programs that execute in order, each leaving temporary files behind for the next program. Eventually the assembler is called (unless you specified -S, in which case gcc stops at assembly language) to assemble the output into an object file, and then gcc cleans up the temporary files. This is why it is called a toolchain: compiler to assembler to linker, a chain, a sequence of programs that are run in order.

With gcc, for example, if I put --save-temps on the command line, the intermediate files are kept:

so.i

# 1 "so.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "so.c"
unsigned char fun ( unsigned char a, unsigned char b )
{
    return(a+b+3);
}

so.s

    .cpu arm7tdmi
    .eabi_attribute 20, 1
    .eabi_attribute 21, 1
    .eabi_attribute 23, 3
    .eabi_attribute 24, 1
    .eabi_attribute 25, 1
    .eabi_attribute 26, 1
    .eabi_attribute 30, 2
    .eabi_attribute 34, 0
    .eabi_attribute 18, 4
    .file   "so.c"
    .text
    .align  1
    .p2align 2,,3
    .global fun
    .syntax unified
    .code   16
    .thumb_func
    .fpu softvfp
    .type   fun, %function
fun:
    adds    r1, r1, #3
    adds    r0, r0, r1
    lsls    r0, r0, #24
    lsrs    r0, r0, #24
    @ sp needed
    bx  lr
    .size   fun, .-fun
    .ident  "GCC: (GNU) 6.3.0"

and then it makes the object file, which is a binary that we can inspect with objdump as above.

This is a very boring function, so the intermediate files are not very exciting. But if you had includes, and those includes had includes, and so on, the preprocessed file (so.i here) would be one really big file with all the includes expanded, so the real compiler only has to work on that one file.

old_timer
  • 1
    Also, as you may know, successful processor families (successful as in the company doesn't go under) evolve over time: the 8088/86 to 80186 to 80286 to 80386 to 80486 and so on to the present. With x86 we went from a 16-bit to a 32-bit to a 64-bit processor with new stuff added each time, new instructions and other features. With ARM there has been an evolution with instructions added each generation, up to ARMv8 being a complete do-over of the instruction set for the 64-bit solution. MIPS, etc. – old_timer May 07 '17 at 01:39
  • 1
    Often they will take a bit pattern that was formerly an undefined instruction, and turn that into an instruction or into a prefix that expands to another pool of instructions. so not only does a program compiled for x86 not run on an ARM it might not run on another x86 if it is from a different generation or is in a different mode. – old_timer May 07 '17 at 01:41
1

Binary is a form of numeric representation, along with decimal and hexadecimal. To refer to code as binary is to refer to the way that CPU instructions (machine code or object code) and data such as memory addresses are represented at the hardware level using transistors and the like.

CPUs may have different instruction sets, such as Intel's x86, ARM, MIPS, etc.

Here is an example of x86-64 instructions being represented as hexadecimal values by the disassembler objdump:

$ objdump -dj .text test | grep -A12 "<main>:"
00000000004004f9 <main>:
  4004f9:   55                      push   %rbp
  4004fa:   48 89 e5                mov    %rsp,%rbp
  4004fd:   48 83 ec 10             sub    $0x10,%rsp
  400501:   c7 45 f8 0a 00 00 00    movl   $0xa,-0x8(%rbp)
  400508:   8b 45 f8                mov    -0x8(%rbp),%eax
  40050b:   89 c7                   mov    %eax,%edi
  40050d:   e8 db ff ff ff          callq  4004ed <test>
  400512:   89 45 fc                mov    %eax,-0x4(%rbp)
  400515:   8b 45 fc                mov    -0x4(%rbp),%eax
  400518:   c9                      leaveq 
  400519:   c3                      retq   
  40051a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

The memory addresses (leftmost column) and the hexadecimal values of operation codes and operands (middle column) could also be represented in binary or in decimal (base 2 and base 10, respectively).

I know that different micro-processors talk in different languages(please correct me if I'm wrong in this assumption), but are we talking about assembly or binary?

Machine instructions can be written as binary values, as hexadecimal values (see the disassembly above), or as human-readable assembly mnemonics (rightmost column above).

To make this clearer, here is a snapshot of an Intel x86 Assembler Instruction Set Opcode Table:

[Image: Intel x86 Assembler Instruction Set Opcode Table]

Is all binary code the same? Meaning will a set of binary instructions execute the same command on every single CPU or micro-processor?

Executable code must be represented in a manner that conforms to a CPU's instruction set. For example, a MIPS processor cannot execute x86 code and an x86 processor cannot execute MIPS code. There is no universal instruction set.

julian
  • Thank this clears it up. While I'm not looking to get too much into low level languages it does clarify the concept more, and address my specific areas of doubt. – alexr101 May 07 '17 at 01:34
  • 1
    Chapters 1, 2 and 3 of [Computer Systems: A Programmer's Perspective](http://csapp.cs.cmu.edu/) discuss this in depth, in case you are interested. – julian May 07 '17 at 01:40
  • Definitely, thanks! – alexr101 May 07 '17 at 03:18
0

Assembly is just a low-level language that humans can read. The real machine code is binary, which you can translate into assembly and then, if you like, into a high-level language like C.

Here is a simple example that translates a machine code word (0x2237FFF1) into MIPS assembly.

0x2237FFF1 is the word in hexadecimal.

Converted to binary:

0010 0010 0011 0111 1111 1111 1111 0001

Now I read the opcode, the top six bits (001000), and see that it is an I-type instruction, addi.

Now I group the bits into the I-type fields:

   op     rs    rt       imm
 001000 10001 10111 1111111111110001
   8      17    23        -15

Looking at the MIPS reference sheet, registers 17 and 23 are $s1 and $s7, so the instruction must be

 addi $s7,$s1,-15

If you like, you can go further and convert it to C, where it is a simple addition.

Adam