6

I am really into understanding programming from the bottom up. So, I have learned the internal construction of a tiny 64kb computer, because I'm super interested in understanding computers from the transistor level. I understand transistors, the creation of multiplexers, decoders, creation of ALU, etc.

I get that for LC3, which is what I learned, opcodes like 0001 011 011 100001 etc will mean that the 0001 will get decoded as an Add instruction etc. Yet, I am confused as to how we can write assembly to lead to this. I understand an assembler translates an instruction like ADD R3, R1, R2 and turns it into machine code, but what's really bugging me is how these ASCII characters get "interpreted" into the machine code.

I know at the electronic level how such an instruction is processed, like JMP changing the program counter, but at the rudimentary level, how do the assembly instructions turn into machine code/binary? I do not get how it goes from assembly to machine code.

I couldn't find much online, but a theory I came up with is that the typed keys actually just send an electrical signal which is itself binary. I still don't get how the computer architecture turns this "ADD" into 0001, as it would need to understand the ADD in its entirety, not just the binary for each character. So, what is the process of turning the assembly into binary that can then control the logic gates, decoders, sign extension, etc.?

EDIT: For those asking which book I use, it's Introduction to Computing Systems: From Bits and Gates to C and Beyond 2nd Edition (Patt) It goes from building logic gates from P/N transistors to assembly to C. I could not recommend it more for anyone who wants an overview of the entire process.

Michael W
  • 3
    Do you mean you want to learn how an assembler assembles your code? – KYHSGeekCode Mar 26 '18 at 10:54
  • Or you want to know how machine code is actually executed by cpu? – KYHSGeekCode Mar 26 '18 at 10:55
  • 4
    eee... that ASCII text file is turned into binary file during assembling, not in real-time, i.e. there's no need for any simple way to translate `"ADD"` into `0001`, it's done by tens/hundreds of instructions used to write text parser, and assign particular text tokens particular bit-output value, concatenating them together with register selection, into final machine code word, which is then stored in the binary executable (if you are really into this part, check some info about parsing text (algorithms) .. it's no big deal, especially in some high level language). – Ped7g Mar 26 '18 at 10:55
  • My experience of machine code is old and based on experience of 6502 and z80 processors. Assembly language was a simple direct translation of binary machine code. Assembly such as lda 09 (load 09 into the accumulator) would have direct hexadecimal or binary equivalents for its operators and operands. I used to code in hex, using a simple hex keypad and display. No mystery to it. – Malcolm Farrelle Mar 26 '18 at 11:01
  • 5
    There just isn't much "process", an assembler is a pretty simple program and they have existed since the 1950s. It merely converts the human readable "JMP" to the 1s and 0s that a processor understands. That conversion is never complicated since opcodes directly map to instructions. It does help you by allowing to use a label instead of an address for the target of the jump, saving you from the hassle of computing the address yourself. The end-result is a binary file, it is the job of the OS to load it into memory so the processor can execute it. – Hans Passant Mar 26 '18 at 11:02
  • "never complicated" .. have you looked at EVEX? :-> – Jester Mar 26 '18 at 11:09
  • 2
    how does your question get interpreted by a human? one letter/word at a time. at assemble time (not runtime) there is a program that parses the text which is just ASCII formatted binary in a file and converts that to another binary format which you claimed you understood, for the most part it is exactly the same way you would do it on pencil and paper, look up the opcode and operands in asm, then write down the binary/machine code that matches. Just done with a program. as pointed out dozens/hundreds/thousands of instructions per line/instruction to implement this algorithm/program. – old_timer Mar 26 '18 at 12:05
  • 1
    What is that "tiny 64kb computer" by the way? Maybe one which has a built-in machine language monitor, so it actually already has an assembler (a software) in ROM? (Making it seem a bit like the thing is indeed processing by the text mnemonics) – Jubatian Mar 26 '18 at 12:37

2 Answers

4

An assembler is a software program that reads text and writes binary. It's not "special" in any way. It doesn't run as you type or anything.

CPUs run machine code stored in RAM or ROM chips. Assemblers are just convenient ways to generate the binary data, which you can then feed into an EEPROM or flash programming machine (for example) to make a chip with code in it. Or if running on the same computer, to assemble into RAM or into a file.
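To make "reads text and writes binary" concrete, here is a minimal Python sketch of an assembler that handles only the LC-3 `ADD` instruction. It is an illustration, not any real LC-3 tool: the names `OPCODES`, `reg` and `assemble_add` are made up for this example, and a real assembler would also handle labels, directives, the other opcodes and error reporting.

```python
import re

# Only ADD is handled in this sketch; its LC-3 opcode is 0001.
OPCODES = {"ADD": 0b0001}

def reg(token):
    """Turn a register name such as 'R3' into its 3-bit number."""
    m = re.fullmatch(r"R([0-7])", token.upper())
    if not m:
        raise ValueError(f"not a register: {token!r}")
    return int(m.group(1))

def assemble_add(line):
    """Assemble 'ADD DR, SR1, SR2' or 'ADD DR, SR1, #imm5' into a 16-bit word."""
    tokens = [t for t in re.split(r"[,\s]+", line.strip()) if t]
    mnemonic, dr, sr1, last = tokens
    # Bits 15-12: opcode, bits 11-9: DR, bits 8-6: SR1.
    word = OPCODES[mnemonic.upper()] << 12 | reg(dr) << 9 | reg(sr1) << 6
    if last.startswith("#"):
        imm = int(last[1:])
        if not -16 <= imm <= 15:
            raise ValueError("imm5 out of range")
        # Bit 5 set selects the immediate form; bits 4-0 hold imm5.
        word |= 1 << 5 | (imm & 0x1F)
    else:
        # Bit 5 clear selects the register form; bits 2-0 hold SR2.
        word |= reg(last)
    return word

for src in ("ADD R3, R1, R2", "ADD R3, R3, #1"):
    print(f"{src:16} -> {assemble_add(src):016b}")
```

The text `ADD` is never "understood" by hardware: the program compares strings, looks numbers up in a table, and shifts them into place. The second line produces the `0001 011 011 1 00001` pattern quoted in the question.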

To bootstrap a new platform, you typically write an assembler for it on a different computer, and use that to generate binary files (or ROM / flash chips) containing machine code for the new system.

For a microcontroller, this is the normal workflow: develop on a desktop, build an image (assemble), flash it onto the embedded system with hardware connected to the desktop, then boot it on the microcontroller. With a "toy" computer like LC-3, the process would be the same. You typically wouldn't bother writing an assembler that can run on the LC-3. Although you certainly could; 64 KiB of RAM is plenty, and I think LC-3 is capable enough with bitwise ops (unlike MARIE or some other over-simplified teaching architectures) that it wouldn't take ridiculous amounts of code to do normal things like encode operands into bits.

The very first assembler to be written had to be written in machine code, maybe on punch cards, or by flipping switches on the console of a machine to create binary codes. Some hardware which humans could interact with, and which produced the desired digital logic 0s and 1s using just hardware. Lots of software was written before the first assembler existed: very early computers were so rare that you didn't use them for one-time text processing tasks; you can do that by hand!

Related: The Story of Mel is an excellent true story about a guy learning programming in assembly, working with an expert veteran who wrote his programs directly in machine-code, on a drum-memory computer in the 1960s. Definitely worth reading, there's an interesting ethical conundrum, too. Anyway, might give you a bit of an idea about programming without an assembler.


Related: How was the first assembler for a new home computer platform written? on retrocomputing.SE has some answers that might help you grok things, and specifically these comments describing the exact process of creating machine code without an assembler:

In the early days memorising binary instructions was useful because a lot of computers allowed you to alter RAM contents using physical switches on a control panel. Indeed, entering a bootloader by directly toggling RAM values was standard boot procedure on some early computers. - slebetman

@slebetman I remember doing that in one of my earlier mainframe operator jobs. We used a manually entered bootloader to load further instructions from punched card, and the punched cards contained a bootstrap that allowed us to load a full OS from a drum hard disk. Good times... – Rob Moir (later in the same thread)


Related stuff about the layers of CPU design between transistor physics and assembly language.

Peter Cordes
  • This is a fantastic response! I was using an LC3 simulator to test out assembly code, but it definitely makes sense that you would create an executable with all the machine code which would then be fed into the LC3 itself for whatever our intended result should be. If I understand correctly, a program that counts the occurrences of "K" in a file would already be "assembled", created as an executable, then fed in as machine code to the LC3 for the result. Do you have any reading recommendations for learning about how the text from the keyboard gets turned into the actual executable? – Michael W Mar 28 '18 at 20:10
  • @MichaelW: Keyboards don't produce text, they produce scancodes / keycodes, at least the way they're hooked up in PCs. The keyboard interrupt handler in the OS decides what to do on each key-down / key-up event. A thing that you can type on and produce text directly (e.g. shift+`a` => `A` instead of key-down / key-up scancodes for two different keys) is called a *terminal* or teletype. In old-school Unix days, these were physically separate pieces of equipment with screen or printer + keyboard, and send / receive ASCII over a serial port. – Peter Cordes Mar 28 '18 at 21:07
  • 1
    Anyway, many layers of SW between typing and saving text in a `.asm` file, and then you'd run an assembler on that file to get an executable. related stuff about terminals: https://retrocomputing.stackexchange.com/questions/2915/when-you-type-on-a-computer-terminal-how-are-the-characters-displayed-on-the-sc?rq=1 and also https://unix.stackexchange.com/questions/4126/what-is-the-exact-difference-between-a-terminal-a-shell-a-tty-and-a-con. and [How does the keyboard input get into the terminal?](https://stackoverflow.com/questions/6747046/how-does-the-keyboard-input-get-into-the-terminal) – Peter Cordes Mar 28 '18 at 21:07
1

Please refer to the comments above first, and you will understand that an assembly source file is not converted to binary at runtime. (An assembler merely translates text strings into specific byte sequences!) Below I add some explanation of how our PC gets to the point of executing native machine code.

  1. We press the power button. The capacitor-related circuit discharges/charges the CPU's reset pin.

  2. The CPU resets itself. It sets its program counter to the BIOS's boot-up program.

  3. The BIOS executes.

    The BIOS does some essential things needed to operate our PC.

  4. The BIOS loads the boot loader into memory and calls it.

    The BIOS reads the first 512-byte sector of the boot device and checks whether its last two bytes (offsets 510 and 511) are 0x55 0xAA, i.e. 01010101 10101010 in binary, which marks the sector as a boot sector. If so, the BIOS loads the sector at address 0x7C00 and jumps to 0x7C00.

  5. Boot loader executes.

    It initializes peripheral devices like the PIC, video card, etc. It sets up the A20 gate so that memory above 1 MB can be used. It also loads the rest of the kernel that could not be loaded earlier because of the boot sector's size limit. It also switches the CPU to 32-bit or 64-bit mode.

  6. The operating system initializes itself.

    It initializes the IDT, GDT, timers, and the data structures it needs to operate, and loads and parses the filesystem into memory.

  7. Now you see "welcome" message.

  8. Now you create a file named test.asm.

    C:
          XOR EAX, EAX
          NOP
         JMP C
    

    And your test.asm will look like this in binary (hex), with each character stored as its ASCII code:

       43 3a 0d 0a 20 20 20 20 20 20 20 20 20 20 20 20 20 20 58 4f 52 20 45 41 58 2c 20 45 41 58 0d 0a 20 20 20 20 20 20 20 20 20 20 20 20 20 20 4e 4f 50 0d 0a 20 20 20 20 20 20 20 20 20 20 20 20 20 4a 4d 50 20 43
    
  9. You assemble this with assembler.

    (I assembled this manually, so don't trust my byte codes...)

    The assembler output can be, e.g.:

    31 C0 90 EB FB
    

    The point is that the bytes of your source file and the bytes of the assembled binary file are completely different. (An assembler merely translates text strings into specific byte sequences!)

  10. And how the bytes are interpreted by the CPU (e.g. a Reduced Instruction Set Computer with 32-bit instructions, such as old MIPS):

    In short, the ALU is just a calculator, and machine language is data that tells the calculator which operation to perform on which operands. The CPU fetches the instruction bytes that the PC register points to and interprets them as groups of bits. The first group (bits 0 to 5 here, counting from the most significant bit) tells the calculator what to do (e.g. ADD). The following bits specify the operands the operation needs: a register id, a memory address, or a constant. For a simple example, assume that the instruction id of ADD is 011101, the register AX has an id of 00001, and the instructions of this CPU have the following 32-bit structure:

     op   rs    rt    rd  shamt funct
     0-5 6-10 11-15 16-20 21-25 26-31
    
     op is the operator id, rs and rt are operands 1 and 2 respectively, and rd is the destination of the operation. shamt and funct are used for special purposes.
    

    When you assemble the instruction ADD AX AX AX, the assembler uses the information obtained from this line (op = 011101, rs = 00001, rt = 00001, rd = 00001, shamt = 00000, funct = 000000), and

      01110100001000010000100000000000 (74 21 08 00)
    

    can be created. A hex editor will show 74 21 08 00, but the CPU reads it as 011101 00001 00001 00001 00000 000000. It selects 011101 as the ALU operation, and register 00001 in the register file as both operand 1 (rs) and operand 2 (rt) of the ALU. When the ALU completes the calculation, the register file stores the result in the register named by rd, 00001. The CPU then adds 4 to the PC register and the process repeats.
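The field splitting described in step 10 can be sketched in a few lines of Python. This only illustrates the made-up 32-bit format above; `FIELDS` and `decode` are hypothetical names, and real hardware does this with wiring rather than a loop.

```python
# Field layout of the hypothetical 32-bit instruction, most significant bits first.
FIELDS = [("op", 6), ("rs", 5), ("rt", 5), ("rd", 5), ("shamt", 5), ("funct", 6)]

def decode(word):
    """Split a 32-bit instruction word into its named bit fields."""
    fields = {}
    shift = 32
    for name, width in FIELDS:
        shift -= width                      # position of this field's lowest bit
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields

# The example word for ADD AX AX AX, written as its 32 bits.
word = int("01110100001000010000100000000000", 2)
for name, value in decode(word).items():
    print(f"{name:5} = {value:b}")
```

Each field is just a shift and a mask, which is why a fixed-format instruction set is easy to decode both in software and in logic gates.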

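So here's a toy assembler in Java for the made-up instruction format above. (Just for understanding; labels and jumps are intentionally omitted for simplicity, and the operand order rd, rs, rt is an assumption of this sketch.)

import java.util.Map;

public class ToyAssembler {
    // Hypothetical opcode and register tables for the made-up CPU above.
    static final Map<String, Integer> OPCODES = Map.of("ADD", 0b011101);
    static final Map<String, Integer> REGISTERS = Map.of("AX", 0b00001);

    // Encode one line such as "ADD AX, AX, AX" into a 32-bit instruction word.
    // An unknown mnemonic or register makes the lookup fail with an exception.
    static int assemble(String line) {
        String[] parsed = line.trim().split("[,\\s]+"); // split on commas and spaces
        int opcode = OPCODES.get(parsed[0].toUpperCase());
        int rd = REGISTERS.get(parsed[1].toUpperCase());
        int rs = REGISTERS.get(parsed[2].toUpperCase());
        int rt = REGISTERS.get(parsed[3].toUpperCase());
        int word = opcode << 26; // op:  6 bits at the top of the word
        word |= rs << 21;        // rs:  next 5 bits
        word |= rt << 16;        // rt:  next 5 bits
        word |= rd << 11;        // rd:  next 5 bits; shamt and funct stay 0
        return word;
    }

    public static void main(String[] args) {
        String[] fileContent = {"ADD AX, AX, AX"};
        for (String line : fileContent) {
            System.out.printf("%08X%n", assemble(line)); // prints 74210800
        }
    }
}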
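The label and jump handling that was left out can be sketched with the classic two-pass approach, using the three-instruction x86 example from step 8. This is a minimal Python illustration under loud assumptions: `ENCODINGS`, `JMP_SHORT_SIZE` and `assemble` are hypothetical names, only these exact instruction spellings are recognized, 32-bit mode is assumed, and every jump is assumed to fit the short form (opcode EB plus a signed 8-bit displacement).

```python
# Fixed encodings for the non-jump instructions of the example (32-bit x86 mode).
ENCODINGS = {
    "XOR EAX, EAX": bytes([0x31, 0xC0]),
    "NOP": bytes([0x90]),
}
JMP_SHORT_SIZE = 2  # opcode EB + one displacement byte

def assemble(source):
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]

    # Pass 1: record the address of every label by adding up instruction sizes.
    labels, address = {}, 0
    for ln in lines:
        if ln.endswith(":"):
            labels[ln[:-1]] = address
        elif ln.upper().startswith("JMP "):
            address += JMP_SHORT_SIZE
        else:
            address += len(ENCODINGS[ln.upper()])

    # Pass 2: emit bytes, now that every label address is known.
    out = bytearray()
    for ln in lines:
        if ln.endswith(":"):
            continue
        if ln.upper().startswith("JMP "):
            target = labels[ln.split()[1]]
            # The displacement is relative to the end of the JMP instruction.
            rel = target - (len(out) + JMP_SHORT_SIZE)
            if not -128 <= rel <= 127:
                raise ValueError("target too far for a short jump")
            out += bytes([0xEB, rel & 0xFF])
        else:
            out += ENCODINGS[ln.upper()]
    return bytes(out)

program = """
C:
    XOR EAX, EAX
    NOP
    JMP C
"""
print(assemble(program).hex(" "))
```

The first pass exists only because a jump may refer to a label defined later in the file; for a backward jump like this one, the displacement here works out to -5, stored as the byte FB.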

Supplementary readings

RISC/CISC

CPUs are often classified as CISC or RISC, and one visible difference is instruction length. CISC instructions can have a variable number of bytes per instruction. For example, on x86, RET is C3, NOP is 90, and INT 3 is CC (1 byte each), but JMP with a 32-bit displacement is E9 xx xx xx xx (5 bytes), and so on. As the name "complex" suggests, the internal structure is complex and hard to implement. The benefit is that CISC code can use memory efficiently, and the CPU can support numerous instructions, some of which do a lot of work in a single instruction.

RISC (Reduced instruction set computer)

Unlike CISC, a RISC has a fixed instruction length, like 32 bits per instruction on a 32-bit computer. What I explained above was RISC. RISC instructions can easily be split into bit fields to find the operator and operands. The benefit is that the implementation is so much simpler than CISC that even I could understand it by studying. The drawback is the limited number of operators that can be supported.
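The difference in finding instruction boundaries can be sketched in Python. This is only an illustration: `X86_LENGTHS`, `split_variable` and `split_fixed` are hypothetical names, the table covers just a handful of one-opcode-byte x86 instructions in 32-bit mode, and real x86 decoding also has to deal with prefixes, ModRM bytes and much more.

```python
# Total instruction length in bytes, keyed by the first opcode byte (tiny x86 subset).
X86_LENGTHS = {0xC3: 1, 0x90: 1, 0xCC: 1, 0xEB: 2, 0xE9: 5}

def split_variable(code):
    """Variable length: the first byte must be examined to find the next instruction."""
    out, i = [], 0
    while i < len(code):
        n = X86_LENGTHS[code[i]]
        out.append(code[i:i + n])
        i += n
    return out

def split_fixed(code, size=4):
    """Fixed length: every instruction is exactly `size` bytes, no lookup needed."""
    return [code[i:i + size] for i in range(0, len(code), size)]

cisc = bytes.fromhex("90 e9 00 00 00 00 cc c3")   # NOP, JMP rel32, INT 3, RET
risc = bytes.fromhex("74210800 74210800")          # two words of the made-up format

print([b.hex() for b in split_variable(cisc)])
print([b.hex() for b in split_fixed(risc)])
```

With fixed-length instructions the boundaries are known in advance; with variable-length ones, each instruction has to be at least partly decoded before the start of the next one is known.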

More materials for CISC/RISC difference

Sorry, but I only know about 32-bit computers and am a newbie in 64-bit assembly.

I hope my answer was helpful, though it comes from the limited understanding I gained when I was curious about the whole scheme, from silicon to applications, like you.

KYHSGeekCode
  • 3
    You put a great effort into making a list of points (that's good!) but each point is grossly oversimplified or incorrect and I don't see how this answers the question. – Margaret Bloom Mar 26 '18 at 12:37
  • @MargaretBloom thanks I am improving mine gradually. – KYHSGeekCode Mar 26 '18 at 12:53
  • 1
    intel syntax has the destination on the left. B8 00 00 00 00 is `mov eax,0`, but [you should have used `xor eax,eax`](https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and). Or `pause`. Anyway, that's the output machine code, but for anything other than DOS `.com` files, the output *file* will have metadata before that. – Peter Cordes Mar 26 '18 at 14:15
  • 1
    And no, the bootloader doesn't load kernel modules or "init memory". The BIOS itself does mobo-specific things like that. A bootloader that loads Linux just loads a self-extracting compressed image of the kernel and an initial ramdisk, it doesn't care what's in those images. And of course modern PCs boot with EFI (so the firmware switches to 32-bit or 64-bit mode before running code from the hard drive), with legacy BIOS only as a fallback. Init of most hardware happens in the firmware/BIOS, then again after the kernel boots and its real drivers load. – Peter Cordes Mar 26 '18 at 14:18
  • @PeterCordes XOR EAX,EAX is nicer, but *should*? I thought it performs better but not demanded.. – KYHSGeekCode Mar 26 '18 at 16:21
  • @PeterCordes Thank you, but I can't understand much because I learned OS development from an outdated book (from when 32 bits were standard) that I found in a local library in 2014, when I was 15 :( – KYHSGeekCode Mar 26 '18 at 16:33
  • 2
    I said "should" not "must". It's a better choice in every case where it's ok to clobber flags. `mov eax,0` still works. Inside a useless infinite loop it doesn't matter at all, but it's just unusual to use `mov eax,0` instead of zeroing a register the normal way with `xor`. – Peter Cordes Mar 26 '18 at 16:37
  • 1
    Everything I said about bootloaders has been true for the 20 years I've been using Linux, and wasn't new then. EFI is new, but even before that it was still true that hardware init is done by the BIOS (which knows what hardware the motherboard has) and boot-roms on add-in cards. Then the kernel drivers might re-init after boot. But the bootloader just uses portable BIOS interfaces to load the kernel, and doesn't have to know what hardware it's on. – Peter Cordes Mar 26 '18 at 16:40
  • @PeterCordes Oh, I didn't know that. But I was not smart enough to study the Linux kernel then. I read a book about how to make a program that runs without an OS (Haribote, the "OS making in 30 days" project by Kawai Hidemi, developer of OSASK). Anyway, I appreciate your help and comments, as I could correct my misunderstandings. – KYHSGeekCode Mar 26 '18 at 16:44