Counting an Intel 8086's clock cycles

Question

I've been working on an Intel 8086 emulator for about a month now. I've decided to start counting cycles to make emulation more accurate and synchronize it correctly with the PIT.

The clock cycles used for each instruction are detailed in Intel's User Manual but I'd like to know how they're calculated. For example, I've deduced the following steps for the XCHG mem8,reg8 instruction - which takes exactly 17 clock cycles according to the manual:

decode the second byte of the instruction: +1 cycle;
transfer first operand from memory into a temporary location: +7 cycles;
transfer second operand from register into memory destination: +8 cycles;
transfer first operand from temporary location into register destination: +1 cycle.
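
The breakdown above can be written down as a small table to check that it adds up to the documented total. This is only a sketch of the hypothesis in this question: the step names are illustrative, and the per-step costs are my guesses, not figures from Intel's manual.

```python
# Hypothesized per-step costs for XCHG mem8,reg8.
# Assumption: these are the asker's guesses, not data from Intel's manual.
XCHG_MEM8_REG8_STEPS = [
    ("decode second instruction byte", 1),
    ("memory operand -> temporary location", 7),
    ("register operand -> memory destination", 8),
    ("temporary location -> register destination", 1),
]

DOCUMENTED_CLOCKS = 17  # total quoted in the question for XCHG mem8,reg8


def total_clocks(steps):
    """Sum the clock cost of every (name, cost) step in a breakdown."""
    return sum(cost for _name, cost in steps)


if __name__ == "__main__":
    total = total_clocks(XCHG_MEM8_REG8_STEPS)
    print(total, total == DOCUMENTED_CLOCKS)
```

A breakdown that sums to the right total is not necessarily the right breakdown, which is why the same reasoning can fail for other instructions.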

But I'm probably completely wrong as my reasoning doesn't seem to work for all instructions. For instance, I can't comprehend why the PUSH reg instruction takes 11 clock cycles, whereas the POP reg instruction only takes 8 clock cycles.

So, could you tell me how clock cycles are spent in each instruction, or rather a general method to understand where those numbers come from?

Thank you.

@downvoter Could you tell me what was so wrong with my question that you had to downvote me? — neat, Apr 10 '15 at 11:00
@AlexanderZhak I did. But 8086tiny doesn't count clock cycles, only instructions. — neat, Apr 10 '15 at 11:06
`PUSH` is basically a `MOV` from register to memory. `POP` is a `MOV` from memory to register. From the tables, the former is 9+EA, the latter 8+EA. Since you can POP with 0 EA (the stack pointer is already pointing to where you will POP from), the read can start immediately, and the stack pointer increment can (I guess) overlap the read cycle once the address is no longer needed. For the PUSH operation the EA is 2, since the stack pointer must be decremented before issuing the MOV. I would suppose this is where the extra cycles come from. This is only speculation. I don't know this for certain. — J..., Apr 10 '15 at 11:15
As for the -1, it didn't come from me, but I would guess it is because the question is too broad. ie: "Explain how clock cycles are counted for ALL x86 instructions" -- this is a huge question. If you narrow the scope to simply "Why do PUSH and POP require a different number of cycles" this would be more appropriate and on-topic. — J..., Apr 10 '15 at 11:17
@J... It makes sense. Thank you sir! Do you happen to know why the `MOV` from register to memory takes 9 clock cycles, but `MOV` from memory to register only 8? And how it breaks down in terms of clock cycles per sub-operation? — neat, Apr 10 '15 at 11:24
@NeatMonster I don't, but that's a great reference PDF you have there... I suspect the answers are in there somewhere! — J..., Apr 10 '15 at 11:27
@J... The `PUSH` and `POP` thing was just an example I stumbled upon. I haven't found any good information on how each instruction breaks down, but I guess that there is a general way to calculate the clock cycles per instruction. For instance, I know that reading or writing a value in memory takes 4 clock cycles, plus 4 extra if it is a word at an odd address. — neat, Apr 10 '15 at 11:27
@J... I've read the manual linked in my OP. Besides the tables, it doesn't give much more information. I've also read [this one](http://www.ic.unicamp.br/~pannain/mc404/aulas/pdfs/Art%20Of%20Intel%20x86%20Assembly.pdf) but the explanations (page 107) do not match my observations nor the first one. — neat, Apr 10 '15 at 11:30
The book you want is Michael Abrash's "Zen of Assembly Language," which is long out of print but still available on the used market. As I recall, you can't easily break those timings into sub-operations. In addition, there are hidden costs such as instruction prefetch and DMA refresh (admittedly platform specific) that you have to take into account. The official instruction timings tell you what's happening on the CPU but they're "best case," assuming that the supporting hardware doesn't add anything. — Jim Mischel, Apr 10 '15 at 18:50
@JimMischel I've found an [online version](http://www.jagregory.com/abrash-zen-of-asm/) of the book. Great reading, thank you sir! — neat, Apr 11 '15 at 12:07
Related: https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html has timing tables for 8086 / 8088 up through Pentium. For later CPUs, see https://agner.org/optimize/. (Overall speed is no longer just a sum of separate times for each instruction, thanks to pipelined and out-of-order exec.) — Peter Cordes, May 05 '21 at 12:01

score 4 · Accepted Answer · edited May 23 '17 at 12:25

How cycles are calculated, and what the clock actually does, was a mystery to me as well until I had the chance to work with hardware engineers and see what kind of models they use. The answer lies in the hardware.

A CPU is a parallel machine. Although its design is usually described to programmers in simplified terms (the pipeline, the microinstructions needed to implement an instruction, and so on), the CPU remains a parallel machine.

For an instruction to complete, many tiny bit-sized signals must flow from one end of the chip to the other. At some points the processing units must wait until all their input bits have arrived. This coordinated movement from one stage to the next is driven by the clock signal, which is distributed centrally to all parts of the chip. Each such move, paced by the clock signal, is called a cycle.

So to know how many cycles are really needed to finish the work, you must take into account how the wires are connected, where the bits must flow, and where and how many synchronization points are required.
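
As a toy illustration of the idea above (not a model of the 8086's actual circuitry), the sketch below treats each synchronization point as a latch and moves a value one latch forward per clock tick. The function name and the single-value pipeline are assumptions made purely for illustration.

```python
def cycles_to_complete(num_latches):
    """Count clock ticks until a value has reached the last latch.

    Each list slot stands for one synchronization point. On every tick,
    all latches hand their content to the next one at the same time.
    """
    if num_latches < 1:
        raise ValueError("need at least one latch")

    latches = [None] * num_latches
    incoming = "operand"  # value waiting at the input of the first latch
    ticks = 0
    while latches[-1] is None:
        # One clock tick: shift every latch forward by one position.
        latches = [incoming] + latches[:-1]
        incoming = None
        ticks += 1
    return ticks


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, "latches ->", cycles_to_complete(n), "cycles")
```

In this simplified model the cycle count equals the number of synchronization points on the path. Real hardware has many paths of different lengths running in parallel, which is why the counts differ from instruction to instruction.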

I doubt that the Intel 8086 schematic is publicly available, and even if it were, I doubt it would be readable. But the only correct answer lies there. Everything else is a simplification, and to reproduce the exact hardware behavior in software you would have to simulate or interpret the CPU's hardware.

See also:

Electrical Engineering Stack Exchange: Map processor to circuit diagram
OpenCores.org: Processor - contains descriptions of various CPUs in various hardware description languages, where you can see exactly how the clock signal is used

I thought, rightly or wrongly, that there were some repeating patterns in the way the instructions are executed (w/o taking the "fetching into the queue" part in consideration). But I guess it is far too complicated for what I'm trying to accomplish so I went ahead and implemented the clock cycles taken from the User Guide into [my emulator](https://github.com/NeatMonster/Intel8086/commit/68e2cb69bebb67e0fedf9028af6b834379220083) and it seems to work _okay_. Many thanks! — neat, Apr 10 '15 at 17:20
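
A table-driven approach like the one described in the comment above could be sketched as follows. This is not the code from the linked emulator. It assumes IBM PC-style clocking, where the PIT input runs at one quarter of the CPU clock, and the table holds only the two clock counts quoted in this thread. `CycleCounter` and its method names are hypothetical.

```python
# Assumption: IBM PC-style clocking, one PIT tick per 4 CPU clock cycles.
CPU_CYCLES_PER_PIT_TICK = 4

# Clock counts quoted in this thread; a real table would cover every opcode.
CLOCKS = {"PUSH reg": 11, "POP reg": 8}


class CycleCounter:
    """Accumulates CPU cycles and reports how many PIT ticks become due."""

    def __init__(self):
        self.total_cycles = 0
        self._pit_remainder = 0  # CPU cycles not yet converted to PIT ticks

    def execute(self, instruction):
        """Charge one instruction and return the PIT ticks that are now due."""
        cycles = CLOCKS[instruction]
        self.total_cycles += cycles
        self._pit_remainder += cycles
        pit_ticks, self._pit_remainder = divmod(
            self._pit_remainder, CPU_CYCLES_PER_PIT_TICK
        )
        return pit_ticks


if __name__ == "__main__":
    counter = CycleCounter()
    for name in ("PUSH reg", "POP reg"):
        print(name, "->", counter.execute(name), "PIT tick(s) due")
    print("total CPU cycles:", counter.total_cycles)
```

Carrying the remainder between instructions keeps the PIT from drifting when an instruction's clock count is not a multiple of the divider.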

score 2 · Answer 2 · answered Apr 10 '15 at 11:26

The question is quite broad so I will only address the PUSH vs POP question here.

PUSH is basically a MOV from register to memory (plus a stack pointer decrement). POP is a MOV from memory to register (plus a stack pointer increment).

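If you look at page 2-61 you will find:

MOV

| Operands | Clocks | Transfers | Bytes | Coding example |
|---|---|---|---|---|
| register, memory | 8+EA | 1 | 2-4 | MOV BP, STACK_TOP |
| memory, register | 9+EA | 1 | 2-4 | MOV COUNT [DI], CX |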

For the POP operation, the stack pointer already holds the address to read from, so the effective address (EA) cost is zero. The read can start immediately, and I can only assume that the dedicated POP operation increments the stack pointer at the same time, somewhere in the later clock cycles of the read once the address is no longer needed.

For the PUSH operation you have an EA of 2, since the stack pointer must be decremented before it yields the address for the write. No concurrency can be leveraged here, so you have the 9 cycles for the MOV plus, seemingly, two for the effective address calculation (the stack pointer decrement).
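
The "base + EA" notation in the table can be turned into a lookup. This is a minimal sketch: the base clocks come from the table quoted above, while the EA costs are an assumption recalled from the manual's effective-address table and should be checked against it. The dictionary keys and the `mov_clocks` helper are hypothetical names.

```python
# Base clocks for MOV, taken from the table quoted in this answer.
MOV_BASE_CLOCKS = {
    "register, memory": 8,
    "memory, register": 9,
}

# Assumption: EA costs as recalled from the manual's effective-address
# table; verify them against the manual before relying on them.
EA_CLOCKS = {
    "displacement": 6,                # direct address, e.g. STACK_TOP
    "base_or_index": 5,               # e.g. [DI]
    "displacement+base_or_index": 9,  # e.g. COUNT [DI]
}


def mov_clocks(operands, addressing_mode):
    """Return base clocks plus the EA cost for one MOV form."""
    return MOV_BASE_CLOCKS[operands] + EA_CLOCKS[addressing_mode]


if __name__ == "__main__":
    # MOV COUNT [DI], CX: memory destination, displacement plus index.
    print(mov_clocks("memory, register", "displacement+base_or_index"))
    # MOV BP, STACK_TOP: register destination, direct address.
    print(mov_clocks("register, memory", "displacement"))
```

Under these assumptions, `MOV COUNT [DI], CX` comes to 9 + 9 = 18 clocks and `MOV BP, STACK_TOP` to 8 + 6 = 14.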

answered Apr 10 '15 at 11:26 by J...

Speaking about `mov` only: on the 8086, reg to mem takes one cycle more than mem to reg (from the quote you gave), 9+EA against 8+EA. But on the 286, mem to reg takes two cycles more than reg to mem (5 against 3). At first I thought that writing to memory was slower than reading, but that doesn't make sense for the 286... Going to read the datasheets... – Alexander Zhak Apr 10 '15 at 11:36
Again, on the 386, writing to memory is two clocks faster than reading from it. – Alexander Zhak Apr 10 '15 at 11:40
@AlexanderZhak What could explain this difference between reading and writing on the 8086? Was I on the right track with my `XCHG`'s clock cycles breakdown? – neat Apr 10 '15 at 11:50
@AlexanderZhak Yes, well OP's question was about the 8086 so that's what I answered with. As for why, Intel has been improving things for years... suspect you'd need rather deep information about the CPU architecture to answer what, exactly, they did to change things. Newer CPUs have latency and throughput to consider - I think the 8086 did not do anything in the way of pipelining so the instruction times, I think, include the full sequence of fetch, decode, read/write, etc. – J... Apr 10 '15 at 11:51
@NeatMonster It sounds like you have a *lot* of questions. The appropriate way to use Stack Overflow is to think about what you are doing and try to break down your problem into individual questions. You should ask each one as a separate question here - one question, one answer. That's how it works. This isn't a help forum - continuing to ask more and more questions in comments is not how the site works. – J... Apr 10 '15 at 11:55
@J... I can't be more specific than "could you give me a general method to understand how clock cycles are counted." Is there another website that would be more receptive to such questions? – neat Apr 10 '15 at 12:01
@NeatMonster I don't think that there is a general answer to that question - I think each instruction will have its own story to tell and each one would require microscopic analysis of the CPU architecture. Consider that the 800+ page manual for the CPU doesn't answer the question adequately - this implies a very deep question that is not easy to answer in a few short paragraphs. A suitable next question could be "Why does 8086 MOV require 8 cycles for r,m but 9 for m,r?", for example. It is concise, answerable, and might get you closer to an answer to your bigger question. – J... Apr 10 '15 at 12:05
