
I am learning how a program runs on gem5 and have read some books, but I am still confused about some parts of program execution. Is my understanding below correct?

  1. First, the program's instructions are placed in the ICache. The CPU fetches instructions from it and puts them into the instruction queue. Instructions are then taken from the instruction queue and decoded into micro-operations, which are sent to the reorder buffer. If a micro-operation in the reorder buffer is a load or store, it is sent to the load/store queue. If it is an operation such as addition or subtraction, it is sent directly to an execution unit. In this process, each execution unit has a reservation station for register renaming. When a micro-operation completes, it returns to the reorder buffer. When it reaches the head of the reorder buffer, its result can be written back to the cache or to memory outside the CPU.
  2. The load queue fetches data from the cache. The cache is generally virtually indexed and physically tagged, so a load accesses the cache and performs virtual address translation in parallel. If the load misses in the cache, it is sent to an MSHR. The MSHR sends the request to memory. If the data is in memory, it is first fetched into the cache, then returned to the MSHR, and then returned to the load queue. If the data is not in memory, the operating system handles a page fault: the data is fetched from the hard disk into memory, then into the cache, and then returned to the MSHR and finally to the load queue.
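
The miss path in point 2 can be sketched as a toy model. This is only an illustration under simplifying assumptions, not gem5 code: `ToyCache`, `load`, `fill`, and the 64-byte `LINE_SIZE` are hypothetical names and values, address translation and page faults are left out, and memory is a plain dictionary.

```python
# Toy model of a load miss: cache lookup -> MSHR -> fill from memory.
# All names are hypothetical; this is not how gem5 implements its caches.
LINE_SIZE = 64  # bytes per cache line (assumed)


class ToyCache:
    def __init__(self, memory):
        self.memory = memory  # dict: line address -> data, stands in for DRAM
        self.lines = {}       # line address -> data currently in the cache
        self.mshrs = {}       # line address -> ids of loads waiting on that line

    def load(self, load_id, addr):
        """Return the data on a hit, or None after recording the miss."""
        line = addr // LINE_SIZE * LINE_SIZE
        if line in self.lines:
            return self.lines[line]
        # Miss: a later miss to the same line merges into the existing MSHR
        # instead of sending a second request to memory.
        self.mshrs.setdefault(line, []).append(load_id)
        return None

    def fill(self, line):
        """Memory responded: install the line, return the loads to wake up."""
        self.lines[line] = self.memory[line]
        return self.mshrs.pop(line, [])


cache = ToyCache({0: "A", 64: "B"})
first = cache.load(1, 8)    # miss, allocates an MSHR for line 0
second = cache.load(2, 16)  # miss to the same line, merges into that MSHR
woken = cache.fill(0)       # the line arrives; both waiting loads are woken
third = cache.load(3, 8)    # the same address now hits
```

In this sketch the MSHR only tracks which loads are waiting on an outstanding line; the data itself goes into the cache on the fill and the waiting loads are then completed.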

Questions:

  1. Do micro-operations refer to operations such as mov and add after decoding?
  2. Does instruction issue refer to micro-operations being sent to the reorder buffer, or being sent from the reorder buffer to an execution unit? After an instruction is decoded, is it sent directly to the reorder buffer?
  3. Do dispatch and issue refer to the same process?
  4. I have also seen queues such as floating-point queues. Are these where instructions wait when there are not enough free execution units as instructions are sent from the reorder buffer to the execution units? Are these queues the same thing as the reservation stations in Tomasulo's algorithm?

[Image: pipeline diagram from a conference paper]

Gerrie
  • I assume you're talking about x86, given the mention of uops (micro-ops). Go read Agner Fog's microarch PDF, especially the section on Intel Core 2 or Nehalem sounds most like what you describe (no uop cache, but with a decoded-instruction queue to buffer between decode and rename/allocate into the back-end.) Optimizing for those real CPUs involves caring about front-end decode a lot, but you can mostly skip that section if the details don't match the front-end decode mechanism you're tuning for. Also other links in https://stackoverflow.com/tags/x86/info, esp. David Kanter's writeups. – Peter Cordes Oct 17 '20 at 14:18
  • Terminology: In Intel terminology, "issue" = moving a uop from the front-end into the out-of-order back-end, into the ROB and RS (Reservation Station, aka scheduler, which you're calling a queue. So maybe K8 / K10 would be a better model if you have separate FP and integer schedulers where uops wait for their inputs to be ready, and for a free execution port; Intel uses a unified scheduler (until Skylake?).) "Dispatch" = sending a uop from the scheduler to an execution unit. Non-x86 computer-architecture textbooks often use the opposite terminology. They're never synonyms. – Peter Cordes Oct 17 '20 at 14:22
  • See also (re: real x86 CPUs): [How are x86 uops scheduled, exactly?](https://stackoverflow.com/q/40681331), http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/, and maybe [Are load ops deallocated from the RS when they dispatch, complete or some other time?](https://stackoverflow.com/q/59905395) – Peter Cordes Oct 17 '20 at 14:26
  • Is the instruction sent from the reorder buffer to the reservation station? Can it be sent directly to the reservation station from the front end? Is there a scheduler both after decoding and after the reorder buffer? Does the scheduler after the reorder buffer refer to the reservation station (that is, the floating-point queue, etc.)? I added a picture from a top-conference paper below the question. The figure shows the queues being allocated after decoding; why are they not allocated when the reorder buffer distributes instructions? – Gerrie Oct 17 '20 at 14:39
  • My understanding is that the reorder buffer sends instructions to the reservation stations (that is, the queues of the various execution units), and these queues then dispatch the instructions to the execution units. But this seems inconsistent with that image. – Gerrie Oct 17 '20 at 14:39
  • *Is the instruction sent from the reorder buffer to the Reservation Station?* - No, it's sent to both by the issue stage, except for instructions like `nop` or `xor eax,eax` that don't need any back-end uops. In that case, the ROB entry can be marked as "already executed, ready to retire" initially, without having to wait for a completion signal for the corresponding RS entry. I've always found diagrams that show ROB -> RS -> execution units misleading because I don't think that's how the hardware really works, and uops have to stay in the ROB from issue until retirement. – Peter Cordes Oct 17 '20 at 14:46
  • Do you mean that instructions like nop or xor eax,eax are sent directly from the front end to the reservation station and marked as already executed in the reorder buffer, while micro-operations like add or mul are first sent to the reorder buffer, then sent from the reorder buffer to the reservation station, which then dispatches them to the execution unit? – Gerrie Oct 17 '20 at 14:51
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/223203/discussion-between-c-yj-and-peter-cordes). – Gerrie Oct 17 '20 at 14:55
  • No, `nop` and "eliminated" uops (no execution unit) go into the ROB, with no RS entries because nothing actually needs to execute. *Everything* goes in the ROB, that makes it possible to recover from mis-speculations (like branches or exceptions). See [Can x86's MOV really be "free"? Why can't I reproduce this at all?](//stackoverflow.com/q/44169342) and [What is the best way to set a register to zero in x86 assembly: xor, mov or and?](//stackoverflow.com/q/33666617). uops that do need an execution unit issue into the ROB *and* RS. The front-end can't issue them if either is full. – Peter Cordes Oct 17 '20 at 14:55
  • But seriously, if you haven't read Agner Fog's microarch guide explanation of PPro and Nehalem, go do that now. For comparison, https://www.realworldtech.com/barcelona/5/ describes the separate schedulers for each port that AMD K10 uses. It describes things in terms of instructions issuing from the ROB to schedulers, which again may not be accurate, or not a useful mental model. – Peter Cordes Oct 17 '20 at 15:02
  • BTW, this diagram (including a uop cache, an extra store-AGU on its own port, and 4 ALU ports) looks very much like Haswell / Skylake. https://www.realworldtech.com/sandy-bridge/. I assume this GEM CPU is intentionally designed to model Haswell / Skylake. – Peter Cordes Oct 18 '20 at 23:55
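
The issue / dispatch / retire flow described in the comments above (Intel-style terminology) can be sketched as a toy model. It is a simplification for illustration only: `ToyBackend` and its methods are hypothetical names, the buffer sizes are arbitrary, operands are assumed to be always ready, execution is instantaneous, and the scheduler simply picks the oldest waiting uop, whereas real hardware picks among the ready ones.

```python
from collections import deque


class ToyBackend:
    """Hypothetical toy out-of-order back-end: a ROB plus one unified RS."""

    def __init__(self, rob_size=4, rs_size=2):
        self.rob = deque()  # every issued uop, in program order
        self.rs = []        # uops still waiting for an execution unit
        self.rob_size = rob_size
        self.rs_size = rs_size

    def issue(self, name, needs_exec=True):
        """Front-end -> back-end. Returns False if the uop has to stall."""
        if len(self.rob) >= self.rob_size:
            return False
        if needs_exec and len(self.rs) >= self.rs_size:
            return False
        # A uop that needs no execution unit (nop, eliminated uop) is
        # marked complete at issue and gets no RS entry.
        entry = {"name": name, "done": not needs_exec}
        self.rob.append(entry)
        if needs_exec:
            self.rs.append(entry)
        return True

    def dispatch(self):
        """Scheduler -> execution unit. Frees the RS entry, not the ROB entry."""
        if not self.rs:
            return None
        entry = self.rs.pop(0)
        entry["done"] = True  # execution modelled as instantaneous
        return entry["name"]

    def retire(self):
        """Retire completed uops from the head of the ROB, in program order."""
        retired = []
        while self.rob and self.rob[0]["done"]:
            retired.append(self.rob.popleft()["name"])
        return retired


be = ToyBackend()
be.issue("add")
be.issue("nop", needs_exec=False)   # ROB entry only
be.issue("mul")
stalled = not be.issue("sub")       # RS is full, so the front-end stalls
before = be.retire()                # nothing retires: "add" is at the head
executed = be.dispatch()            # oldest RS entry goes to an execution unit
after = be.retire()                 # "add" retires, then the finished "nop"
retried = be.issue("sub")           # an RS slot is free again
```

The point of the sketch is that the ROB and the RS are filled side by side at issue rather than one feeding the other: the uop stays in the ROB until retirement, while its RS entry only exists until dispatch.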
