The pipeline stages aren't architectural (e.g., high-performance PowerPC CPUs can have longer pipelines without changing anything software-visible), so they wouldn't be part of single-stepping unless you're using a software simulator that shows instructions flowing through a simulated CPU.
If you're stopped at a breakpoint when you set another breakpoint or single-step, none of the instructions in the code being debugged will be in flight in the pipeline. The CPU will be asleep or running the debugger's code.
Also, PowerPC doesn't have coherent instruction caches, so self-modifying code that ran a `stw` which overwrote a few instructions after itself would not necessarily result in the new instruction being fetched, even if it was more than 5 instructions later. For that to happen reliably, you'd need a sequence of instructions like `dcbf` and `msync` to flush the modified line of D-cache out to a level instruction fetch can see, then `icbi` to invalidate that line of I-cache, then `msync` and `isync` to make sure all of that completes before any further instruction fetch. For example, https://www.nxp.com/docs/en/application-note/AN3441.pdf documents what you should do in section 2.2, Instruction Cache Coherency, at least for that specific implementation of PowerPC. (I think Freescale's PowerQUICC™ III is a PowerPC; this doc is what Google found.)
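Concretely, the per-cache-block sequence looks something like this (my sketch, not lifted from the app note; the register choice and single-block scope are assumptions — for a larger range you'd loop the `dcbf`/`icbi` pair over every cache block it covers):

```
# r3 holds the address of the modified cache block
dcbf   0, r3     # write the modified data block back to memory
msync            # wait until that store is visible beyond the D-cache
icbi   0, r3     # invalidate the stale copy in the I-cache
msync            # order the icbi before the context synchronization
isync            # discard any prefetched instructions and refetch
```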
Strangely, GNU C's `__builtin___clear_cache(start, end)` doesn't emit anything with GCC for PowerPC (https://godbolt.org/z/nbre7Y). Possibly because the recommended procedure involves marking the page non-executable, which you can't do from user-space without a system call?
A debugger modifying the memory of a process while it's stopped has an easier time: OSes already have to make sure that pages loaded into memory, or modified with `ptrace`, are safe to execute code from, so the debugger can leave most of the flushing to the OS.
When the kernel returns to user-space in the process being debugged, if the first instruction (or any later one) it fetches is a debug-trap / software-breakpoint instruction, it will trap.