The pipeline stages aren't architectural (e.g., high-performance PowerPC CPUs can have longer pipelines without changing anything software-visible), so they wouldn't be part of single-stepping unless you're using a software simulator that shows instructions flowing through a simulated CPU.
If you're stopped at a breakpoint when you set another breakpoint or single-step, none of the instructions in the code being debugged will be in flight in the pipeline. The CPU will be asleep or running the debugger's code.
Also, PowerPC doesn't have coherent instruction caches, so self-modifying code that ran a `stw` which overwrote a few instructions after itself would not necessarily result in the new instruction being fetched, even if it was more than 5 instructions later. For that to happen reliably, you'd need a sequence of instructions like `dcbf` and `msync` to flush the modified line of D-cache out to a level instruction fetch can see, then `icbi` to invalidate that line of I-cache, then `msync` and `isync` to make sure all of that completes before any further instruction fetch. For example, https://www.nxp.com/docs/en/application-note/AN3441.pdf documents what you should do in section 2.2, Instruction Cache Coherency, at least for that specific implementation of PowerPC. (I think Freescale's PowerQUICC™ III is a PowerPC; this doc is what Google found.)
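Concretely, the per-cache-block sequence looks something like this (my sketch, not lifted from the app note; the register choice and single-block scope are assumptions — for a larger range you'd loop the `dcbf`/`icbi` pair over every cache block it covers):

```
# r3 holds the address of the modified cache block
dcbf   0, r3     # write the modified data block back to memory
msync            # wait until that store is visible beyond the D-cache
icbi   0, r3     # invalidate the stale copy in the I-cache
msync            # order the icbi before the context synchronization
isync            # discard any prefetched instructions and refetch
```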
Strangely, GNU C's `__builtin___clear_cache(start, end)` doesn't emit anything with GCC for PowerPC (https://godbolt.org/z/nbre7Y). Possibly because the recommended procedure involves marking the page non-executable, which you can't do from user-space without a system call?
A debugger modifying the memory of a process while it's stopped has an easier time: OSes already have to make sure that pages loaded into memory, or modified with `ptrace`, are safe to execute code from, so the debugger can leave most of the flushing to the OS.
When the kernel returns to user-space in the process being debugged, if the first instruction (or any later one) it fetches is a debug-trap / software-breakpoint instruction, it will trap.