What algorithms and/or patterns to use for an In-Circuit Processor Emulator (Z80)

Question

For my electronics hobby where I make a Z80 computer system, I am building a Z80 in-circuit emulator. The idea is that the physical Z80 chip is removed from the circuit and the emulator is inserted in its socket and emulates the Z80 precisely. Additionally the emulator would implement debug and diagnostic support - but that is not what the question is about. The idea now is that this in-circuit emulator will run inside a PSoC5 module and talk to the PC over USB.

I have currently set up a ginormous state machine in code (C) that is advanced on every clock edge (positive and negative), i.e. twice per clock cycle. I call these clock ticks.

The problem is that this state-machine code is becoming unwieldily large and complex.

I have generated structs for every Z80 instruction that contain details on which functions to call for each processing cycle. (A Z80 instruction may take as many as 6 processing (machine) cycles, each of which takes at least 3 (usually 4 or more) clock cycles.)

Here is an example of a more elaborate instruction that takes 4 machine cycles to complete. The strange names are used to encode the attributes of each instruction and generate unique names. During each machine cycle, the appropriate OnClock_Xxxx function is called multiple times, once for each clock tick within that machine cycle.

// ADD IY, SP   -  ADDIY_SP_FD2  -  FD, 39
const InstructionInfo instructionInfoADDIY_SP_FD2 =
{
    4,
    0,
    {
        { 4, OnClock_OF },
        { 4, OnClock_ADDIY_o_FD2_OF },
        { 4, OnClock_ADDIY_o_FD2_OP },
        { 3, OnClock_ADDIY_o_FD2_OP },
        { 0, nullptr },
        { 0, nullptr },
    },
    {
        { Type_RegistersSP16, {3} },
        { Type_None, {0} },
    }
};

References to these instruction information structs are stored in tables for quick lookup during decoding.

I have a global structure that contains the state of the Z80, like clock cycle counting, registers and state used during instruction processing - like operands etc. All code operates on this global state.

To interact with the host (either a unit test or the PSoC5 micro-controller) I have set up a simple interface that controls the pins of the Z80, either requesting input (read the data bus) or output (activate MREQ).

To implement the state-machine in code I have used a dirty C-trick that involves jumping in and out of a switch statement, tucked away behind macros. This makes the code readable as normal (but async) code.

Here's an example of what this async state-machine code would look like for the logic to fetch and decode an opcode:

Async_Function(FetchDecode)
{
    AssertClock(M1, T1, Level_PosEdge, 1);
    setRefresh(Inactive);
    setAddressPC();
    setM1(Active);
    Async_Yield();

    _state.Clock.TL++;

    AssertClock(M1, T1, Level_NegEdge, 2);
    setMemReq(Active);
    setRd(Active);
    Async_Yield();

    NextTCycle();

    AssertClock(M1, T2, Level_PosEdge, 3);
    // time for some book keeping
    if (_state.Instruction.InstructionAddress == 0)
        _state.Instruction.InstructionAddress = _state.Registers.PC - 1;
    Async_Yield();

    _state.Clock.TL++;

    AssertClock(M1, T2, Level_NegEdge, 4);
    Async_Yield();

    NextTCycle();

    AssertClock(M1, T3, Level_PosEdge, 5);
    _state.Instruction.Data = getDataBus();
    setRd(Inactive);
    setMemReq(Inactive);
    setM1(Inactive);
    setAddressIR();
    setRefresh(Active);
    Async_Yield();

    _state.Clock.TL++;

    AssertClock(M1, T3, Level_NegEdge, 6);
    setMemReq(Active);
    Decode();
    Async_Yield();
}
Async_End

Async_Yield() exits the function, and the next call to the function resumes execution at that point.
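
For readers unfamiliar with the trick, here is a minimal sketch of how such macros can be built on a switch statement. The names ASYNC_BEGIN, ASYNC_YIELD, ASYNC_END, AsyncState and Demo are hypothetical stand-ins; the real definitions behind Async_Function, Async_Yield and Async_End are not shown here and may well differ.

```c
/* Hypothetical stand-ins for the Async_* macros; the real ones may differ. */
typedef struct { int resume_line; } AsyncState;

/* Open a switch on the saved line number; 0 means "start from the top". */
#define ASYNC_BEGIN(s) switch ((s)->resume_line) { case 0:

/* Save the current line, return to the caller, and plant a case label so
   that the next call jumps straight back to this spot. */
#define ASYNC_YIELD(s) \
    do { (s)->resume_line = __LINE__; return 0; case __LINE__:; } while (0)

/* Close the switch, rearm for the next run, and report completion. */
#define ASYNC_END(s) } (s)->resume_line = 0; return 1

typedef struct {
    AsyncState async;
    int last_step;   /* which step ran most recently */
    int steps_run;   /* how many steps have run in total */
} Demo;

/* Returns 0 while suspended at a yield, 1 once it has run to the end.
   Anything that must survive a yield lives in the struct, not in locals. */
static int three_steps(Demo *d)
{
    ASYNC_BEGIN(&d->async);

    d->last_step = 1;           /* work for the first clock tick */
    d->steps_run++;
    ASYNC_YIELD(&d->async);

    d->last_step = 2;           /* work for the second clock tick */
    d->steps_run++;
    ASYNC_YIELD(&d->async);

    d->last_step = 3;           /* work for the third clock tick */
    d->steps_run++;

    ASYNC_END(&d->async);
}
```

Calling three_steps() three times on a zero-initialised Demo runs one step per call. Each yield must sit on its own source line, because the line number is the resume key.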

Ok, now for the question: I have trouble getting the state machine to behave just right, which makes me question my line of reasoning about the problem. Because processing the more complex instructions involves a lot more states in the state machine, I find it hard to reason about the code, which is a sign/smell.

Are there any obvious algorithms and/or patterns that one uses for writing this type of clock-cycle accurate emulator?

maybe best suited for http://codereview.stackexchange.com or software engineering. — Jean-François Fabre, Aug 25 '19 at 07:23
I didn't know about codereview, but I checked software engineering - that is meant for more ALM and deployment types subjects. https://meta.stackoverflow.com/questions/254570/choosing-between-stack-overflow-and-software-engineering — obiwanjacobi, Aug 25 '19 at 07:27
maybe retrocomputing then. As much as I like retrocomputing & old CPUs, I'm not sure you'll get an answer on SO. For instance _"Are there any obvious algorithms and/or patterns that one uses for writing this type of clock-cycle accurate emulator?"_ look very much like a recommendation. You may want to rephrase this, as well researched your question looks. — Jean-François Fabre, Aug 25 '19 at 09:53
see [What's the proper implementation for hardware emulation?](https://stackoverflow.com/a/18911590/2521214) it holds a link to mine Z80 iset with per MC timing with types of MC which is more or less what you do now hardcoding. Its a formatted TXT file that can be loaded into your emulator directly ... simplifying the CPU core a lot ... It contains all opcodes up to 4Byte (included) and passing ZEXALL 100%. So in your case I would just encode the timing of each MC type (its just few of them) ... t move from MC timing into individual clock cycles — Spektre, Feb 13 '20 at 16:19

Tommy · Answer 1 · 2019-08-25T18:29:44.467

I've written similar code twice, supposing that implies that I know anything, and have separately implemented similar simulations of the 6502 and 68000.

I think the main tip is: there are only a very small number of potential machine cycles, and they present the same bus activity (data lines aside) regardless of the instruction involved. That implies you can avoid lengthy, hard-to-maintain code either with an extra level of indirection at runtime or through automated code construction. I tend just to rely on the preprocessor, but others have written code that constructs code.

So e.g. instead of writing out, say, PUSH at length, you can describe it compactly as:

the standard fetch, decode, execute;
decrement the stack pointer;
perform a standard 3-cycle write machine cycle to the stack pointer with the high part of whatever you're writing;
decrement the stack pointer;
perform a standard 3-cycle write machine cycle to the stack pointer with the low part of whatever you're writing.

There's an underlying fiction here: you're supposing you can just decrement the stack pointer in zero units of time between standard machine cycles. But the benefit of adopting that fiction is being able to use standard machine cycles in between.
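
As a rough sketch of what that compact description could look like as data, here is the write half of a PUSH BC expressed as a table. Everything in it is illustrative: the MicroOp layout, the Regs struct, the memory array and run_untimed() are assumptions made for the example, not code from my emulator. The shared fetch-decode-execute cycle is left out of the table.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { Decrement16, PerformMachineCycle, BeginNextInstruction } Action;
typedef enum { NoCycle, Write3 } MachineCycle;

typedef struct {
    Action action;
    MachineCycle machine_cycle;
    uint16_t *u16;   /* register to adjust, or address source for a cycle */
    uint8_t *u8;     /* byte to write */
} MicroOp;

/* Hypothetical register file and memory, just enough for the example. */
typedef struct { uint8_t b, c; uint16_t sp; } Regs;
static Regs regs;
static uint8_t memory[65536];

/* PUSH BC after the opcode fetch: high byte first, then low byte. */
static const MicroOp push_bc[] = {
    { Decrement16,          NoCycle, &regs.sp, NULL    },
    { PerformMachineCycle,  Write3,  &regs.sp, &regs.b },
    { Decrement16,          NoCycle, &regs.sp, NULL    },
    { PerformMachineCycle,  Write3,  &regs.sp, &regs.c },
    { BeginNextInstruction, NoCycle, NULL,     NULL    },
};

/* Runs one program without modelling individual clock edges and returns
   the T-states spent in its machine cycles. Uncounted operations add 0. */
static int run_untimed(const MicroOp *op)
{
    int t_states = 0;
    for (;; ++op) {
        switch (op->action) {
        case Decrement16:
            --*op->u16;
            break;
        case PerformMachineCycle:
            if (op->machine_cycle == Write3) {
                memory[*op->u16] = *op->u8;
                t_states += 3;
            }
            break;
        case BeginNextInstruction:
            return t_states;
        }
    }
}
```

With this shape, adding an instruction means adding a short table rather than another hand-written state sequence.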

If you follow this line of implementation you'll likely end up at a loop more like:

MicroOp *next_op = start of reset program;
while(true) {
    MicroOp *op = next_op;
    next_op = op + 1;

    switch(op->action) {
        case Increment16: ++op->u16; continue;
        case Decrement16: 
            ... etc, etc, etc, all uncounted operations ending in continue ...

        case BeginNextInstruction:
            next_op = fetch-decode-execute operations;
        continue;

        case PerformMachineCycle: break;
    }

    /* Begin machine cycle execution. */

    switch(op->machine_cycle) {
        case Read3:
            ... stuff of a standard 3-cycle read, from op->address to op->u8 ...
        break;
        case Write3:
            ... etc, etc ...
    }
}

It sounds like you actually want your loop to be interruptible, in which case the only place you can possibly need to return from is within the machine-cycle execution part at the bottom, since that's the only part that actually costs time. You can just keep an independent counter like 'number of half-cycles into this machine cycle' and do an appropriate switch jump within the outer op->machine_cycle switch.
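
A minimal sketch of that interruptible form, under stated assumptions: on_clock_edge(), the Cpu and Pins structs and the demo program are hypothetical names, only a 3-cycle write is modelled, and the edge-by-edge pin changes follow my reading of the memory write timing, so check them against the datasheet before relying on them.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { Decrement16, PerformMachineCycle, EndProgram } Action;
typedef enum { NoCycle, Write3 } MachineCycle;

typedef struct {
    Action action;
    MachineCycle machine_cycle;
    uint16_t *u16;   /* register to adjust, or address source for a cycle */
    uint8_t *u8;     /* byte to write */
} MicroOp;

/* Hypothetical subset of pins the host would copy onto the socket.
   true means "active", whatever the electrical level is. */
typedef struct { uint16_t address; uint8_t data; bool mreq, wr; } Pins;

typedef struct {
    const MicroOp *op;  /* next micro-op to run */
    int half_cycle;     /* half-cycles already done in this machine cycle */
    Pins pins;
} Cpu;

/* Call once per clock edge. Returns false once the program has ended. */
static bool on_clock_edge(Cpu *cpu)
{
    /* Uncounted operations cost no time, so run them all right away. */
    while (cpu->op->action != PerformMachineCycle) {
        if (cpu->op->action == EndProgram) return false;
        if (cpu->op->action == Decrement16) --*cpu->op->u16;
        ++cpu->op;
    }

    switch (cpu->op->machine_cycle) {
    case Write3:
        switch (cpu->half_cycle) {
        case 0: cpu->pins.address = *cpu->op->u16; break;     /* T1 rising  */
        case 1: cpu->pins.mreq = true;                        /* T1 falling */
                cpu->pins.data = *cpu->op->u8; break;
        case 3: cpu->pins.wr = true; break;                   /* T2 falling */
        case 5: cpu->pins.mreq = cpu->pins.wr = false; break; /* T3 falling */
        default: break;                                       /* idle edges */
        }
        break;
    default:
        break;
    }

    /* Every cycle in this sketch is 3 T-states, i.e. 6 half-cycles. */
    if (++cpu->half_cycle == 6) {
        cpu->half_cycle = 0;
        ++cpu->op;
    }
    return true;
}

/* Demo program: the two write cycles of a PUSH, high byte then low byte. */
static uint16_t sp = 0x8000;
static uint8_t hi = 0x12, lo = 0x34;
static const MicroOp push_program[] = {
    { Decrement16,         NoCycle, &sp,  NULL },
    { PerformMachineCycle, Write3,  &sp,  &hi  },
    { Decrement16,         NoCycle, &sp,  NULL },
    { PerformMachineCycle, Write3,  &sp,  &lo  },
    { EndProgram,          NoCycle, NULL, NULL },
};
```

The host calls on_clock_edge() from its edge interrupt and copies cpu.pins to the socket after each call. The only state carried between calls is the micro-op pointer and the half-cycle counter.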

That's not exactly how my main loop is formed, but it's close enough; I've 546 lines total to set up the micro-op program for each instruction. I do that programmatically at construction time. For the Z80 it's largely macro-based table formulation, though on the 68000 I ended up with what amounts to a disassembler, so definitely go that way if you want: actually pulling out the individual fields and processing them is a great safeguard against an obscure table typo.

The code that executes whatever I've stored as my micro-ops is 1062 lines.

Mine's actually set up to talk at the metacycle level, so it'll directly broadcast "I now performed a 3-cycle read" rather than spelling out the 6 half-cycle states in between, but it is announcing at half-cycle precision and providing exactly the quantity of detail that would allow broadcast at half-cycle fidelity. I've just omitted an extra level of bus interfacing for computational simplicity as mine isn't talking to original hardware unlike yours. But there's no semantic loss of detail.

In an earlier implementation I avoided that simplification: everything was announced as the complete bus state — as a primitive 64-bit int containing those of the original 40 pins that carry signals rather than power or ground. That was great, but computationally prohibitive just because of the sheer number of function calls once you had a few components listening to the bus, and the effect that leaping all over the place like that has on a processor cache.
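
To illustrate the idea of a bus state held in one integer, here is a small sketch. The bit layout and all names (BusState, LINE_MREQ, bus_set_line and so on) are invented for the example and are not the layout I actually used.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t BusState;

/* Hypothetical bit layout; any fixed assignment of signals to bits works. */
enum {
    ADDRESS_SHIFT = 0,    /* A0..A15 in bits 0..15  */
    DATA_SHIFT    = 16,   /* D0..D7  in bits 16..23 */
    LINE_M1 = 24, LINE_MREQ, LINE_IORQ, LINE_RD, LINE_WR, LINE_RFSH
};

static BusState bus_set_address(BusState bus, uint16_t address)
{
    bus &= ~((BusState)0xFFFF << ADDRESS_SHIFT);
    return bus | ((BusState)address << ADDRESS_SHIFT);
}

static BusState bus_set_data(BusState bus, uint8_t data)
{
    bus &= ~((BusState)0xFF << DATA_SHIFT);
    return bus | ((BusState)data << DATA_SHIFT);
}

/* "active" is the logical state; the electrical level is the host's job. */
static BusState bus_set_line(BusState bus, int line, bool active)
{
    BusState mask = (BusState)1 << line;
    return active ? (bus | mask) : (bus & ~mask);
}

static uint16_t bus_address(BusState bus)
{
    return (uint16_t)(bus >> ADDRESS_SHIFT);
}

static uint8_t bus_data(BusState bus)
{
    return (uint8_t)(bus >> DATA_SHIFT);
}

static bool bus_line(BusState bus, int line)
{
    return ((bus >> line) & 1) != 0;
}
```

Passing the whole state by value keeps each listener's interface trivial, but every half-cycle still costs one call per listener, which is where the overhead comes from.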

I need to trigger each clock-tick of the hardware clock signal provided by the circuit. So metacycles is not going to cut it. This code runs instead of the real chip. — obiwanjacobi, Aug 25 '19 at 19:11
@obiwanjacobi right, that's what the "as mine isn't talking to original hardware unlike yours" proviso covers. I mean, you could broadcast metacycles to somebody that then serialised them for the bus and then requested the next metacycle on demand, in which case you'd keep the `while(true)` but as soon as you get down to the `switch(op->machine_cycle)`, just return the machine cycle. The caller then has the "if this is a refresh cycle, sequence out these four bus states, one on each of the next clock edges" logic in it. — Tommy, Aug 25 '19 at 19:19
