How to wraparound in the 64KB code segment in a program that stitches itself to its own tail, ad infinitum?

Question

If the sequential execution of instructions passes offset 65535, then the 8086 will fetch the next instruction byte from offset 0 in the same code segment.

The following .COM program uses this fact and continually stitches its entire code (32 bytes in total) to its own tail, wrapping around in the 64KB code segment. You could call this a binary quine.

    ORG 256            ; .COM programs start with CS=DS=ES=SS

Begin:
    mov  ax, cs        ; 2 Providing an exterior stack
    add  ax, 4096      ; 3
    mov  ss, ax        ; 2
    mov  sp, 256       ; 3
    cli                ; 1
    call Next          ; 3 This gets encoded with a relative offset
Next:
    pop  bx            ; 1  -> BX is current address of Next
    sub  bx, 14        ; 3  -> BX is current address of Begin
More:
    mov  al, [bx]      ; 2
    mov  [bx+32], al   ; 3
    inc  bx            ; 1
    test bx, 31        ; 4
    jnz  More          ; 2
    nop                ; 1 Padding up to 32 bytes
    nop                ; 1

For the benefit of the call and pop instructions, the program sets up a small stack outside the code segment. I don't think the cli is really necessary because we do have a stack.
Once we have calculated the address of the current start of our 32-byte program, we copy it 32 bytes higher in memory. All the BX pointer arithmetic will wrap around.
We then fall through into the newly written code.
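To sanity-check the pointer arithmetic, here is a minimal Python sketch that models the copy loop under 16-bit wraparound. It is only a model, not the .COM program: the 32-byte payload is a stand-in rather than the real machine code, and the names `copy_once` and `run` are made up for this illustration.

```python
SEGMENT_SIZE = 0x10000   # 64 KiB, the range of a 16-bit offset
PROGRAM_SIZE = 32        # program size stated in the question
LOAD_OFFSET = 256        # .COM programs start at offset 256


def copy_once(memory, begin):
    """Model the More loop: copy 32 bytes from begin to begin+32.

    BX and BX+32 are both truncated to 16 bits, as on the 8086.
    Returns the value of BX when the loop exits (the new start).
    """
    bx = begin
    while True:
        memory[(bx + 32) & 0xFFFF] = memory[bx]   # mov al,[bx] / mov [bx+32],al
        bx = (bx + 1) & 0xFFFF                    # inc bx
        if bx & 31 == 0:                          # test bx,31 / jnz More
            return bx


def run(copies):
    """Place a stand-in payload at offset 256 and replicate it `copies` times."""
    memory = bytearray(SEGMENT_SIZE)
    payload = bytes(range(1, PROGRAM_SIZE + 1))   # assumption: not real opcodes
    memory[LOAD_OFFSET:LOAD_OFFSET + PROGRAM_SIZE] = payload
    begin = LOAD_OFFSET
    for _ in range(copies):
        begin = copy_once(memory, begin)
    return memory, begin, payload


if __name__ == "__main__":
    memory, begin, payload = run(2040)
    print(hex(begin), bytes(memory[0:32]) == payload)
```

Because 256 and 32 both divide 65536, the copy that starts at offset 65504 lands exactly on offset 0, so no copy ever straddles the segment boundary in this model.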

If the sequential execution of instructions passes offset 65535, then the 80386 will trigger exception 13.

Assuming that I include the necessary setup for an exception handler, would it be enough to just execute a far jump to the start of this code segment (where the newly written code sits waiting)? And would such a solution remain valid on post 80386 CPU's?

Related: Is it possible to make an assembly program that writes itself forever?

This could work only if an instruction boundary lands on the segment boundary. Is the size of your program a power of two, so that it divides evenly into the segment? — prl, Aug 08 '21 at 21:40
@prl The code has 32 bytes and begins at offset 256. Both are suitable powers of two. If I were to extend this it would become 64 bytes. — Sep Roland, Aug 08 '21 at 21:42
"And would such a solution remain valid on post 80386 CPU's?" I do believe so. — ecm, Aug 08 '21 at 22:05
Could you end with a `jmp $+2` instead of a true `nop` to explicitly truncate IP on 386, if that's necessary? I guess you could look at that as actually jumping instead of letting the wrap-around happen naturally, though. (It would also avoid stale instruction prefetch on all real CPUs, even if they have larger prefetch buffers than the 6 bytes on 8086 which is small enough to not be a problem.) — Peter Cordes, Aug 08 '21 at 22:56
@PeterCordes That's a nice way to avoid sequential execution wraparound, but even with `jmp $+2` the CPU will have to do a 64KB wrap on the `IP` register. So maybe that's the same thing... Also, the question that inspired me kind of insisted on falling through into the new code. A *jumping solution* that cannot/should not fail would be to replace the `nop`s by a `jmp bx` instruction. The wrap-around on the `BX` register is always safe. — Sep Roland, Aug 08 '21 at 23:18
@SepRoland: 16-bit operand-size jumps *do* truncate IP on real hardware, without faulting. This is why o16 `jmp rel16` isn't usable in normal 32-bit code ([GAS assembler not using 2-byte relative JMP displacement encoding (only 1-byte or 4-byte)](https://stackoverflow.com/a/50341926)). That doesn't fault, even if letting execution go on its own would on 386 like your quote indicates. (i.e. if program-counter increments implicitly use EIP, so they'd violate the segment limit if carrying non-zero bits into the top half of EIP) — Peter Cordes, Aug 08 '21 at 23:23
Ok, Margaret's answer confirms we don't need explicit truncation of IP; it will wrap on its own without a jmp even on 386, as long as you don't have a dangling multi-byte instruction. BTW, [supercat suggested](https://stackoverflow.com/questions/68696831/is-it-possible-to-make-an-assembly-program-that-writes-itself-forever#comment121405878_68697102) incrementing another pointer in lockstep with IP instead of copying it every time, e.g. `ADD SP,SI / PUSH AX / PUSH BX / PUSH CX / NOP` to store instruction bytes. That's a multiple of 6, but could be adapted to 8 bytes. — Peter Cordes, Aug 09 '21 at 09:19

score 3 · Accepted Answer · answered Aug 09 '21 at 08:59

In 16-bit mode (real or protected), the IP register will wrap around at 64KiB without any fault, provided that no instruction crosses the 64KiB boundary (e.g. a two-byte instruction placed at 0xffff).

A crossing instruction will fault on an 80386+; I'm not sure what happens on earlier models (read the next byte in the linear address space? read the next byte from offset 0?).

Note that this works because the segment limit is the same as the IP register "limit".
In 16-bit protected mode you can set a segment limit of less than 64KiB; in that case, execution will fault when it reaches the end of the segment.
In short (and figuratively), the CPU makes sure all the bytes it needs are within the segment limit and then will increment the program counter without overflow detection.
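That figurative description can be written down as a tiny Python model. This is only a sketch of the rule as worded above, not a statement of what any particular CPU does internally; `step` and `GeneralProtectionFault` are names invented for the illustration.

```python
class GeneralProtectionFault(Exception):
    """Stands in for exception 13 in this model."""


def step(ip, length, limit=0xFFFF):
    """Advance IP past one instruction of `length` bytes.

    First check that every byte of the instruction lies within the
    segment limit, then increment IP with plain 16-bit truncation
    (no overflow detection).
    """
    last_byte = ip + length - 1
    if last_byte > limit:
        raise GeneralProtectionFault(
            f"instruction at {ip:#06x} (length {length}) exceeds limit {limit:#06x}"
        )
    return (ip + length) & 0xFFFF


if __name__ == "__main__":
    print(hex(step(0xFFFF, 1)))   # one-byte instruction ending at the boundary
    print(hex(step(0xFFFE, 2)))   # two-byte instruction ending at the boundary
    try:
        step(0xFFFF, 2)           # two-byte instruction crossing the boundary
    except GeneralProtectionFault as fault:
        print("fault:", fault)
```

With a limit below 0xFFFF, the same check makes the model fault as soon as IP moves past the limit, which matches the remark about smaller segment limits in 16-bit protected mode.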

So your program should work.

It's probably a bit of a stretch to call it a quine because it's reading its own machine code and that's cheating (just like reading the source code file is for high-level languages).

I haven't tested it, but a very minimal example of a program "kind of replicating" itself could be:

 ;Setup (assuming ES=CS)
 mov al, 0aah       ;This encodes stosb (0abh would be stosw)
 mov di, _next      ;Where to start writing the instruction stream

 stosb              ;Let's roll

_next:

This is also not a quine because only the stosb is replicated.
Making a quine is hard: the stores must be instructions whose encoding is smaller than the data they store, or we will always have more bytes to write than we have written.
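To illustrate the break-even case, here is a rough Python model of the `stosb` chain: one byte of code stores exactly one byte, so the instruction stream only ever extends itself by another `stosb`. It assumes ES=CS and a clear direction flag, uses 0xAA as the `stosb` opcode, and picks an arbitrary start offset; `run_stosb_chain` is a made-up name.

```python
STOSB = 0xAA  # opcode of stosb


def run_stosb_chain(steps, start=0x0110):
    """Execute `steps` stosb instructions, each writing the next stosb.

    memory[start] holds the single hand-assembled stosb, DI points at
    the byte right after it (the _next label), and AL holds the opcode.
    """
    memory = bytearray(0x10000)
    memory[start] = STOSB
    ip = start
    di = (start + 1) & 0xFFFF
    al = STOSB
    for _ in range(steps):
        if memory[ip] != STOSB:
            raise RuntimeError(f"no stosb at {ip:#06x}; the chain is broken")
        memory[di] = al            # stosb: store AL at ES:DI
        di = (di + 1) & 0xFFFF     # ...and increment DI
        ip = (ip + 1) & 0xFFFF     # one-byte instruction, so IP advances by 1
    return memory, ip, di


if __name__ == "__main__":
    memory, ip, di = run_stosb_chain(100)
    print(hex(ip), hex(di))
```

DI always stays exactly one byte ahead of IP, so the code never runs out of instructions but also never gets the chance to write anything other than itself.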

A quine is allowed to include data as part of itself. So one could e.g. have 32 bytes of code followed by a one's-complemented copy thereof, and have that code read the one's-complement copy, duplicate it ahead 32 bytes, and then bit-invert it prior to execution reaching it. — supercat, Aug 09 '21 at 19:11
By the Pentium I this didn't work. It faulted with IP=10000h. I tried it on a physical Pentium I. Anybody got an 8086 or 80286 to try it on? I suspect it won't work on a 386. — Joshua, Jan 11 '22 at 21:43
`A crossing instruction will fault on an 80386+`. That is true, but a crossing instruction will already fault on the 80286+. On an 8086 (80186) or 8088 (80188), the crossing instruction will wrap, since the BIU is used during prefetch (a word is fetched on the 86 processors and a byte on the 88 processors). Because the BIU is used both to read data from memory (needed by an instruction) and to fill the instruction prefetch queue, the same wrapping applies to instructions and data. Simply put, a crossing instruction on processors earlier than the 80286 wraps back to 0. — Michael Petch, Apr 13 '23 at 20:20
