How to write two bytes to a chunk of RAM repeatedly in Z80 asm

Question

I'm trying to write two bytes (color values) to the VRAM of my TI-84 Plus CE-T calculator, which uses the Zilog eZ80 CPU. The VRAM starts at 0xD40000 and is 0x25800 bytes long. The calculator has a built-in syscall called MemSet, which fills a chunk of memory with one byte, but I want to fill the memory with two alternating values instead. I tried using the following code:

#include "includes\ti84pce.inc"

    .assume ADL=1
    .org userMem-2
    .db tExtTok,tAsm84CeCmp

    call  _homeup
    call  _ClrScrnFull
    ld    hl,13893632     ; = D40000, vram start
    ld    bc,153600       ; = 025800, count/vram length
j1:
    ld    (hl),31         ; set first byte
    inc   hl
    dec   bc
    jr    z,j2            ; jump to end if count==0
    ld    (hl),0          ; set second byte
    inc   hl
    dec   bc
    jr    z,j2            ; jump to end if count==0
    jp    j1              ; loop
j2:
    call  _GetKey
    call  _ClrScrnFull
    ret

I want it to output 31 00 31 00 31 00... into memory starting at 0xD40000, but instead it seems to change only the first byte and jump to the end after doing so. Any ideas on how to fix this?

The equivalent C function for patterns wider than 1 byte is "wmemset", which takes a `wchar_t` (wide char) argument. I wouldn't be surprised if your calculator is lacking that, but other platforms have it; e.g. glibc has an asm-optimized wmemset for various targets. — Peter Cordes, Aug 14 '19 at 12:32
You should really consider using `fasmg` for your assembly programming. It is far superior and actually supported. Here's more information: https://github.com/CE-Programming/documentation — MateoConLechuga, Aug 20 '19 at 03:44

DrDnar · Answer 1 · 2019-08-21T07:43:00.013

7

First of all, if you're going to move SP, you need to save and restore it. Second, you need to disable interrupts or else you'll have a race condition bug: if an interrupt triggers near the end of the copy, the stack will grow down into whatever is below it, which happens to be the VAT.

; Index registers are actually fast on the eZ80
    ld   ix, 0
    add  ix, sp
    di
; Do some hack using SP here
    ld   sp, ix
    ei

@Ped7g The eZ80 will cache any -IR/-DR suffix instruction; unlike the Z80, it doesn't reread the opcode from memory on each iteration. Consequently, instructions like LDIR can execute each iteration in just 2 bus cycles, one read and one write. ~~The SP hack is therefore not only needlessly complicated, but actually slower.~~ The SP hack is still best left to more experienced programmers.

The eZ80 is very well pipelined and its performance is limited by its lack of any cache and 1-byte-wide bus. The only instruction that runs slower than the bus is MLT, a 2-bus-cycle instruction that needs 5 clock cycles. For every other instruction, just count the number of bytes in the opcode, and the number of read and write cycles, and you've got its execution time. It's a huge pity that in the TI-84+CE series, TI decided to pair the fast eZ80 with an SRAM that somehow needs four clock cycles for each read and write (at 48 MHz)! Yes, TI, a world leader in semiconductor design, managed to design a slow SRAM. Getting on-die SRAM to perform poorly is an engineering feat.

@harold has the right answer, though I prefer optimizing for size instead of speed outside of inner loops.

#include "includes\ti84pce.inc"

    .assume ADL=1
    .org userMem-2
    .db tExtTok,tAsm84CeCmp

    call  _homeup
    call  _ClrScrnFull
; Initialize registers
    ld    hl, vRam
    ld    bc, lcdWidth * lcdHeight * 2 - 2
    push  hl
    pop   de
; Write initial 2-byte value
    ld    (hl), 31
    inc   hl
    ld    (hl), 0
    inc   hl
    ex    de, hl
; Copy everything all at once.  Interrupts may trigger while this instruction is processing.
    ldir
    call  _GetKey
    call  _ClrScrnFull
    ret

On EFnet, #ez80-dev is a good place to ask questions. cemetech.net is also a good place.

edited Aug 21 '19 at 07:43

answered Aug 20 '19 at 02:33


thank you for the details on eZ80, it's interesting read for me (I was doing Z80 on ZX Spectrum, never getting close to eZ80 later). – Ped7g Aug 20 '19 at 04:40
Cheers, a very good answer indeed. The info on slow SRAM on that particular HW is new and surely requires the whole approach to be reconsidered. As a side note: [Retrocomputing](https://retrocomputing.stackexchange.com/) is also a good resource for questions like this. – tum_ Aug 20 '19 at 06:29
@DrDnar: I have added an 'Update 3' to my "answer". Would you care to comment, please? – tum_ Aug 20 '19 at 10:07
@DrDnar: "Second, you need to disable interrupts or else you'll get a bit of graphical garbage because you're copying to higher addresses but the stack pushes to lower addresses." - hmm? Not 'copying' but writing the register content into memory ('filling') and by using `PUSH` this is also done towards the lower addresses. – tum_ Aug 20 '19 at 14:49
You're right, for some strange reason I was thinking about LDIR there. But interrupts still need to be disabled because an interrupt near the end of the copy will corrupt the VAT. – DrDnar Aug 20 '19 at 15:25
Yes, I was just not sure what is default state on TI, and whether even disabling interrupts is feasible... but as long as it is similar to ZX in this regard (i.e. by default after ROM init the interrupts works, and can be disabled without any major consequence to user code), that push filler should be enclosed in `DI`/`EI` (before `sp` is modified, and after it is restored). – Ped7g Aug 20 '19 at 15:38
@Ped7g and DrDnar: if it's a very big fill, might be worth it to use `push` for most of it (with interrupts enabled), but stop 128 or 256 bytes short (or whatever the max interrupt stack consumption is) and fill the rest with a separate safe loop. Costs more code size and some time outside the main loop so only worth it if the speedup vs. 2nd-best fill loop pays for the extra overhead. Of course this might not be OK for video RAM that can be read asynchronously and display the "garbage", or other async DMA cases. (IDK anything about TI calculators or much about Z80, but I grok interrupts.) – Peter Cordes Aug 21 '19 at 07:53

score 6 · Answer 2 · answered Aug 13 '19 at 21:04

This does not work:

dec   bc
jr    z,j2

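Only the 8-bit forms of dec and inc modify the flags. `dec bc` leaves them unchanged, so the `jr z` tests a Z flag left over from an earlier instruction rather than the counter. It could be fixed by properly detecting whether bc is zero.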
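One way to do that, as a minimal sketch rather than a tested program: keep the counter in HL and decrement it with `sbc hl,bc`, which does set the Z flag. This assumes the same assembler syntax and ADL mode as the question; the label `fill` and the choice to count 2-byte pairs instead of bytes are made up for illustration. Note that the classic Z80 test `ld a,b` / `or c` only looks at the low 16 bits, so with a 24-bit count such as $025800 it would stop early when the counter passes $020000.

```asm
    ld    de,$D40000      ; DE = destination pointer (start of VRAM)
    ld    hl,76800        ; HL = number of 2-byte pairs to write ($25800 / 2)
    ld    bc,1            ; BC = constant 1, subtracted from HL on every pass
fill:
    ld    a,31
    ld    (de),a          ; first byte of the pair
    inc   de              ; 24-bit inc, flags untouched
    xor   a               ; A = 0, and the carry flag is cleared
    ld    (de),a          ; second byte of the pair
    inc   de              ; still no flag changes, so carry stays clear
    sbc   hl,bc           ; HL = HL - 1; sets Z when HL reaches zero
    jr    nz,fill         ; repeat until all pairs are written
```

The loop only ends on an exact zero, so the pair count has to be at least 1 on entry.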

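Here is a different technique without manual looping: write the first two bytes by hand, then let an overlapping `ldir` copy them forward. Because the destination is two bytes ahead of the source, the pair repeats through the whole range.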

ld    hl,$D40000
ld    (hl),31
inc   hl
ld    (hl),0
dec   hl
ld    de,$D40002
ld    bc,$25800 - 2
ldir

Ped7g · Accepted Answer · 2019-08-14T14:20:12.100

5

A variation of tum_'s answer that loops with `djnz` and an 8-bit outer counter, which is faster than testing `bc` for zero after a regular `dec bc`.

    LD   SP,$D65800    ; <end of VRAM>: 0xD40000+0x25800
    LD   BC,$004B      ; 0x4B many times (in C) the 256x inner loop (B=0)
        ; that results into 0x4B00 repeats of loop, which when 8 bytes per loop
        ; are set makes the total 0x25800 bytes (VRAM size)
        ; (if you would unroll it for more than 8 bytes, it will be a bit more
        ; tricky to calculate the initial BC to get correct amount of looping)
        ; (not that much tricky, just a tiny bit)
    LD   HL,31         ; H <- 0, L <- 31
.L1
    PUSH HL            ; (SP – 2) <- L, (SP – 1) <- H, SP <- SP - 2
    PUSH HL            ; set 8 bytes in each iteration
    PUSH HL
    PUSH HL
    DJNZ .L1           ; loop by B value (in this example it starts as 0 => 256x loop)
    DEC  C             ; loop by C ("outer" counter)
    JR   NZ,.L1        ; btw JP is faster than JR on original Z80, but not on eZ80
.END

(BTW I never did eZ80 programming, and I didn't verify this in a debugger, so this is kinda full of assumptions... actually, thinking about it, isn't push on eZ80 32 bit? Then the init of hl should be ld hl,$001F001F to set four bytes with a single push, and the inner body of the loop should have only two push hl)

(but I did ton of Z80 programming, so that's why I even bother with comment on this topic, even if I haven't seen eZ80 code ever before)

Edit: turns out the eZ80 push is 24 bit, i.e. the code above will produce an incorrect result. It can of course be easily fixed (the issue is an implementation detail, not a matter of principle), like:

    LD   SP,$D65800    ; <end of VRAM>: 0xD40000+0x25800
    LD   BC,$0014      ; 0x14 many times (in C) the 256x inner loop (B=0)
        ; that results into 0x1400 repeats of loop, which with 30 bytes per
        ; loop set makes the total 0x25800 bytes (VRAM size)
    LD   HL,$1F001F    ; will set bytes 31,  0, 31
    LD   DE,$001F00    ; will set bytes  0, 31,  0
.L1
    PUSH DE
    PUSH HL
        ; here SP = SP-6, and 6 bytes 31, 0, 31, 0, 31, 0 were set
    PUSH DE
    PUSH HL
    PUSH DE
    PUSH HL
    PUSH DE
    PUSH HL
    PUSH DE
    PUSH HL            ; unrolled 5 times to set 30 bytes in total
    DJNZ .L1           ; loop by B value (in this example it starts as 0 => 256x loop)
    DEC  C             ; loop by C ("outer" counter)
    JR   NZ,.L1
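As the comments below discuss, the loop leaves SP pointing at the start of VRAM, so any `call` or `ret` after it will crash. The following is a hedged sketch of the same fill wrapped in SP preservation, using the IX-based save/restore from DrDnar's answer and disabling interrupts while SP is hijacked. It assumes IX is free to clobber and that the assembler wants `L1:` style labels (as melbok found); it has not been run on hardware.

```asm
    ld    ix,0
    add   ix,sp           ; IX = caller's stack pointer
    di                    ; an interrupt now would push onto the hijacked stack
    ld    sp,$D65800      ; end of VRAM: $D40000 + $25800
    ld    bc,$0014        ; B = 0 (256 inner passes), C = $14 outer passes
    ld    hl,$1F001F      ; pushed as bytes 31,  0, 31
    ld    de,$001F00      ; pushed as bytes  0, 31,  0
L1:
    push  de
    push  hl              ; 6 bytes set: 31, 0, 31, 0, 31, 0
    push  de
    push  hl
    push  de
    push  hl
    push  de
    push  hl
    push  de
    push  hl              ; unrolled 5 times: 30 bytes per pass
    djnz  L1              ; inner loop on B
    dec   c               ; outer loop on C
    jr    nz,L1
    ld    sp,ix           ; restore SP before any call or ret
    ei
```

After `ld sp,ix` the stack is valid again, so `call _GetKey` and the final `ret` from the question can follow safely.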

edited Aug 14 '19 at 14:20

answered Aug 14 '19 at 12:21


yeah, the **e**Z80 mentioned in the Q. vanished from my brain as soon as I started typing... Added an **Update** in the answer. – tum_ Aug 14 '19 at 13:39
It appears that HL,DE,BC,SP (and hence `push`) on eZ80 are 24 bit (in ADL mode, see OP's `.assume ADL=1`). Yes - 3 bytes. Most unusual )) – tum_ Aug 14 '19 at 14:05
@tum_ ok, so then the code in this answer requires two registers to be used to push 6B at least... and unrolling then ... LOL .. if you would unroll it 100 times, it would set 600 bytes, and that's precisely the 1/256 of the VRAM, i.e. you need 256x loop then. :D – Ped7g Aug 14 '19 at 14:13
+1 for the updated answer. It also appears that LDIR on eZ80 is 2 cycles per byte, while PUSH (in ADL mode) is 4 cycles which gives 4/3 cycles per byte. A smaller gain compared to z80 where LDIR is 21T vs PUSH's 5.5T (per byte). – tum_ Aug 14 '19 at 14:56
so after changing the `.L1` to `L1:`, the code works, but after executing the code and pressing a button on the calculator (`call _GetKey` in my original code, I added it to the end of this code as well), the calculator crashes. Any idea why it does this? – melbok Aug 14 '19 at 19:43
@melbok did you add `sp` preservation? Like reserve 24bit of memory space somewhere in data section (not sure what your assembler syntax is, maybe `OldSp: ds 3` ?), and in the code section ahead of `ld sp,...` do first `ld (OldSp),sp` and after the fill `ld sp,(OldSp)` to restore it back. (like do you understand what is `sp` and how "abused" it's "usual" purpose is in this memory filler?) (probably, not, as you did add `call _GetKey` straight after it :) ... before `call` you need already `sp` restored, otherwise you are overwriting memory under VRAM, if there is any writeable at all) – Ped7g Aug 14 '19 at 20:46
@melbok yes, *call* and, even more importantly, *ret* instructions implicitly use SP. *ret* fetches the return address from the stack, so unless you restore the correct value of stack pointer you'll eventually be in trouble. I did put a disclaimer at the end of my answer. – tum_ Aug 14 '19 at 22:03
@tum_ if he's just starting with assembly, no amount of disclaimers will help, there are so many new things and details, I can imagine it being quite overwhelming... also newcomers to assembly, if they did some programming in higher level language before, often don't realize the level of precision required in assembly, and simply skip some word or two, or even whole disclaimer, because "can't be that important, right?"... :D ... So it was kinda nasty from both of us to not provide full working code including the `sp` preservation... then again the OP didn't even specify assembler (syntax) :) – Ped7g Aug 14 '19 at 22:53

tum_ · Answer 4 · 2019-08-20T14:58:47.913

See Update 3 at the bottom.

In addition to @harold's answer: if there's a need for a faster alternative, a well-known trick with PUSH can be used.

I'm not familiar with the TI-84; the stack trick might be unacceptable on some systems or require interrupts to be disabled. And of course you are supposed to save the SP before any such code and restore it afterwards.

Update 3: Removed my code snippet as it was incorrect for eZ80 anyway. However, thanks to the links provided by @DrDnar, here is someone's (not mine! :)) attempt to push the performance to the limit (yes, I'm aware that filling with $55 is not the same as alternating between 31 and 0):

Code:

FastClr:
        ld      de,$555555      ; will write byte 85 (= blue color)
        or      a
        sbc     hl,hl
        ld      b,217
        di
        add     hl,sp           ; saves SP in HL
        ld      sp,vram+76818   ; for best optimisation , we'll write 18 extra bytes
ClrLp:  .fill 118,$d5           ;       = 118 * "PUSH DE"
        djnz    ClrLp           ; during 217 times
        ld      sp,hl           ; restore SP
        ei

16+4+8+8+4+4+16+217*(118*10+13)-5+4+4=258944 States !!! ;D (the classic LDIR takes about 537600 states)

Source: Cemetech. There are more (allegedly faster) examples there.

This at least raises certain doubts regarding the claim that LDIR is the fastest option, so I would be interested in @DrDnar's comments.

Note: I'm not saying the claim is wrong, as I'm not in a position to test any of this and see for myself. I've noticed that the author of the above code, although they mention "TI83PCE/TI84+CE" in the original post, performed the actual measurements on a TI83PCE only, and this might be important.

Also, the addresses and the size used in the code are not the same as in the OP's code, and the "8bpp mode" is mentioned, which again tells me very little but the OP does not mention any particular mode.

Update 4: The link provided by @iPhoenix contains loads of info on the TI-84+CE, including the LCD Controller details. This page explains, among other things, why '8bpp mode' has been specifically mentioned by the author of that code above:

When the LCD is in 8bpp mode, data written to VRAM will act as an 8-bit index to the LCD's 256x16-bit Color Palette. Note that the colour palette must be initialized prior to setting this mode or you will receive unexpected results. (See LCDPalette register - 0x200 for information on the Color Palette). This will effectively halve the amount of VRAM required to store a full resolution 320x240 image (76800 bytes vs 153600 bytes). The extra 76800 bytes of VRAM could be used to double buffer or for temporary data storage. Note that the TIOS will not be usable in this mode, it expects 16bpp 5:6:5 mode at all times.

In other words: rather than filling the 153600 bytes of VRAM with 31, 0 (i.e. 0x1F, 0x00; a 16bpp colour, presumably), the OP could fill half of VRAM with a single byte value XY and configure the Color Palette (prior to the actual filling) so that the XY value maps to the desired 16bpp colour, and thus achieve the same result.

With this approach any "inconveniences" of alternating between 31 and 00 just go away naturally.

that's kinda funny, that you go for performance by replacing regular write with `push`, and then you do that `dec bc` thing... If you are truly going for performance, you should unroll the `push hl` at least few times, for example 4x, that's 8 bytes, so then you need 0x4B00 repetitions, and those can be achieved like init part: `ld bc,$004B` and loop body: `j1: 4x push hl` `djnz j1` `dec c` `jr nz,j1` — Ped7g, Aug 14 '19 at 08:52
@Ped7g Sure, the goal was to show an example. Speed vs size, loop unrolling techniques are pretty obvious. Feel free to post your answer, let's see how many T-states per byte you achieve :) (I'm too lazy to decipher the unformatted code in the comment). — tum_, Aug 14 '19 at 12:01
ok, I posted it as full answer, although it's just variation of yours.. also I did notice you use JP vs JR optimization, but that is Z80 thing, not eZ80. — Ped7g, Aug 14 '19 at 12:22
The 83PCE runs the same processor as the 84+CE, and the hardware is mostly the same except for a little testing led (and some extra minor things). See http://wikiti.brandonw.net/index.php?title=Category:83PCE:OS_Information — iPhoenix, Aug 20 '19 at 11:23
[Wiki article](https://en.m.wikipedia.org/wiki/TI-84_Plus_series) says: "In 2016, the TI-84 Plus CE-T was released for the European educational market. The only significant difference from the CE model is the addition of an LED that blinks while the calculator is in Press-to-Test mode." — tum_, Aug 20 '19 at 13:47
Ah, you're right, the stack hack is still faster than LDIR, at under 1.4 cycles per byte. As Ped7g pointed out, you can alternate pushing two 24-bit registers to fill a 16-bit value. I still don't recommend it unless you're an expert in assembly optimization. Check out [this code in our C SDK](https://github.com/CE-Programming/toolchain/blob/master/src/graphx/graphx.asm#L480) which copies some temporary code to a location in a memory-mapped peripheral that the CPU can access at just 2 cycles-per-byte instead of the normal 4 cycles-per-byte. (Note the use of 8-bit color mode.) — DrDnar, Aug 20 '19 at 15:01
@DrDnar [edited] "I still don't recommend it unless you're an expert in assembly optimization." - you'd never become one if you'd followed recommendations like this :) The code under the link is pretty impressive. Some bits look quite unusual for a non-eZ80 person but generally the tricks are pretty much the same as they were back in the '80s in Spectrum times. — tum_, Aug 21 '19 at 08:22
