assembly 68k - clear starting from address efficiently

Question

This is my snippet to clear 600 bytes starting at the SCREEN address.

    lea     SCREEN,a3
    move.w  #(600/4)-1,d0   ; bytes / 4 bytes (long)
clear_data:
    clr.l   (a3)+
    dbra    d0,clear_data

This works, but I wonder how to achieve the same result without looping 600/4 times. I imagine pointing directly at SCREEN and doing something like

; point PC to SCREEN ?
dcb.b 600,0

Is it possible?

EDIT POST ANSWER

Still in software, this unrolled loop (taken from the RamJam course) is faster. Going by the 68000 timings in vogomatix's answer, it needs roughly 21 cycles per long word instead of 30:

    lea SCREEN,a3
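    move.w  #(600/32)-1,d0   ; 18 passes of 32 bytes (long*8) = 576 bytes
clear_data:
    clr.l    (a3)+
    clr.l    (a3)+
    clr.l    (a3)+
    clr.l    (a3)+
    clr.l    (a3)+
    clr.l    (a3)+
    clr.l    (a3)+
    clr.l    (a3)+
    dbra d0,clear_data
    moveq   #6-1,d0          ; 600 is not a multiple of 32: 24 bytes (6 longs) remain
clear_tail:
    clr.l    (a3)+
    dbra d0,clear_tail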

However, as Peter mentioned in his answer, using a blitter (if the hardware provides one) can drastically improve performance.

I'm not very familiar with M68K, but I think you want `dbnz` for your loop (`dbra` would be an infinite loop), and that this is more efficient than you might think unless you're using the original 68000 chip (later models added a special case for `dbXX` with a single instruction inside the loop). — jasonharper, Jun 28 '18 at 18:42
DBRA is *not* an infinite loop; it terminates when the register concerned reaches -1. DBxx terminates when the loop counter reaches -1 OR the condition is true — vogomatix, Sep 06 '18 at 18:48

vogomatix · Answer 1 · 2018-09-06T09:01:06.353

Use MOVEM to really burn through setting memory!

I recommend you don't use CLR.L; if you look at the clock timings you'll find it is quite inefficient. It is better to load a register with the value you want to set memory to and then use MOVE.L Dn,(A0)+.

However, for extreme rapidity, use MOVEM.L to set/clear large areas of memory. It is 2 to 3 times faster than using CLR or standard MOVE.L

Here's an example subroutine that sets memory in 64-byte blocks, then sets any remaining long words. It can be customised.

         ORG     $2000
         MOVE.L  #MEMSTART,A0        ; memory to clear
         MOVE.L  #ZEROS,A1           ; value to set memory to e.g. 0
         MOVE.L  #600,D7             ; number of bytes
         BSR     SETBLOCK
         STOP    #$2700

SETBLOCK
         ; MOVEM doesn't support destination = (Ax)+, 
         ; does support destination = -(Ax)
         ADD.L   D7,A0               ; so start at end

         LSR.L   #2,D7               ; divide by 4 for Long words.
         MOVE.L  D7,D6
         LSR.L   #4,D6               ; # of 16 longword blocks 
         BEQ.S   NOBLOCK             ; branch if none
         SUBQ.L  #1,D6               ; one less so DBRA works
         MOVEM.L (A1),D0-D4/A2-A4    ; 8 registers = 32 bytes 

ZAPBLOCK MOVEM.L D0-D4/A2-A4,-(A0)   ; 8 x 4 = 32 bytes
         MOVEM.L D0-D4/A2-A4,-(A0)   ; 8 x 4 again for 64 bytes
         DBRA    D6,ZAPBLOCK         ; loop ends when D6=-1
NOBLOCK  AND.W   #$0F,D7             ; how many long words left
         BEQ.S   NONE
         ; do any remainder
         SUBQ.W  #1,D7               ; 1 less so DBRA works
         MOVE.L  (A1),D0             ; pattern in D0 if not already there
ZAP      MOVE.L  D0,-(A0)            ; set memory long word at a time
         DBRA    D7,ZAP
NONE
         RTS

ZEROS    DC.L    0,0,0,0,0,0,0,0      ; 8x4 = 32
         ORG     $2500
MEMSTART DS.B    600

This example uses D0-D4 and A2-A4 to get 8 registers to set 32 bytes at a time, repeated twice for 64 bytes. There's no reason why you can't add more MOVEM instructions to the ZAPBLOCK loop to write 128, 256 or more bytes per loop iteration, changing the LSR/AND instructions accordingly.

Note that DBRA only operates on the low word of the counter, so this will set at most 65536 blocks (64K x the block size). This can be fixed, for example, by using SUBQ and BGT instead of DBRA.
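A minimal sketch of that SUBQ/BGT variant, assuming the same register conventions as SETBLOCK above: A0 points just past the end of the area and D0-D4/A2-A4 already hold the fill pattern. Here D6 holds the full 32-bit block count itself rather than count-1, and the label ZAPBIG is only illustrative.

```
; Assumptions: D6.L = number of 64-byte blocks (1 or more), NOT count-1,
; A0 = address just past the end of the area,
; D0-D4/A2-A4 = fill pattern, loaded as in SETBLOCK.
ZAPBIG   MOVEM.L D0-D4/A2-A4,-(A0)   ; 8 x 4 = 32 bytes
         MOVEM.L D0-D4/A2-A4,-(A0)   ; 32 more, 64 bytes per pass
         SUBQ.L  #1,D6               ; 32-bit decrement, sets the flags
         BGT.S   ZAPBIG              ; loop while the count is still > 0
```

The SUBQ.L/BGT pair costs a few more cycles per pass than DBRA, which matters little when each pass writes 64 bytes or more.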

As I recall, the CLR instruction does a read as well as a write on some 68k's, the original 68000 among them, which fits its slow timings.

Timing

Comparing 3 alternatives, assuming a standard 68000 with a 16 bit data bus...

Using CLR

LOOP:
    CLR.L  (A0)+       ; 12+8
    DBRA   D7,LOOP     ; 10/14

30 cycles per long word, approaching 20 per long word when the loop is unrolled with multiple clears.

Using MOVE.L

    MOVEQ #0,D0        ; 4
LOOP:
    MOVE.L D0,(A0)+    ; 12
    DBRA   D7,LOOP     ; 10/14

22 cycles per long word, approaching 12 per long word when the loop is unrolled with multiple MOVE.L operations.
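As an illustration, here is a sketch of an unrolled MOVE.L loop for the 600 bytes from the question. It assumes SCREEN labels the area to clear, as in the question; the label fill_loop and the unroll factor of 10 are choices made here because 600 bytes is exactly 15 passes of 40 bytes, so no remainder loop is needed.

```
        lea     SCREEN,a0           ; start of the area to clear
        moveq   #0,d0               ; fill value
        moveq   #(600/40)-1,d7      ; 15 passes of 10 long words (40 bytes)
fill_loop:
        move.l  d0,(a0)+            ; 12 cycles each on a 68000
        move.l  d0,(a0)+
        move.l  d0,(a0)+
        move.l  d0,(a0)+
        move.l  d0,(a0)+
        move.l  d0,(a0)+
        move.l  d0,(a0)+
        move.l  d0,(a0)+
        move.l  d0,(a0)+
        move.l  d0,(a0)+
        dbra    d7,fill_loop        ; 10 cycles while looping
```

By the figures above, each pass costs 10 x 12 + 10 = 130 cycles, or 13 cycles per long word.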

Using MOVEM.L

LOOP:
    MOVEM.L  D0-D4/A2-A4,-(A0)    ;  8+8*8 = 72
    MOVEM.L  D0-D4/A2-A4,-(A0)    ;  8+8*8 = 72
    DBRA     D6,LOOP              ;  10/14

154 cycles per iteration, but only around 9.6 cycles per long word (154/16). This is probably competitive with the performance of a hardware blitter.

https://hackaday.io/project/6150-beckman-du600-reverse-engineering/log/62868-fast-68000-block-memory-move-routine-comparison has a couple of real-world performance numbers for using move-multiple (`movem.l`). — Peter Cordes, Aug 29 '18 at 16:50
A long time ago in a galaxy far far away I rewrote the Atari ST OS to use tricks like this :) — vogomatix, Aug 29 '18 at 17:57

score 2 · Accepted Answer · answered Jun 28 '18 at 18:40

No, storing 4 bytes at a time in a loop is probably about as good as you can get. Maybe unrolling a bit to reduce loop overhead, if that tight loop doesn't max out memory bandwidth on whatever m68k hardware you care about. Or maybe not: @jasonharper comments that later m68k chips have special support for 2-instruction loops.

dcb.b 600,0 is an assemble-time thing which assembles bytes into your output file.

You can't "run" it at runtime. Remember that asm source doesn't run directly; it's a way to create binary files containing m68k machine code and/or data.

You can use data directives mixed with instructions to "manually" encode instructions by specifying the machine-code bytes you want, but 600 bytes of zeros will just decode as some m68k instructions. (I didn't check how 00 00 decodes on m68k.)
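A small sketch of the difference between instructions and data directives. The labels start and buffer are hypothetical; $4E71 and $4E75 are the 68000 encodings of NOP and RTS.

```
start:
        nop                     ; the assembler emits the word $4E71
        dc.w    $4E71           ; the same word written as data, runs as another NOP
        rts                     ; $4E75, returns before reaching the data below

buffer:
        dcb.b   600,0           ; 600 zero bytes stored in the output file
```

The zeros at buffer are written once, when the file is assembled. Nothing clears that memory again at run time, so a program that dirties the buffer still needs a loop (or hardware) to reset it.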

Some computers based on m68k had hardware chips for doing stuff to big blocks of memory. This was typically called a blitter chip (Wikipedia). e.g. some Atari m68k desktops, like the Mega STe, had a BLiTTER chip.

You could run a few instructions on the CPU to program the blitter to clear or copy a big block of memory while the CPU went on to run other instructions. This is basically a DMA copy engine.
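Since the asker mentions being on an Amiga, here is a rough sketch of what programming its blitter for a clear can look like. Everything in it is an assumption beyond the original answer: it uses the Amiga custom chip register offsets, and it assumes SCREEN lies in chip RAM, blitter DMA is already enabled, and the code has exclusive use of the blitter. The label wait_blit is illustrative.

```
; Assumptions: Amiga custom chips, SCREEN in chip RAM,
; blitter DMA enabled, exclusive access to the blitter.
CUSTOM   equ     $dff000
DMACONR  equ     $002               ; bit 14 = blitter busy
BLTCON0  equ     $040
BLTCON1  equ     $042
BLTDPTH  equ     $054
BLTSIZE  equ     $058
BLTDMOD  equ     $066

        lea     CUSTOM,a6
wait_blit:
        btst    #6,DMACONR(a6)      ; bit 6 of the high byte = busy flag
        bne.s   wait_blit           ; wait for any earlier blit to finish
        move.w  #$0100,BLTCON0(a6)  ; destination channel only, minterm 0 = zeros
        move.w  #0,BLTCON1(a6)
        move.w  #0,BLTDMOD(a6)      ; no gap between rows
        move.l  #SCREEN,BLTDPTH(a6) ; destination pointer, high and low words
        move.w  #10*64+30,BLTSIZE(a6) ; 10 rows x 30 words = 600 bytes, starts the blit
```

The CPU can carry on with other work after the write to BLTSIZE, but it has to poll the busy flag again before touching the cleared memory or starting another blit.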

Since I'm on Amiga, the blitter looks good! But I still have to get into it — Fabrizio Stellato, Jun 28 '18 at 19:05

score 1 · Answer 3 · answered Jun 21 '20 at 20:14

Vogomatix's method 3 is indeed quick, but not by as much as claimed. The example omits the time needed to load the registers initially, which is quite significant.

You must add a `moveq #0,Dn` for each of four data registers (16 cycles), then load four address registers at 12 cycles each (48 cycles), so 64 cycles of setup for eight registers.

At 218 cycles for one iteration including setup, that's ~13.6 cycles per long word; not quite as good as solution 2.

The fastest approach I have found is to use as many registers as you can (typically 13) and then:

    movem.l (a5),d0-d7/a0-a4    ; load 13 longs: 12+8n = 116 cycles (a5 stands in for An)
    movem.l d0-d7/a0-a4,-(a6)   ; store 13 longs: 8+8n = 112 cycles; predecrement so repeats advance

The register-to-memory line (112 cycles) can then be repeated for as many writes as required. The cycles-per-long figure doesn't catch up with the previous examples until at least 48 long words are written per loop iteration, by which point we are just under 12 cycles per long. For 128 longs, for example, we get about 1300 cycles, or around 10.2 cycles per long. The more writes there are, the further the setup cost is amortized, and the average slowly approaches the theoretical limit of (8+8n)/n cycles per long, about 8.6 for n = 13.

You can push efficiency further by using 15 or even all 16 registers, but with 15 the data flow becomes harder to control, and using all 16 needs tricky workarounds because A7 is the stack pointer.
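A sketch of the 13-register approach applied to the 600 bytes from the question, assuming SCREEN labels that area. It zeroes the registers directly instead of loading them from memory, uses A6 as the (arbitrarily chosen) destination pointer, and clobbers D0-D7, A0-A4 and A6. Since 600 = 11 x 52 + 28, eleven full stores are followed by one 7-register store.

```
        lea     SCREEN+600,a6       ; predecrement stores run down from the end
        moveq   #0,d0               ; zero the 8 data registers
        moveq   #0,d1
        moveq   #0,d2
        moveq   #0,d3
        moveq   #0,d4
        moveq   #0,d5
        moveq   #0,d6
        moveq   #0,d7
        movea.l d0,a0               ; and 5 address registers
        movea.l d0,a1
        movea.l d0,a2
        movea.l d0,a3
        movea.l d0,a4
        movem.l d0-d7/a0-a4,-(a6)   ; 13 longs = 52 bytes per store
        movem.l d0-d7/a0-a4,-(a6)
        movem.l d0-d7/a0-a4,-(a6)
        movem.l d0-d7/a0-a4,-(a6)
        movem.l d0-d7/a0-a4,-(a6)
        movem.l d0-d7/a0-a4,-(a6)
        movem.l d0-d7/a0-a4,-(a6)
        movem.l d0-d7/a0-a4,-(a6)
        movem.l d0-d7/a0-a4,-(a6)
        movem.l d0-d7/a0-a4,-(a6)
        movem.l d0-d7/a0-a4,-(a6)   ; 11 stores = 572 bytes
        movem.l d0-d6,-(a6)         ; last 7 longs = 28 bytes, a6 is now SCREEN
```

Fully unrolling like this avoids needing a loop counter, which would otherwise take one of the 13 registers away from the fill.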

218 cyc/iteration isn't a useful way to state the overhead; that only applies if you redo the setup inside the loop every iteration. Or if you fix the iteration count at 1, in which case just say that instead of calling it "per iteration". As you discuss later, the useful thing is amortized overhead and the break-even point vs. a simpler loop using fewer registers. — Peter Cordes, Jun 21 '20 at 20:39
