Optimizing a bit-manipulating algorithm in GameBoy Z80

Question

This is not a homework problem, it's for a game I'm developing.

I have two 16-bit RGB colors, and would like to vary their six channels according to six other four-bit quantities. The algorithm is simple but tedious; I'm looking for a way to optimize it by doing more useful work at once.

The high-level overview:

hl points to the four color bytes. [hl] = %gggrrrrr, [hl+1] = %0bbbbbgg, [hl+2] = %GGGRRRRR, and [hl+3] = %0BBBBBGG. (That's two colors, rgb and RGB.)
bc points to the three delta bytes. [bc] = %hhhhaaaa, [bc+1] = %ddddssss, and [bc+2] = %ppppqqqq. (That's six delta values, h, a, d, s, p, and q.)
So there are six 5-bit color channel values, and six 4-bit delta values. I want to pair each color channel C with a delta value D, and vary C like so: C' = C + (D & %11) − ((D & %1100) >> 2), but keeping C within its 5-bit bounds [0, 31]. I don't actually care how they're paired: any convenient one-to-one pairing is fine. And if C + ((D & %1100) >> 2) − (D & %11) allows a more elegant algorithm somehow, I'd be okay with that.

If I isolate a color channel C in register d and a delta value D in register e, then this routine will do the variation for that pair:

VaryColorChannelByDV:
; d = color, e = DV
; a <- d + (e & %11) - (e >> 2), clamped to [0, 31]
    ld a, e
    and %11   ; a <- (e & %11)
    add d   ; a <- d + (e & %11)
    srl e
    srl e   ; e <- e >> 2
    sub e   ; a <- d + (e & %11) - (e >> 2)
    jr c, .zero   ; a < 0, clamp to 0
    cp 32
    ret c   ; 0 <= a < 32
    ld a, 31   ; a >= 32, clamp to 31
    ret
.zero
    xor a
    ret

So far I have a generic routine that applies any DV to any color channel; then three routines that isolate the red, green, or blue channels and apply a given DV to them; and finally a main routine that picks out the six DVs and calls the appropriate channel-modifying routine with them. This is "good enough", but I'm certain there's room for improvement. Execution speed doesn't seem to be a problem, but I'd like to reduce the code size (and of course removing redundant instructions will also improve speed a bit). Are there any asm bit-manipulation tricks that would help?

Here's the full code:

GetColorChannelVariedByDV:
; d = color, e = DV
; a <- d + (e & %11) - (e & %1100 >> 2), clamped to [0, 31]
    ld a, e
    and %11
    add d
    srl e
    srl e
    sub e
    jr c, .zero
    cp 32
    ret c
    ld a, 31
    ret
.zero
    xor a
    ret

VaryRedByDV:
;;; e = DV
;;; [hl+0] = gggr:rrrr
;;; [hl+1] = 0bbb:bbgg
; store red in d
    ld a, [hl]
    and %00011111
    ld d, a
; vary d according to e
    call GetColorChannelVariedByDV
; store a back in red
    ld d, a
    ld a, [hl]
    and %11100000
    or d
    ld [hl], a
    ret

VaryGreenByDV:
;;; e = DV
;;; [hl+0] = gggr:rrrr
;;; [hl+1] = 0bbb:bbgg
; store green in d
    ld a, [hli]
    and %11100000
    srl a
    swap a
    ld d, a ; d = 00000ggg
    ld a, [hld]
    and %00000011
    swap a
    srl a
    or d
    ld d, a
; vary d according to e
    call GetColorChannelVariedByDV
; store a back in green
    sla a
    swap a
    ld d, a
    and %11100000
    ld e, a
    ld a, d
    and %00000011
    ld d, a
    ld a, [hl]
    and %00011111
    or e
    ld [hli], a
    ld a, [hl]
    and %11111100
    or d
    ld [hld], a
    ret

VaryBlueByDV:
;;; e = DV
;;; [hl+0] = gggr:rrrr
;;; [hl+1] = 0bbb:bbgg
; store blue in d
    inc hl
    ld a, [hl]
    and %01111100
    srl a
    srl a
    ld d, a
; vary d according to e
    call GetColorChannelVariedByDV
; store a back in blue
    ld d, a
    sla d
    sla d
    ld a, [hl]
    and %10000011
    or d
    ld [hl], a
    dec hl
    ret

VaryColorsByDVs::
; hl = colors
; [hl+0] = gggr:rrrr
; [hl+1] = 0bbb:bbgg
; [hl+2] = GGGR:RRRR
; [hl+3] = 0BBB:BBGG
; bc = DVs
; [bc+0] = hhhh:aaaa
; [bc+1] = dddd:ssss
; [bc+2] = pppp:qqqq

;;; LiteRed ~ hDV, aka, rrrrr ~ hhhh
; store hDV in e
    ld a, [bc]
    swap a
    and %1111
    ld e, a
; vary LiteRed by e
    call VaryRedByDV

;;; LiteGrn ~ aDV, aka, ggggg ~ aaaa
; store aDV in e
    ld a, [bc]
    and %1111
    ld e, a
; vary LiteGrn by e
    call VaryGreenByDV

;;; move from h/a DV to d/s DV
    inc bc

;;; LiteBlu ~ dDV, aka, bbbbb ~ dddd
; store dDV in e
    ld a, [bc]
    swap a
    and %1111
    ld e, a
; vary LiteBlu by e
    call VaryBlueByDV

;;; Move from Lite color to Dark color
    inc hl
    inc hl

;;; DarkRed ~ sDV, aka, RRRRR ~ ssss
; store sDV in e
    ld a, [bc]
    and %1111
    ld e, a
; vary DarkRed by e
    call VaryRedByDV

;;; move from d/s DV to p/q DV
    inc bc

;;; DarkGrn ~ pDV, aka, GGGGG ~ pppp
; store pDV in e
    ld a, [bc]
    swap a
    and %1111
    ld e, a
; vary DarkGrn by e
    call VaryGreenByDV

;;; DarkBlu ~ qDV, aka, BBBBB ~ qqqq
; store qDV in e
    ld a, [bc]
    and %1111
    ld e, a
; vary DarkBlu by e
    call VaryBlueByDV

    ret

Is there a reason that the delta values (two-bits) are not added together before the operation? What I mean - when D is for example %1001, You add 1 and substract 2, which results in substracting 1. Can You represent the delta as a single 3-bit signed value instead? — Roman Hocke, May 15 '17 at 14:31
Or You can create 16-bytes long lookup table for every 4-bit delta combination, so You do not need to add and substract the two-bit pieces of the delta every time. — Roman Hocke, May 15 '17 at 14:34
Looks like an interesting, well-constructed question, so have an upvote! Unfortunately, I'm not sure how many GameBoy assembly-language experts there are here on Stack Overflow, so if you don't get the answer you're looking for after a day or two, you might consider translating the important bits of your code into a pseudocode notation so that all of us can play along, even if we don't know your native dialect. It definitely looks like there are some redundant operations in there, but I'm not comfortable enough with the syntax to be able to confidently tell. — Cody Gray - on strike, May 15 '17 at 15:23
Did you mean to saturate the 5-bit (0-31) input color components to 4 bits (0-15)? If so, are the input colors also in that range? — jacobly, May 16 '17 at 04:11
I did not; I mixed up the 5-bit colors and 4-bit deltas. I've updated the post. — Remy, May 16 '17 at 04:13

jacobly · Accepted Answer · 2017-05-16T13:03:10.677

The smallest I can come up with right now is 57 bytes:

VaryColorsByDVs::
; hl = colors
; [hl+0] = gggr:rrrr
; [hl+1] = 0bbb:bbgg
; [hl+2] = GGGR:RRRR
; [hl+3] = 0BBB:BBGG
; bc = DVs
; [bc+0] = hhhh:aaaa
; [bc+1] = dddd:ssss
; [bc+2] = pppp:qqqq
    ld a, 2 ; -floor($100/3)*6 mod $100
.next:
    sla [hl]
    inc hl
    rl [hl]
.loop:
    push af
    rrca
    ld a, [bc]
    jr nc, .skip
    swap a
    inc bc
.skip:
    rlca
    ld d, a
    and %00011000
    ld e, a
    ld a, d
    rlca
    rlca
    and %00011000
    add a, [hl]
    jr nc, .noOverflow
    or %11111000
.noOverflow:
    sub e
    jr nc, .noUnderflow
    and %00000111
.noUnderflow:
    dec hl
    ld de, 5
.rotate:
    add a, a
    rl [hl]
    adc a, d
    dec e
    jr nz, .rotate
    inc hl
    ld [hl], a
    pop af
    add a, 85 ; floor($100/3)
    jr nc, .loop
    ret z
    inc hl
    jr .next

Fixing Ped7g's comment only costs 4 bytes for a total of 61 bytes:

VaryColorsByDVs::
; hl = colors
; [hl+0] = gggr:rrrr
; [hl+1] = 0bbb:bbgg
; [hl+2] = GGGR:RRRR
; [hl+3] = 0BBB:BBGG
; bc = DVs
; [bc+0] = hhhh:aaaa
; [bc+1] = dddd:ssss
; [bc+2] = pppp:qqqq
    ld a, 2 ; -floor($100/3)*6 mod $100
.next:
    sla [hl]
    inc hl
    rl [hl]
.loop:
    push af
    rrca
    ld a, [bc]
    jr nc, .skip
    swap a
    inc bc
.skip:
    ld d, a
    and %00001100
    ld e, a
    ld a, d
    rlca
    rlca
    and %00001100
    sub e
    add a, a
    jr nc, .positive
.negative:
    add a, [hl]
    jr c, .continue
    and %00000111
    db $38 ; jr c,
.positive:
    add a, [hl]
    jr nc, .continue
    or %11111000
.continue:
    dec hl
    ld de, 5
.rotate:
    add a, a
    rl [hl]
    adc a, d
    dec e
    jr nz, .rotate
    inc hl
    ld [hl], a
    pop af
    add a, 85 ; floor($100/3)
    jr nc, .loop
    ret z
    inc hl
    jr .next

This is clamping max colour value before "-(DV>>2)", so colour 31 vs DV 0x1111 will end as 28. (just saying for OP, so he can decide whether this is ok or not). — Ped7g, May 16 '17 at 12:00
That is definitely optimized for size! Thank you! I'd be impressed if you could get it smaller, but that would be code golf, not really necessary. Are you pairing the channels and DVs the same way? It's fine if not. — Remy, May 16 '17 at 17:52
Also, is the "db $38" using the value of the subsequent instruction ($86, right?) as a relative jump amount? Very tricky. ;) Although that seems like too large a jump distance, so I think I'm misunderstanding it. — Remy, May 16 '17 at 17:58
The pairing is bgrBGR <-> ahsdqp. The "db $38" trick looks like a jr c,garbage to the cpu, but the previous instruction actually clears the carry flag, therefore it always skips the following instruction but only costs 1 byte. The same thing could also be accomplished with "db $16 ; ld d," but I prefer conditional jr opcodes because they don't destroy any registers. — jacobly, May 16 '17 at 18:35
@Remy what do you mean by "**would be** code golf", this is already golfed almost to death. — Ped7g, May 17 '17 at 02:06

score 3 · Answer 2 · answered May 15 '17 at 19:50

Hmm... you should give us more information about where those data are coming from, if you can preprocess them further, because that +(d&3)-(d>>2) looks unfortunate and I would try to avoid that, if possible. Actually the whole 5:5:5 RGB stuff is probably a bit over the head of Z80, but if you know it will work for you, go ahead (I'm talking from my ZX Spectrum experience, where 3.5MHz was hardly enough to manipulate 1 bit B&W pixels).

But for the moment, what you already got, can be a bit simplified immediately by removing two ld instructions:

VaryColorChannelByDV:
    ...
    add d
;    ld d, a   ; d <- d + (e & %11)
    srl e
    srl e
;    ld a, d   ;### A didn't change, still contains C + DV&3
    sub e   ; a <- d + (e & %11) - (e & %1100 >> 2)
    ...

And if you are not short on memory, you can create 256B look-up-table to clamp values, so for example you would keep in h or b the high address byte of the table, and the result in a would be then loaded into l or c and clamped by ld a,(hl/bc). Which is 4+7 t instead of those jr/cp/ret/.... You would actually need only some values out of those 256, from -3 to 34 (0..34 and 253..255) if I didn't miscalculate it (0 + 0 - 3 is minimum, and 31 + 3 - 0 is maximum result). So you can still use bytes at addresses "inside the page" 35..252 for other data or code.

I will try to take a look on it as a whole later, to avoid some of the per-component generic stuff if possible, but I'm afraid a better input data format would probably give you bigger boost, or to know your overall goal and all the constraints (like if the top bit in RGB is always 0 and must be 0, or can be random as result, and is 0 as input, etc... every detail can often lead to another removed instruction, which is often 4-11 t worth on Z80).

The 5:5:5 RGB format is already being used by the game I'm editing. I don't think any preprocessing is possible: the color values are taken from a table (and unvaried colors are needed sometimes too), and the delta values are meaningful data outside this routine. All I have to do for them is acquire pointers to them in hl and bc, and then call the routine. Speed is less of an issue than space; eliminating instructions is great, but I think a 256-byte LUT wastes more in space than it saves in time. — Remy, May 15 '17 at 22:56

Optimizing a bit-manipulating algorithm in GameBoy Z80

2 Answers2