
I have a task:

Given an array of 7 bytes. Representing it as an array of eight 7-bit words, count the number of words that contain an odd number of zeros.

Here's my code so far. I have a code sample that checks whether the number of 1's is odd, but I can't understand how to check for an odd number of zeros in a word. How can I do it?

data segment
result dw ?
NB dw 04h, 07h, 14h, 23h, 04h, 38h, 3Fh, 2Ah
data ends

code segment
assume cs: code, ds: data
start: mov ax, data
mov ds, ax 
lea bx, NB
mov cx, 7

BEG:        
mov al, [bx]
test al, 0ffh
jp OK 
or add result, 1


loop BEG

quit: mov ax, 4C00h
int 21h
exit: ret
code ends
end start
Peter Cordes
awayneff
  • The question is nonsense -- an array of 7 bytes is not an array of 8 7-byte words, which would be 56 bytes. – Chris Dodd Mar 08 '22 at 07:27
  • Yes, I made a mistake, an array of 8 7-bit words – awayneff Mar 08 '22 at 07:28
  • https://www.felixcloutier.com/x86/popcnt Is this a valid approach? – xiver77 Mar 08 '22 at 09:53
  • @xiver77: I assume they want you to look at PF (the parity flag) after getting the 7 bits you want into a byte; it's set according to the low 8 bits of the result. Since the number of 1s and number of zeros are related, so is the even/oddness of their count. This code is actually already using PF to check for an even number of 1 bits. – Peter Cordes Mar 08 '22 at 10:06
  • @PeterCordes Oh, thanks for reminding me of the PF. – xiver77 Mar 08 '22 at 10:08

1 Answer


Given the array of 7 bytes. Representing it as an array of 8 7-bit words, count the number of words with an odd number of zeros in them.

Your current solution misinterprets the task description!
As it stands, you have given the array 8 word-sized elements, whereas the task wants you to deal with 8 elements that are 7 bits wide, and importantly those 8 elements have to be packed together in just 7 bytes of memory!
The array is in effect a bitstring of 56 bits (8*7).
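To make the packing concrete, here is a small C sketch (not part of the original answer; the `get7` helper and the sample bytes are purely illustrative) that extracts packed 7-bit word *i* from such a 7-byte bitstring, numbering bits from the least significant bit of the first byte:

```c
#include <stdint.h>

/* Hypothetical helper: extract packed 7-bit word i (bits 7*i .. 7*i+6,
 * least-significant bit first) from a 7-byte bitstring. */
static unsigned get7(const uint8_t a[7], int i)
{
    unsigned bitpos = 7u * i;
    unsigned lo = a[bitpos / 8];                              /* byte holding the low bits  */
    unsigned hi = (bitpos / 8 + 1 < 7) ? a[bitpos / 8 + 1] : 0; /* next byte, or 0 past the end */
    return ((lo | (hi << 8)) >> (bitpos % 8)) & 0x7F;         /* shift down, keep 7 bits */
}
```

Groups 1 through 6 straddle two bytes, which is why the helper assembles a 16-bit value before shifting; group 0 and group 7 each fit against a byte boundary on one side.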

What does it mean to 'count the number of words with an odd number of zeros in them'?
'Word' here refers to a group of 7 bits; confusing, I know.
The snippet below uses the bt (BitTest) instruction to test whether a certain bit in a string of bits is ON (CF=1) or OFF (CF=0).

  xor bx, bx      ; Indexing the bitstring, BX=[0,55]
  mov [result], bl
OuterLoop:
  mov cl, 7
  xor ax, ax
InnerLoop:
  bt  [array], bx   ; Sets CF if bit is ON
  cmc               ; Makes CF=1 if bit is OFF
  adc al, 0         ; Tallies the OFF bits
  inc bx
  dec cl
  jnz InnerLoop
  and al, 1         ; Becomes 1 if tally is odd
  add [result], al
  cmp bx, 56        ; {7, 14, 21, 28, 35, 42, 49, 56}
  jb  OuterLoop
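A C rendering of the same loop may help follow it (a sketch only, not part of the original answer; `odd_zero_groups` is a made-up name). It walks all 56 bits, tallies the OFF bits in each group of 7, and adds the tally's lowest bit to the result:

```c
#include <stdint.h>

/* Count the 7-bit groups that contain an odd number of zero bits,
 * testing one bit at a time like the bt/cmc/adc loop above. */
int odd_zero_groups(const uint8_t a[7])
{
    int result = 0;
    for (int bit = 0; bit < 56; ) {             /* BX in the snippet   */
        int zeros = 0;                          /* AL in the snippet   */
        for (int i = 0; i < 7; i++, bit++)      /* CL counts down 7    */
            zeros += !((a[bit / 8] >> (bit % 8)) & 1); /* bt + cmc + adc */
        result += zeros & 1;                    /* and al, 1 / add     */
    }
    return result;
}
```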

The above code can easily be written without the cmc instruction (see Peter's comment).
Doing so reveals a substantial speed gain. On my computer, executing the code a thousand times made the execution time drop from 597 µsec to 527 µsec (11% better).
This version tallies the inverse condition and subtracts (the lowest bit of) the tally from the maximum achievable result.

  xor bx, bx      ; Indexing the bitstring, BX=[0,55]
  mov byte [result], 8
OuterLoop:
  mov cl, 7
  xor ax, ax
InnerLoop:
  bt  [array], bx   ; Sets CF if bit is ON
  adc al, 0         ; Tallies the ON bits
  inc bx
  dec cl
  jnz InnerLoop
  and al, 1         ; Becomes 1 if tally is odd
  sub [result], al
  cmp bx, 56        ; {7, 14, 21, 28, 35, 42, 49, 56}
  jb  OuterLoop
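The inverse tally works because, in a group of 7 bits, an odd number of ones means an even number of zeros. A C sketch of the transformation (again illustrative only): start from the maximum achievable result of 8 and subtract 1 for every group whose ONE-bit count is odd:

```c
#include <stdint.h>

/* Same answer via the inverse condition: 8 minus the number of groups
 * with an odd count of ONE bits equals the number of groups with an
 * odd count of ZERO bits (since ones + zeros = 7, an odd number). */
int odd_zero_groups_inv(const uint8_t a[7])
{
    int result = 8;                             /* mov byte [result], 8 */
    for (int bit = 0; bit < 56; ) {
        int ones = 0;
        for (int i = 0; i < 7; i++, bit++)
            ones += (a[bit / 8] >> (bit % 8)) & 1;  /* bt + adc: tally ON bits */
        result -= ones & 1;                     /* and al, 1 / sub */
    }
    return result;
}
```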

An alternative solution extracts 7 consecutive bits from the bitstring into a byte and (just like you did) inspects the parity flag PF. An odd number of zeros in a group of 7 bits means there is a complementary even number of ones (PF=1). This solution is more than 4 times faster than the above snippets!

Next is an implementation of this:

result  db   0                                      ; (*)
array   db   12h, 34h, 56h, 78h, 9Ah, 0BCh, 0DEh
        db   0                                      ; (*)
...

        lea  bx, [array-1]       ; Start before the array
        mov  byte [bx], 0        ; Result=0
        mov  cl, 8               ; 8 elements
NextGroupOf7:
        mov  ax, [bx]
        shr  ax, cl
        and  al, 7Fh             ; Keep just 7 bits and define parity flag
        setp al                  ; If parity then AL=1 else AL=0
        add  [result], al
        inc  bx
        dec  cl
        jnz  NextGroupOf7

(*) The above code reads and writes one byte before the array (the result byte) and reads one byte after the array. We have to make sure that both locations are addressable.

At the expense of increased code size, we could split off the first and last iterations and not read outside of the array at all.
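In C terms, the parity approach looks roughly like this (a sketch; `__builtin_popcount` is a GCC/Clang builtin standing in for the hardware's PF, which is set when the low byte holds an even number of ONE bits):

```c
#include <stdint.h>

/* Parity-flag approach: extract each 7-bit group into a byte, then
 * count it when its population count is even, i.e. PF=1, which for
 * 7 bits means an odd number of zeros. */
int odd_zero_groups_pf(const uint8_t a[7])
{
    int result = 0;
    for (int i = 0; i < 8; i++) {
        unsigned bitpos = 7u * i;
        unsigned lo = a[bitpos / 8];
        unsigned hi = (bitpos / 8 + 1 < 7) ? a[bitpos / 8 + 1] : 0;
        unsigned word = ((lo | (hi << 8)) >> (bitpos % 8)) & 0x7F; /* and al, 7Fh */
        result += !(__builtin_popcount(word) & 1);                 /* setp al / add */
    }
    return result;
}
```

Unlike the assembly, this sketch never touches memory outside the 7-byte array, at the cost of the extra bounds test per group.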


In case this has to be 8086 code, you cannot use the bt or setp instructions. Then the following change to the 3rd snippet will work:

From:

    setp al                  ; If parity then AL=1 else AL=0
    add  [result], al
    inc  bx

to:

    jnp  .a
    inc  byte [result]
.a: inc  bx

PS: Don't use this jnp on anything better than the 8086, as it will make the loop more than 10 times slower!

Sep Roland
  • `adc al, ah` could be a more idiomatic `adc al, 0`, right? Doesn't even cost any extra code-size, unless your assembler avoids that special-case 2-byte encoding which is [slower on Sandybridge-family](https://stackoverflow.com/questions/51664369/which-intel-microarchitecture-introduced-the-adc-reg-0-single-uop-special-case). Or if you're optimizing, save the CMC by doing `sbb al, -1` (increase by 1, unless CF is set then don't.) But that's obviously harder to follow. Or just count ones and at the end do zeros = total - ones. – Peter Cordes Mar 10 '22 at 11:48
  • @PeterCordes I think the `adc al, ah` was a leftover from a draft where I used the `DL` register instead. I have removed the `cmc` instruction from the loop and got it to execute much faster. Thank you. – Sep Roland Mar 13 '22 at 14:43
  • If you care about performance, avoid `bt [mem], reg` and do the byte/word indexing manually. (Your `setp` version is that plus taking it a step further, avoiding looping over bits entirely.) The crazy-CISC bitstring semantics where `bt` can index outside the word addresses by the addressing mode are why it takes 10 uops on Skylake vs. 2 for `bt [mem], imm8`, or 1 for `bt reg,reg`. In your case you're doing contiguous ranges of bits in some words, so you *could* manually index once with `array + (bx >> 3)` and then loop over the range of bits you want. (Or `add reg,reg` to set CF, `adc r,0` – Peter Cordes Mar 13 '22 at 14:55
  • But I guess it's interesting to try to optimize other parts of the loop while still keeping the simplicity of `bt`'s bit-string indexing; any more serious optimization makes it silly to not just keep going and use PF. (on 8086, that can be done with `lahf` / `and ah, 1<<2` / `add result, ah`. Then right-shift the result once at the end, assuming you can do without 2 extra bits of range.) `result` could be a register to avoid store-forwarding latency inside the loop for every 7-bit chunk. – Peter Cordes Mar 13 '22 at 14:55
  • I'm curious how fast the other versions run on your machine. Also, what CPU did you test on? Zen2 and later have 0-latency store forwarding when the same addressing mode is used, which could make `add [result], al` not a latency bottleneck. (https://www.agner.org/forum/viewtopic.php?t=41) – Peter Cordes Mar 13 '22 at 15:06
  • On actual 8086, it's likely that `jnp` over an `inc` is better than `lahf`/`and`/`add` (plus shift outside the loop). That would be an 8086-compatible way that's ok on modern CPUs, not optimal for either. – Peter Cordes Mar 13 '22 at 15:07
  • If you have truly modern CPUs, `pdep` with a `0x7f7f7f...` mask should unpack 7-bit groups to separate bytes. Or AVX-512 VBMI `vpmultishiftqb` to do the same thing within 64-bit chunks of a vector. (Would need a `vpermb` get 7 source bytes into each separate qword to make room for the padding). AVX-512 BITALG (Ice Lake) `VPOPCNTB` to popcount each byte separately, mask that to 0/1, then do the same for the next 56 bytes of input and `vpaddb` in an inner loop. At the end, or before bytes would overflow, horizontal add of unsigned bytes using `vpsadbw` against zero and hsum qwords. – Peter Cordes Mar 13 '22 at 15:15
  • Or for a single vector, `vptestmb` instead of `vpandd` on the popcounts to get the low bits into a mask register, for `kmov rax, k0` / `popcnt rax,rax` to sum 64 parities. Total of 6 instructions, I think: `vpermb` -> `vpmultishiftqb` -> `vpopcntb` -> `vptestmb` -> `kmov` -> `popcnt`. Plus loading a shuffle mask and so on, and maybe setting a mask register if the input data length is odd, so you can't just `vmovq xmm0, [array]`. e.g. `vmovdqu8 zmm0{k1}{z}, [array]` can load any number of bytes from 0 to 64, depending on k1, zero-masking the bytes you don't load. (Can also zmask vpopcntb) – Peter Cordes Mar 13 '22 at 15:22
  • @PeterCordes (had to look this up) The CPU is "Intel Pentium dual-core processor T2080 / 1.73 GHz, 533 MHz FSB, 1MB level 2 cache". The execution times of all 4 snippets were: 597 µsec, 527 µsec, 136 µsec, and 1510 µsec. – Sep Roland Mar 13 '22 at 15:42
  • @PeterCordes For that `lahf` / `and` / `add` trick that you mention, shouldn't the result get corrected by right-shifting **twice**? You mentioned "Then right-shift the result once at the end" Oh wait, perhaps you meant "once (arrived) at the end"? – Sep Roland Mar 13 '22 at 15:49
  • Ok, wow, that's an old CPU. The generation before Core2, https://en.wikipedia.org/wiki/Yonah_(microprocessor), so it's 32-bit only, Pentium M / Core P6-family. Partial-register stalls are possible on that CPU, but I don't think your code has any. I think writing AX actually gets renamed separately from the full EAX on CPUs that old, so there isn't even a false dependency on `mov ax, [bx]`, and `shr reg,cl` is single-uop. – Peter Cordes Mar 13 '22 at 15:50
  • Yes, I meant having one `shr` instruction execute at the end, with a count of 2. (Or two shr-by-1 on 8086.) I didn't notice the ambiguity when I was writing it :/ – Peter Cordes Mar 13 '22 at 15:51
  • `and al, 7Fh` can be `test al, 7Fh` to set PF according to those bits. You're about to overwrite AL anyway, and on later CPUs `setp al` has a dependency on the old value of AL. But it also has a dependency on FLAGS, so there's no latency or code-size saving, probably no benefit. Still, we don't need the AL result so it's arguably more idiomatic. (We can't avoid the shift, unfortunately; PF is only set according to the low 8 bits, so `test ax, si` with 7 bits somewhere in SI (`shl si,1`) wouldn't work.) – Peter Cordes Mar 13 '22 at 15:59