I'm trying to remove all of the '0'
s from the beginning of a string in NASM.
mov ecx, "0001" ; ecx now holds 0x31303030
-> code that clears the first 3 zeros so that ecx will hoad 0x31 only
How can I remove them?
I'm trying to remove all of the '0'
s from the beginning of a string in NASM.
mov ecx, "0001" ; ecx now holds 0x31303030
-> code that clears the first 3 zeros so that ecx will hoad 0x31 only
How can I remove them?
Look at cl
- if it doesn't equal '0'
you're done, otherwise shift ecx
one byte to the right and repeat.
remove_zeroes:
cmp cl,'0'
jne done
shr ecx,8
jmp remove_zeroes
done:
If you can spare another register, this could be improved a bit:
remove_zeroes:
mov eax,ecx
shr ecx,8
cmp al,'0'
je remove_zeroes
mov ecx,eax
Assuming you want the high bytes to be 0
(not '0'
) when you're done, @Michael's loop will work even the 2nd one isn't quite as efficient as it could be. See Why are loops always compiled into "do...while" style (tail jump)? for standard efficient loop structure, i.e. peeling the compare/branch part of the first iteration so we can rotate the loop.
cmp cl, '0'
jne .done ; low byte already not ASCII '0'
.remove_zeros:
shr ecx,8
cmp cl,'0'
je .remove_zeros
.done:
There are other interesting things you could do: branchless non-looping
For example using ctz(ecx ^ '0000')
to get the position of the first bit-difference (counting from the bottom of the register), then rounding it down to a multiple of 8 bits as a shift count. Handling the case where all 4 bytes are '0'
requires some care because it requires a shift-count of 32, but BSF can't produce that, and 32-bit shifts mask the count so it would wrap.
However, implementing ctz
as TZCNT will produce 32 for an all-zero input, and a 64-bit operand-size for the shift will let it work, making with &63
instead of &31
. (shr
docs)
mov eax, ecx
xor ecx, '0000' ; or '000' to always keep the highest byte
; jz special_case_all_zero ; optional
tzcnt ecx, ecx ; decodes as BSF on older CPUs, which is ok for non-zero ECX
and ecx, -8 ; clear low 3 bits = round down to multiple of 8
shr rax, cl ; 64-bit shift lets count=32 work
; result in EAX
;mov ecx, eax ; if you need the result back in ECX
For example, with ECX = '0040' (0x30343030)
'0040' ^ '0000' = '0040' (0x00040000)
BSF / TZCNT = 18
18 & -8 = 16
0x30343030 >> 16 = 0x00003034 = '40' (zero-extended in a dword)
With the ECX = '0000' special case, we get TZCNT(0) = 32. But on an older CPU that decodes TZCNT as BSF (ignoring the REP prefix), it leaves the destination unmodified, so we get 0. (dst-unmodified behaviour is documented by AMD, but implemented by both Intel and AMD on their current CPUs at least.)
All-zero is a special-case anyway if you want to reduce it to '0'
instead of the empty string, though. You might do that without extra branching by doing xor ecx, '000'
(3 ASCII 0s with a binary 0 as the high byte). The only way you can shift by 32 is when the high byte was already 0
(not '0'
). i.e. it will keep any non-zero high byte.
If you don't need TZCNT to produce 32 for an all-zero register, that makes it ok for it to decode as BSF on CPUs without BMI1; see also. So this code will work correctly on any x86-64 CPU despite being written with a BMI1 instruction. And ok to use shr eax, cl
instead of rax
, saving the REX prefix.
Note, it's called TZCNT because the low bits of a register are the "trailing" bits. But x86 is little-endian, so when dealing with strings, the first byte in printing order (lowest memory address) is the highest 8 bits of the integer value. The naming of BSF (bit scan forwards) does follow little-endian, though.
Picking any other input register would be more convenient for your ASCII data so it could stay in that register while we generate the shift count in ECX. Unless you have BMI2, then you could use shrx ecx, ecx, eax
for example to do ecx >>= eax
. It's also more efficient; only a single uop even on Intel CPU, where shr reg, cl
is 3 because of legacy x86 baggage (FLAGS has to remain unmodified if the count is 0). https://uops.info/ / https://agner.org/optimize/
You could also do this with SIMD pcmpeqb xmm0, xmm1
/ pmovmskb eax, xmm0
/ not eax
with '0000'
in xmm1. That gives you a byte-compare mask (like the XOR bit-compare mask), which you can bit-scan and scale by 8 to turn into a shift count. But that's not better vs. XOR; only seems useful if you were going to do something with SIMD shuffles to handle 16 bytes at once.
(There's no variable-count SIMD byte-shift, but maybe you could use the bit-scan result to load a sliding window from an array of db 0, 1, ..., 14, 15
/ times 16 db -1
to get a shuffle control vector for pshufb
.)
Or use the pcmpeqb
/ movd eax, xmm0
result for BMI2 pext
, except that will remove all zeros, not just the low ones. To set all the bits above the lowest 1, maybe blsi
to isolate the low bit, then neg
? That's probably more instructions and higher latency than using XOR / TZCNT / AND to get a shift count, and shifts are lower latency than PEXT even on Intel CPUs where it's 1 uop / 3c latency. (vs. being very slow and microcoded on AMD).
AVX-512VBMI2 (Ice Lake) for vpcompressb xmm1{k1}{z}, xmm2
with a mask from AVX-512BW vpcmpb k, xmm, xmm/m128, _MM_CMPINT_EQ
would have the same problem of removing all zeros.