0

I'm trying to remove all of the '0's from the beginning of a string in NASM.

mov ecx, "0001"  ; ecx now holds 0x31303030
-> code that clears the first 3 zeros so that ecx will hoad 0x31 only

How can I remove them?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Yonatan
  • 1
  • 1
  • 1
    ECX is always 32 bits wide. What do you want in the rest of the register, once you get `0x31` into the low byte? – Peter Cordes Apr 16 '21 at 05:46
  • 1
    This looks unusual. Most of the time when manipulating strings, they are in memory, not in a register. And the simplest way to remove leading zeros from a string in memory is just to return a pointer to the first non-zero character and work with that pointer as a new string. – Nate Eldredge Apr 16 '21 at 14:51

2 Answers2

1

Look at cl - if it doesn't equal '0' you're done, otherwise shift ecx one byte to the right and repeat.

remove_zeroes:
    cmp cl,'0'
    jne done
    shr ecx,8
    jmp remove_zeroes
done:

If you can spare another register, this could be improved a bit:

remove_zeroes:
    mov eax,ecx
    shr ecx,8
    cmp al,'0'
    je remove_zeroes
mov ecx,eax
Michael
  • 57,169
  • 9
  • 80
  • 125
  • That's one option, but pretty inefficient. You could put the cmp/jne at the bottom, if you put a branch outside the loop to check if SHR should run even once. – Peter Cordes Apr 16 '21 at 05:47
  • BTW, I posted an answer with a loop that checks once before falling into a do{}while loop, so you don't need a spare register, and don't do a wasted `shr` on the last iteration. Costs a little bit of extra code-size, but fewer instructions actually executed for any given input. Your way is ok if optimizing purely for code-size, my way is what a normal compiler like `gcc -O2` would do. – Peter Cordes Apr 16 '21 at 22:40
  • @PeterCordes GCC (10.3) does generate the variant with an initial check before the loop. Clang (12.0.0) generates the variant with two registers and no initial check. – Michael Apr 17 '21 at 07:59
  • Fair enough, interesting choice. https://godbolt.org/z/5PnnboKaP. I guess it could be worth saving a bit of L1i cache space and uop-cache space (and space in the binary) that way, but I still think it's most likely a missed optimization. I'm curious if this was intentional by clang devs after comparing tuning choices, or if it's a side-effect of their optimizer that they haven't looked at one way or another. – Peter Cordes Apr 17 '21 at 08:35
1

Assuming you want the high bytes to be 0 (not '0') when you're done, @Michael's loop will work even the 2nd one isn't quite as efficient as it could be. See Why are loops always compiled into "do...while" style (tail jump)? for standard efficient loop structure, i.e. peeling the compare/branch part of the first iteration so we can rotate the loop.

    cmp  cl, '0'
    jne  .done           ; low byte already not ASCII '0'
 .remove_zeros:
    shr  ecx,8
    cmp  cl,'0'
    je   .remove_zeros
 .done:

There are other interesting things you could do: branchless non-looping

For example using ctz(ecx ^ '0000') to get the position of the first bit-difference (counting from the bottom of the register), then rounding it down to a multiple of 8 bits as a shift count. Handling the case where all 4 bytes are '0' requires some care because it requires a shift-count of 32, but BSF can't produce that, and 32-bit shifts mask the count so it would wrap.

However, implementing ctz as TZCNT will produce 32 for an all-zero input, and a 64-bit operand-size for the shift will let it work, making with &63 instead of &31. (shr docs)

   mov   eax, ecx
   xor   ecx, '0000'                   ; or '000' to always keep the highest byte
     ; jz  special_case_all_zero       ; optional
   tzcnt ecx, ecx                  ; decodes as BSF on older CPUs, which is ok for non-zero ECX
   and   ecx, -8                   ; clear low 3 bits = round down to multiple of 8
   shr   rax, cl                   ; 64-bit shift lets count=32 work
; result in EAX

   ;mov   ecx, eax           ; if you need the result back in ECX

For example, with ECX = '0040' (0x30343030)

'0040' ^ '0000' = '0040' (0x00040000)
BSF / TZCNT = 18  
18 & -8 = 16  
0x30343030 >> 16 = 0x00003034 = '40' (zero-extended in a dword)

With the ECX = '0000' special case, we get TZCNT(0) = 32. But on an older CPU that decodes TZCNT as BSF (ignoring the REP prefix), it leaves the destination unmodified, so we get 0. (dst-unmodified behaviour is documented by AMD, but implemented by both Intel and AMD on their current CPUs at least.)

All-zero is a special-case anyway if you want to reduce it to '0' instead of the empty string, though. You might do that without extra branching by doing xor ecx, '000' (3 ASCII 0s with a binary 0 as the high byte). The only way you can shift by 32 is when the high byte was already 0 (not '0'). i.e. it will keep any non-zero high byte.

If you don't need TZCNT to produce 32 for an all-zero register, that makes it ok for it to decode as BSF on CPUs without BMI1; see also. So this code will work correctly on any x86-64 CPU despite being written with a BMI1 instruction. And ok to use shr eax, cl instead of rax, saving the REX prefix.


Note, it's called TZCNT because the low bits of a register are the "trailing" bits. But x86 is little-endian, so when dealing with strings, the first byte in printing order (lowest memory address) is the highest 8 bits of the integer value. The naming of BSF (bit scan forwards) does follow little-endian, though.

Picking any other input register would be more convenient for your ASCII data so it could stay in that register while we generate the shift count in ECX. Unless you have BMI2, then you could use shrx ecx, ecx, eax for example to do ecx >>= eax. It's also more efficient; only a single uop even on Intel CPU, where shr reg, cl is 3 because of legacy x86 baggage (FLAGS has to remain unmodified if the count is 0). https://uops.info/ / https://agner.org/optimize/


You could also do this with SIMD pcmpeqb xmm0, xmm1 / pmovmskb eax, xmm0 / not eax with '0000' in xmm1. That gives you a byte-compare mask (like the XOR bit-compare mask), which you can bit-scan and scale by 8 to turn into a shift count. But that's not better vs. XOR; only seems useful if you were going to do something with SIMD shuffles to handle 16 bytes at once.

(There's no variable-count SIMD byte-shift, but maybe you could use the bit-scan result to load a sliding window from an array of db 0, 1, ..., 14, 15 / times 16 db -1 to get a shuffle control vector for pshufb.)

Or use the pcmpeqb / movd eax, xmm0 result for BMI2 pext, except that will remove all zeros, not just the low ones. To set all the bits above the lowest 1, maybe blsi to isolate the low bit, then neg? That's probably more instructions and higher latency than using XOR / TZCNT / AND to get a shift count, and shifts are lower latency than PEXT even on Intel CPUs where it's 1 uop / 3c latency. (vs. being very slow and microcoded on AMD).

AVX-512VBMI2 (Ice Lake) for vpcompressb xmm1{k1}{z}, xmm2 with a mask from AVX-512BW vpcmpb k, xmm, xmm/m128, _MM_CMPINT_EQ would have the same problem of removing all zeros.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847