3

I'm new to assemby language and I have a problem with:

int main(){
  int x = 2023; //nine 1 and two 0 as in bin it's - 11111100111
  int y;
  
  asm(
    "mov eax, %1 ;"
    "mov ecx, 32 ;"
    "mov ebx, 0 ;"
    "loop:"
      "add eax, eax ;"
      "jnc next ;"
      "inc ebx ;"
      "next :"
      "dec ecx ;"
      "jnz loop;"
    "mov %0, ebx ;"

    : "=r" (y)
    : "r"(x)
    : "eax", "ebx", "ecx"
  );

The code is compiled using -masm=intel.

I want this program to count number of 0 in binary representation of x. Right now it works on counting 1 in it but I can't figure it out what needs to be changed. I tried to change couple of functionalities but it either gives me wrong value or the value is the same.

And for clarification of given code. Right now result given by it is good because it counts 1 in 2023 but i want to reverse it somehow for it to count 0 not 1. I tried couple of methods by changing add, inc, dec etc. but somehow result was either 23 or 41 (idk how)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    Do it exactly how you count the ones (right-shift the register until it is 0) but additionally count the zero bits. But you are not using shifts: so it isn't immediately apparent what your algorithm is. – Weather Vane Mar 26 '23 at 17:43
  • It's probably helpful if you post the code that didn't work also. Of course we can say how to convert this code to count zeroes, but then you still won't really know why your attempt was wrong. – harold Mar 26 '23 at 17:46
  • Pseudo-code: `while(value != 0) { if((value & 1) == 1 ones++ else zeros++; value >>= 1 }` – Weather Vane Mar 26 '23 at 17:47
  • @WeatherVane `add eax, eax` is used as a shift (to the left) – harold Mar 26 '23 at 17:47
  • @harold I see, and the overflow bit to carry is counted. I guess you could also count the zeros that way too, in the same loop, but the answer will depend on the register size, whereas the required answer does not. – Weather Vane Mar 26 '23 at 17:49
  • @WeatherVane well, do we even know that? I would have assumed that counting the zeroes means *all* the zeroes, ie not ignoring leading zeroes, but perhaps leading zeroes are supposed to be ignored.. it's up to OP to clarify – harold Mar 26 '23 at 17:50
  • 1
    @harold OP says the answer to `11111100111` is `2`. The value has eleven significant bits. – Weather Vane Mar 26 '23 at 17:51
  • @WeatherVane that's not good enough for me, could easily be a mistake on OPs part too (assuming that numbers "end" at their highest set bit is an extremely common beginner mistake after all), I won't be posting any answer until OP clarifies – harold Mar 26 '23 at 17:57
  • @Harold what is unclear about `int x = 2023; //nine 1 and two 0`? – Weather Vane Mar 26 '23 at 17:58
  • @WeatherVane what's unclear about "it could easily be a mistake"? – harold Mar 26 '23 at 17:59
  • @Harold it is obvious from the fact that you can trivially subtract the number of ones from the register size. – Weather Vane Mar 26 '23 at 18:04
  • @WeatherVane or change `jnc` to `jc` or various other methods. So what? It's trivial *for us*, but we're experts. OP could have implemented one of these things incorrectly (there's plenty of ways to do it wrong, too), or even correctly but then incorrectly judged the result as wrong because he miscounted the zeroes himself. – harold Mar 26 '23 at 18:08
  • The number of 0 bits in a 32-bit word is: 32 minus the number of 1 bits in that same word. – Erik Eidt Mar 26 '23 at 18:14
  • 1
    The point of this program is to count only 0 (right now it's ) in binary representation of given int. So like i posted in a commented part I want resoult to be 2 but changing jnc to jc (from jum if not carried to jump if carried) seemed obvius but then resoult of it is 23. After reading comments i guess problem might be with all 0 in registry as 2023 does not fill it to max. The point is that i tried to flip it and it gives me weird value that I'm not so sure why I'm reciving. Also looking at compiler instructions is a little messy since I just started studing this language. – pawel_duszynski Mar 26 '23 at 18:20
  • So, you are saying that for `2023` the answer is two zero bits, as you stated in the question? And if you work with a 16-bit register say `ax` then the answer should still be two? – Weather Vane Mar 26 '23 at 18:22
  • @WeatherVane Yes that is right. The point of this program is to count 0 that make the value of x itself. I didn't even thought of registry size as its filled with leading 0 for this example. Sorry for not clarifying and thanks for the help. – pawel_duszynski Mar 26 '23 at 18:25

5 Answers5

3
unsigned int count_nonleading_zero_bits( unsigned int n ) {
   unsigned int count = 0;
   while ( n ) {
      if ( !( n & 1 ) )
         ++count;

      n >>= 1;
   }

   return count;
}

https://godbolt.org/z/asds67nbr

count_nonleading_zero_bits:
        xor     eax, eax
        test    edi, edi
        je      .L5
.L4:
        mov     edx, edi
        and     edx, 1
        cmp     edx, 1
        adc     eax, 0
        shr     edi
        jne     .L4
.L5:
        ret
ikegami
  • 367,544
  • 15
  • 269
  • 518
0___________
  • 60,014
  • 4
  • 34
  • 74
3

A solution that stays close to what you haved coded already. (Apparantly you follow a constraint about 'loop using add / jnc'.)

  • The L1 loop first locates the highest set bit in the EAX register
  • The L2 loop later tallies the embedded zero bits as requested
                     ; "mov eax, %1 ;"
    mov  ebx, 0
    mov  ecx, 32
L1: add  eax, eax
    jc   L3          ; Found highest 1 bit
    dec  ecx
    jnz  L1
    jmp  L4          ; EAX was 0
L2: add  eax, eax
    jc   L3
    inc  ebx         ; Count zeroes
L3: dec  ecx
    jnz  L2
L4:                  ; "mov %0, ebx ;"
Sep Roland
  • 33,889
  • 7
  • 43
  • 76
3

For modern x86 with BMI1 you'd want to use 32-popcnt(x) to count all the zeros in a whole register, and subtract lzcnt(x), the number of leading zeros, the ones that aren't part of the significant digits of your binary integer.

In GNU C, that looks like

int count_significant_zeros(unsigned x)
{
    if (!x)
        return 0;  // __builtin_clz has an undefined result for input 0
    
    return sizeof(x)*CHAR_BIT - __builtin_popcount(x) - __builtin_clz(x);
    // assume no padding bits.  Will be 32 when compiling for x86.
    // use popcountll and clzll for 64-bit integers.
}

Without -mpopcnt or SSE4.2, GCC and clang (Godbolt) use an SWAR bithack for __builtin_popcount, and bsr(x) ^ 31 as an optimization for 31-bsr(x) for the leading-zero count. Clang actually uses popcount(~x) as an optimization for 32-popcount(x), using a not instruction before the bithack. It probably would have been at least as efficient to do 32 - popcnt(x) - (31-bsr(x)) as 1 + bsr(x) - popcnt(x).

bsr leaves the destination register unmodified for an input of 0, so you need to special-case that. (AMD documents this, Intel documents the result as "undefined", unlike lzcnt which is well-defined to produce 32). Being able to use bsr without extra checks is why GNU C defines __builtin_clz to have an undefined result in that case. (https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html)

A C++20 version of this looks like this, using std::countl_zero to count zeros on the left (leading), producing the type width for an input of 0 so we don't need to check for that special case ourselves. This lets the compiler make better asm when instructions like lzcnt are available that don't need to special-case that.

#include <bit>
#include <limits>
int count_significant_zeros_cpp20(unsigned x)
{
    return std::numeric_limits<decltype(x)>::digits - std::popcount(x) - std::countl_zero(x);
}

From clang 16 -O2 -march=haswell for x86-64:

count_significant_zeros_cpp20(unsigned int):
        popcnt  eax, edi        # missed optimization: false dependency could be avoided with popcnt edi,edi
        lzcnt   ecx, edi        # Haswell also has a false output dependency on lzcnt, although Skylake doesn't.
        add     ecx, eax
        mov     eax, 32
        sub     eax, ecx
        ret

If you wanted to use inline asm for this, you might do it this way

int count_significant_zeros_asm(unsigned x)
{
  int retval, dummy;
  asm(
    "lzcnt  %[dummy], %[x] \n\t"     // false dependency on Intel before Skylake; could xor-zero [dummy] first to break it
    "popcnt %[x], %[x] \n\t"         // avoid false dependencies on Intel before Ice Lake
    "mov    %[ret], 32 \n\t"
    "sub    %[ret], %[dummy] \n\t"   // latency of this can overlap with popcnt latency
    "sub    %[ret], %[x] \n\t"

    : [ret]"=r" (retval), [dummy]"=&r"(dummy),  // let the compiler pick registers, dummy being an early-clobber output written before we read x for the last time.  Although [x] is an RMW inout operand so I don't think it can overlap
      [x]"+r"(x)                               // modify the input reg in place
    : // no pure inputs
    : // no clobbers
  );
  return retval;
  // dummy is also available with clz(x) in case you want it.
}

If you wanted to loop over bits, shifting right with shr makes a lot more sense; you can stop looping as soon as the value becomes 0, as in ikegami's rewrite of the community-wiki answer. (Which was originally just a std::popcount() counting ones, not even counting down from 32 to count zeros. You could do that with sbb reg, 0.)

You're shifting left with add, which means you have to loop through the leading zeros in a fixed-width register before getting to the leading 1 of the integer.

BTW, clang 14 and later supports -masm=intel for Intel syntax in asm("") statements; before that it only affected the output syntax, not how its built-in assembler treated asm statements.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Found a typo in "...instructions like `lznct` are available..." (`lzcnt`) – Sep Roland Mar 26 '23 at 20:18
  • @SepRoland Thanks, fixed. Feel free to edit simple typos yourself, especially in my answers, but generally that's encouraged when you're sure what an answer meant to say. (Especially for users with more than 2k rep so it bypasses the review queue.) – Peter Cordes Mar 26 '23 at 20:55
  • I could have, and normally I would have edited it, but I somehow felt the answer was still 'a work in progress' – Sep Roland Mar 26 '23 at 21:00
  • @SepRoland: Well, it is kind of a few different ideas jotted down quickly, but I was planning to stop there. Did you have any suggestions for what would make it look more "finished"? – Peter Cordes Mar 26 '23 at 21:01
  • There you got me cornered. I wouldn't dream of suggesting changes about parts of the instruction set I'm not familiar with. (But that I do love to read about.) – Sep Roland Mar 26 '23 at 21:05
  • @SepRoland: I'd been thinking in terms of maybe some more or changed text that tied the answer together more instead of just 3 random different sections. And/or I was really wondering if you could describe what gave an impression that it might be unfinished, like perhaps that the last section talks about looping implementations but doesn't show one? Multiple other answers covered that with code but very little text, so I just wrote some text about it. – Peter Cordes Mar 26 '23 at 21:12
2

An alternative:

unsigned int count_nonleading_zero_bits(unsigned int n) {
   unsigned int count = 0;
   while (n) {
      count += 1 - (n & 1);
      n >>= 1;
   }
   return count;
}

x86_64 Assembly:

count_nonleading_zero_bits:
        xor     edx, edx
        test    edi, edi
        je      .L1
.L3:
        mov     eax, edi
        not     eax
        and     eax, 1
        add     edx, eax
        shr     edi
        jne     .L3
.L1:
        mov     eax, edx
        ret

Hand optimized (not necessarily faster):

count_nonleading_zero_bits:
        xor     eax, eax   ;; set CF=0
.L3:
        adc     eax, 0
        shr     edi
        cmc
        jne     .L3
        ret
chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • 2
    Counting 1s downward with `sbb eax,0` (starting from `bsr eax, edi` / `inc eax`) might be even more efficient inside the loop, although you need to check for the input being zero in that case as well as calculating MSB pos. I expect your hand-optimized version *is* also efficient, as `cmc` is only 1 uop with 1c latency, and the loop-carried dependency chains are still only 1c long. (They connect to each other similar to [Latency bounds and throughput bounds for processors for operations that must occur in sequence](https://stackoverflow.com/q/63095394) with `cmc` or `mov/not/and` connecting – Peter Cordes Mar 26 '23 at 19:55
  • Does it work to do `sbb eax, -1`, to add 0 when CF=1, add 1 when CF=0? Yes, I think so, with an `stc` outside the loop instead of `cmc` inside. It'll be slower on Intel Haswell and earlier, where `adc eax, 0` was special-cased as 1 uop but non-zero immediates were 2 uops, because both uops are on the critical path for latency from eax to eax. (https://uops.info/) Unlike here where `adc eax, 0` is just 1c latency for EAX to EAX, with `cmc` feeding into but not part of the loop-carried dependency chain. (Before Sandybridge, even `adc eax,0` is 2 uops, but that's ancient history.) – Peter Cordes Mar 26 '23 at 21:28
  • @PeterCordes: I was hesitant about `sbb eax,-1`, but your analysis is impressive. – chqrlie Mar 26 '23 at 21:46
1

Thank you all for help not only with code but also with explanations! I really appriciate it.

This is changed code (thank you Sep Roland) with comments added that explain inerworkings of it.

  asm(
    "mov eax, %1 ;"           // register where binary representation of int x is stored
    "mov ebx, 0 ;"            // register where the result will be stored
    "mov ecx, 32 ;"           // loop counter
    "loop1:"                  // check and move to first one in the binary value
      "add eax, eax ;"        // add value to itself
      "jc loop3 ;"            // find the first one - jump to loop3 to decrease the counter
      "dec ecx ;"             // decrement the counter
      "jnz loop1 ;"           // if counter is not zero, repeat the loop
      "jmp loop4 ;"           // jump to l4 when eax = 0, when the counter has been zeroed (case where there are no 0s in the register)
    "loop2:"                  // main loop checking the number of zeros
      "add eax, eax ;"        // add value to itself
      "jc loop3 ;"            // if carry occurs, jump to loop3
      "inc ebx ;"             // increment the value holding the number of zeros
    "loop3:"                  // check counter status
      "dec ecx ;"             // decrement the counter
      "jnz loop2 ;"           // if counter is not zero, jump to loop2
    "loop4:"
      "mov %0, ebx ;"         // end of operation

    : "=r" (y)
    : "r"(x)
    : "eax", "ebx", "ecx"
  );

For clarification why I choosed this one. It's because the length and ease of understanding seemd the best for me. I previously stated that I just started studing this language and it's very diferent form high-level languages that I'm used to.

Sorry for english as it's my second language.

  • 1
    Like I showed in my answer, you can and should avoid using `mov` as the first and last instructions of your asm template. It's silly to get the compiler to pick a register for your input variable, and then copy it to another register instead of just using it. That ends up needing more registers, and you're forcing one of them to be EBX which the compiler will have to save/restore. If you want to write GNU C inline asm at all, instead of just playing around with a whole function written in asm in a `.S` file, you shouldn't defeat the purpose by writing excess instructions. – Peter Cordes Mar 26 '23 at 21:16
  • @PeterCordes Thanks for a advice. I'll try to mind it in next codes. – pawel_duszynski Mar 26 '23 at 21:24
  • To me it seems much easier to understand shifting right until you've shifted out all the non-zero bits. Having to skip leading zeros seems complicated, taking an extra loop. That's how you can make your original idea work, but there are better ways to do it; I wrote about this in the section at the bottom of my answer about if you want to loop, shift right, so you get to the interesting bits first. – Peter Cordes Mar 26 '23 at 21:29