
I'm new to x86 assembly programming and I have a few beginner questions I can't seem to find the answer to. I have the following code that, for the most part, I understand:

        mov    edx,edi
        shr    edx,1
        and    edx,0x55555555
        sub    edi,edx
        mov    eax,edi
        shr    edi,0x2
        and    eax,0x33333333
        and    edi,0x33333333
        add    edi,eax
        mov    eax,edi
        shr    eax,0x4
        add    eax,edi
        and    eax,0xf0f0f0f
        imul   eax,eax,0x1010101
        shr    eax,0x18
        ret

My questions are, first of all, what value gets put into edx on the first line?

My main question, though, is what the hexadecimal values are and what they do. I converted them to decimal to see if they had any special meaning, but they seem to be random hexadecimal values and I don't quite understand what they do.

Last question: if I have a value in edi and I do:

mov eax,edi

Does this remove the value from edi or does it just make a copy of it and store it in the eax register?

Any help would be appreciated.

john doe
  • Isn't it one of those clever algorithms to calculate number of set bits in 32b register, or something similar? (parity is more likely) – Ped7g Jun 09 '17 at 19:42
  • And why hexadecimal format of values... You can read particular bits from hexadecimal format, as each hexadecimal digit is formed by exactly 4 bits (0-15 value). From decimal value you can't do that in head, especially with 32 bit values (sort of doable with 8 bit, but still hexa is much more convenient and easier). So for example `0xAA` = 8+2 (*16), 8+2 => `1010 1010`. But in decimal it's 170 (128+32+8+2) ... hard to guess, which bits are hidden in the 170. – Ped7g Jun 09 '17 at 19:51
  • @Ped7g - it's a count the set bit sequence, a popcnt for edi. At the `and eax,0x33333333`, eax ends up with counts in every other 2 bit field. At the `and eax,0x0f0f0f0f`, eax ends up with counts in every other 4 bit field. The imul "sums" up the 4 bit counts into bit offset 24 (thus the right shift of 0x18). – rcgldr Jun 10 '17 at 14:49

3 Answers

This code counts the number of bits set in edi. The commented code is below; it would help to step through it to see what is going on. The clever part is the initial instructions. Each of the 16 2-bit fields can hold 0, 1, 2, or 3. The right shift and the AND with 0x55555555 put the msb of each 2-bit field into its lsb. The subtract (edi -= edx) then converts each 2-bit field into a count of its set bits: 0->0, 1->1, 2->1, 3->2. After that, 2-bit fields are added to form 4-bit fields, 4-bit fields are added to form 8-bit fields, and a multiply is used to sum up the 8-bit fields.

        mov     edx,edi                 ;edx = edi
        shr     edx,1                   ;mov upr 2 bit field bits to lwr
        and     edx,055555555h          ; and mask them
        sub     edi,edx                 ;edi = 2 bit field counts
                                        ; 0->0, 1->1, 2->1, 3->2
        mov     eax,edi
        shr     edi,02h                 ;mov upr 2 bit field counts to lwr
        and     eax,033333333h          ;eax = lwr 2 bit field counts
        and     edi,033333333h          ;edi = upr 2 bit field counts
        add     edi,eax                 ;edi = 4 bit field counts
        mov     eax,edi
        shr     eax,04h                 ;mov upr 4 bit field counts to lwr
        add     eax,edi                 ;eax = 8 bit field counts
        and     eax,00f0f0f0fh          ; after the and
        imul    eax,eax,01010101h       ;eax bit 24->29 = bit count
        shr     eax,018h                ;eax bit 0->5 = bit count
rcgldr
#include <stdio.h>

unsigned int fun ( unsigned int edi )
{
    unsigned int edx;
    unsigned int eax;

    //mov    edx,edi
    edx=edi;
    //shr    edx,1
    edx>>=1;
    //and    edx,0x55555555
    edx&=0x55555555;
    //sub    edi,edx
    edi-=edx;
    //mov    eax,edi
    eax=edi;
    //shr    edi,0x2
    edi>>=2;
    //and    eax,0x33333333
    eax&=0x33333333;
    //and    edi,0x33333333
    edi&=0x33333333;
    //add    edi,eax
    edi+=eax;
    //mov    eax,edi
    eax=edi;
    //shr    eax,0x4
    eax>>=4;
    //add    eax,edi
    eax+=edi;
    //and    eax,0xf0f0f0f
    eax&=0x0F0F0F0F;
    //imul   eax,eax,0x1010101
    eax=eax*0x01010101;
    //shr    eax,0x18
    eax>>=0x18; //24
    //ret
    return(eax);
}


int main ( void )
{
    unsigned int ra;
    unsigned int rb;
    for(ra=0;ra<100;ra++)
    {
        rb=fun(ra);
        printf("0x%08X 0x%08X\n",ra,rb);
    }
    printf("0x%08X\n",fun(0xFFFFFFFF));
    printf("0x%08X\n",fun(0x55555555));
    printf("0x%08X\n",fun(0xAAAAAAAA));
    printf("0x%08X\n",fun(0x33333333));
    return(0);
}

As mentioned in the other answer, this is probably some tricky pattern or algorithm, though it is not yet obvious to me. It could also be random-ish code whose author just wants to see whether you understand the various operations and has no trick up their sleeve.

Hopefully the above clears the air on what is going on.

Note how nicely the assembly syntax translates visually to real code or pseudocode with the destination on the left; no flip-flopping is required to read either the asm or the C. Intel, ARM, and other syntaxes make it very easy to write and read. You don't have to twist your brain around. Eventually, if you learn enough assembly languages, you can go both ways. It's like big endian vs. little endian.

old_timer

This appears to be Intel ass-backward syntax, which means (a) all instructions take the destination as their first operand, and (b) a bare numeric literal is an immediate operand, not a memory address.

Therefore:

what value gets put into edx on the first line?

Whatever was in edi when control reaches the first line, will be copied into edx. If this is the body of a function, then it's at least plausible that edi holds its single integer argument; that would be consistent with one of the common calling conventions for x86-64; however, I note that there is nothing else in here to indicate that this is 64-bit code, and in 32-bit mode it would be on the stack instead.

what are the hexadecimal values and what do they do

They are just numbers. Several of them are repeating binary patterns, e.g. 0x55555555 is 0b01010101010101010101010101010101 and 0x33333333 is 0b00110011001100110011001100110011. So, for instance, and edx, 0x55555555 clears all of the odd-numbered bits of register edx (counting the least significant bit as bit 0) and leaves the even-numbered bits unchanged.

It appears that the code overall is doing some kind of clever arithmetic, but I cannot tell what mathematical function it computes just by looking at it.

Does [mov eax,edi] remove the value from edi or does it just make a copy of it

It just makes a copy.

zwol
  • its not ass backwards it is correct syntax, AT&T used ass backward and then gcc adopted that and now we are stuck with it to some extent. – old_timer Jun 09 '17 at 19:30
  • I have seen something about 32-bit applications sometimes using the 64-bit calling convention, although I don’t remember if that was something that was actually used anywhere. The function might be using that. – Daniel H Jun 09 '17 at 19:33
  • y = mx+b makes sense (destination on the left) mx + b = y does not when y is the destination. intel and a few others makes sense, the rest do not. – old_timer Jun 09 '17 at 19:33
  • @old_timer AT&T syntax is a generic syntax that was used (with different mnemonics, of course) for minicomputers that were older than the 8086. Therefore, Intel is the one that has the operands backward. – zwol Jun 09 '17 at 19:33
  • It depends on whether you want to read it as “Move edi into edx” or “edx = edi”. Both orderings make some amount of sense; whichever you want to use is fine. – Daniel H Jun 09 '17 at 19:37
  • @old_timer the y = mx+b would make even more sense, if Intel would use `ld/load` instead of `mov`... but I still prefer Intel syntax by huge margin (on x86), because of the almost consistent memory/immediate/register syntax, and memory addressing being written in logical way (`[edx+esi*4+14] vs 14(edx,esi,4)`). – Ped7g Jun 09 '17 at 19:39
  • @Ped7g Oddly enough, my biggest beef with Intel syntax is not the operand order, but the memory addressing notation, which I find to be egregiously over-verbose. This probably has more to do with my having learned m68k assembly language first, than anything else. – zwol Jun 09 '17 at 19:45
  • @DanielH in hand written assembly you can of course use any calling convention, but the old Watcom C++ compiler did use registers for argument passing in 32b mode by default (~1995-2000), and there's something as "fastcall" convention even sort of standardized I think.. not sure, I mostly worked in pure asm, so I didn't follow any calling convention, every procedure was original with optimal API for it. zwol: sure, I started with Z80, so x86-Intel felt reasonably familiar. – Ped7g Jun 09 '17 at 19:46
  • @Ped7g What I was thinking of was a mis-remembering of [the x32 ABI](https://en.wikipedia.org/wiki/X32_ABI), which actually allows otherwise 64-bit programs to use 32-bit pointers, but the programs are still 64-bit. It looks like the common fastcall implementation passes the first argument in ecx, not edi. This is probably x64 assembly which only uses 32-bit values, as might be produced if you compiled a C program full of `uint32_t`s. – Daniel H Jun 09 '17 at 20:07
  • @zwol the company that invents the processor invents the assembly language generally, destination makes the most sense as these are often math operations where the destination is typically on the left. the first tools for a particular processor will conform to the design, so I wasnt there but would be very surprised if AT&T had a full set of tools before intel did for intel's processor. although mangling attempts have happened with other instruction sets it is not as bad as what AT&T and then gcc did with x86. – old_timer Jun 09 '17 at 20:10
  • I was around then just didnt start programming for a few more years after, was still the 8088/8086 when I started on it... – old_timer Jun 09 '17 at 20:22
  • @old_timer I can see where you're coming from, but I reject it: having to learn a new basic syntax for assembly, as well as new mnemonics and operand constraints, for every CPU, is a Bad Thing, and an additional demerit for Intel syntax. – zwol Jun 09 '17 at 21:27
  • Yeah, that's exactly why many people hate AT&T syntax. Most other widely used modern asm syntaxes have destination first, source(s) later, so [PDP 11-derived AT&T syntax](https://stackoverflow.com/questions/42244028/what-was-the-original-reason-for-the-design-of-att-assembly-syntax) for x86 *is* the odd one out. (including GAS syntax for ARM, MIPS, and newer ISAs like RISC-V and AArch64 where GAS and GCC follow the vendor syntax, instead of copying some 3rd-party thing). AT&T syntax is ok in isolation, but not worth being different from the vendor documentation. – Peter Cordes Aug 11 '20 at 23:43
  • Also, FLAG conditions like `jge` have the correct semantic meaning after `cmp eax, ecx` (eax >= ecx), but backwards after `cmp %ecx, %eax` (still eax >= ecx). And AT&T syntax also has a design mistake with non-commutative x87 instructions, depending on their operands. See the *[AT&T syntax bugs](https://sourceware.org/binutils/docs/as/i386_002dBugs.html)* section in the GAS manual. Fortunately x87 is mostly obsolete these days, but that bug is a blight upon AT&T syntax. It's a waste of time to have to know 2 syntaxes to write GCC bug reports (where they want AT&T) and read Intel manuals. – Peter Cordes Aug 11 '20 at 23:48
  • @PeterCordes You're not going to change my mind about this and I'm not interested in discussing it any more than I already did upthread. – zwol Aug 12 '20 at 00:59
  • That's fine, I can agree with you as far as saying it would have been nice if some standardization had caught on and designers of other new asm syntaxes had followed on the consistency bandwagon that AT&T tried to start. I just wanted to present those arguments, and the link about how/why AT&T syntax was designed that way, for future readers who might be curious. – Peter Cordes Aug 12 '20 at 01:06