Understanding the decompilation of an object to source code

Question

First of all, I am a student. I do not yet have extensive knowledge of C, C++ and assembler, so I am making an extreme effort to understand it.

I have this piece of assembly code from an Intel x86-32 bit processor.

My goal is to transform it to source code.

0x80483dc <main>:    push       ebp        
0x80483dd <main+1>:  mov        ebp,esp     
0x80483df <main+3>:  sub        esp,0x10
0x80483e2 <main+6>:  mov        DWORD PTR [ebp-0x8],0x80484d0   
0x80483e9 <main+13>: lea        eax,[ebp-0x8]   
0x80483ec <main+16>: mov        DWORD PTR [ebp-0x4],eax     
0x80483ef <main+19>: mov        eax,DWORD PTR [ebp-0x4]     
0x80483f2 <main+22>: mov        edx,DWORD PTR [eax+0xc]
0x80483f5 <main+25>: mov        eax,DWORD PTR [ebp-0x4]         
0x80483f8 <main+28>: movzx      eax,WORD PTR [eax+0x10]
0x80483fc <main+32>: cwde
0x80483fd <main+33>: add        edx, eax
0x80483ff <main+35>: mov        eax,DWORD PTR [ebp-0x4]         
0x8048402 <main+38>: mov        DWORD PTR [eax+0xc],edx     
0x8048405 <main+41>: mov        eax,DWORD PTR [ebp-0x4]     
0x8048408 <main+44>: movzx      eax,BYTE PTR [eax]
0x804840b <main+47>: cmp        al,0x4f     
0x804840d <main+49>: jne        0x8048419 <main+61> 
0x804840f <main+51>: mov        eax,DWORD PTR [ebp-0x4] 
0x8048412 <main+54>: movzx      eax,BYTE PTR [eax] 
0x8048415 <main+57>: cmp        al,0x4b 
0x8048417 <main+59>: je         0x804842d <main+81> 
0x8048419 <main+61>: mov        eax,DWORD PTR [ebp-0x4] 
0x804841c <main+64>: mov        eax,DWORD PTR [eax+0xc]
0x804841f <main+67>: mov        edx, eax
0x8048421 <main+69>: and        edx,0xf0f0f0f
0x8048427 <main+75>: mov        eax,DWORD PTR [ebp-0x4] 
0x804842a <main+78>: mov        DWORD PTR [eax+0x4],edx
0x804842d <main+81>: mov        eax,0x0
0x8048432 <main+86>: leave
0x8048433 <main+87>: ret

This is what I understand from the code:

There are 4 variables:

a = [ebp-0x8] ebp
b = [ebp-0x4] eax
c = [eax + 0xc] edx
d = [eax + 0x10] eax

Values:

0x4 = 4
0x8 = 8
0xc = 12
0x10 = 16
0x4b = 75
0x4f = 79

Types:

char (8 bits) = 1 BYTE
short (16 bits) = WORD
int (32 bit) = DWORD
long (32 bits) = DWORD
long long (32 bit) = DWORD

This is what I was able to create:

#include <stdio.h>
int main (void)
{
   int a = 0x80484d0;
   int b
   short c;
   int d;

   c + b?
if (79 <= al) {
instructions
} else {
instructions
}

   return 0
}

But I'm stuck. I also cannot understand what the instruction "cmp al .." compares. What is "al"?

How do these instructions work?

EDIT1:

That said, as you point out in the comments, the assembly seems to be wrong or, as one commenter put it, insane!

The code and the exercise are from a book called "Reversing, Reverse Engineering", page 140 (3.8 Proposed Exercises). It would never have occurred to me that it was wrong. If it is, this clearly makes it difficult for me to learn ...

So is it not possible to reverse it to get the source code because the assembly is bad? Maybe it is not optimized? Is it possible to optimize it?

EDIT2:

Hi!

I did ask and finally she says this should be the c code:

inf foo(void){
    char *string;//ebp-0x8
    unsigned int *pointerstring//[ebp-0x4]
    unsigned int *position;
    *position = *(pointerstring+0xc);
    unsigned char character;
    character=(unsigned char) string[*position];
    if ((character != 0x4)||(character != 0x4b))
    {
     *(position+0x4)=(unsigned int)(*position & 0x0f0f0f0f);
    }
    return(0);
}

Does it make any sense at all to you? Could someone please explain this to me? Does anyone really program like this?

Thanks very much!

`long long (32 bit) = DWORD` is not correct. The C++ standard requires `long long` to be 64 bits. (well, not exactly but the max value it must support is such that it needs 64 bits to store it) — NathanOliver, Jun 01 '20 at 18:57
If you need to ask what `al` is then you need some deeper study of the processor, before you can make sense of a disassembly or assembly listing. The `al` register is the least significant 8 bits of the `eax` register, and `ax` is the l.s. 16 bits of `eax`. Similar for `bl`, `cl`, `dl`, `bx`, `cx` and `dx`. Also `ah` is the next 8 bits of `eax` so that `ah` and `al` together make `ax`. — Weather Vane, Jun 01 '20 at 19:04
[Here's a link to the documentation](https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4). Look for the section titled "Basic Program Execution Registers" — user3386109, Jun 01 '20 at 19:29
You have source code posted, it's in Intel x86 assembly language. Which high level language are you converting to, C or C++? They are distinct languages. For example C++ has inheritance and `std::string` where as C doesn't. I recommend picking one language to make your project easier (don't mix them, as that makes your program more complicated). — Thomas Matthews, Jun 01 '20 at 19:32
I highly recommend writing a small function, that receives a pointer parameter, and tell your compiler to output the assembly language for it. This will show you how the compiler handles the function call overhead. Next, declare a local variable, and print out the assembly language. You are looking for patterns. You can take your high level guess, print out the assembly language and compare it to your posted assembly language. Not an easy task especially for beginners. — Thomas Matthews, Jun 01 '20 at 19:36
IMHO, the `mov DWORD PTR [ebp-0x8],0x80484d0` is loading the address of a class or structure. Some of the `mov eax` and `mov edx`, could be used to load a member variable or dereference a pointer. — Thomas Matthews, Jun 01 '20 at 19:39
@ThomasMatthews: `mov DWORD PTR [ebp-0x8],0x80484d0` is *storing* a 32-bit immediate constant to a local var on the stack (because this asm is the result of compiling something with clang or gcc `-O0` - anti-optimized debug mode that treats all variables kind of like `volatile`). That `0x80484d0` is likely a pointer, and it's in the same page as the code, so it's almost certainly in the `.rodata` section. The first few instructions after `lea eax,[ebp-0x8]` / mov are reloading that pointer-to-pointer and offsetting from that, not the pointed-to `.rodata`, unless I'm having a brain fart. — Peter Cordes, Jun 01 '20 at 21:58
@PeterCordes I don't think you're having a brain fart. I think there's something seriously weird about that assembly. In particular, after `0x80483f2 <main+22>: mov edx,DWORD PTR [eax+0xc]`, it looks to me like the return address will have ended up in `edx`, which seems undesirable given what the code then goes on to do with it. — Joseph Sible-Reinstate Monica, Jun 02 '20 at 00:03
Unless the chapter is about obfuscated code, I'd write to the author. There is a trivial logic error in the code and the book doesn't pop up on Google (did you give us the right name?) so I'm inclined to believe that either the book has a wonderful chapter about control flow obfuscation or has been written by someone who shouldn't have. — Margaret Bloom, Jun 02 '20 at 11:47
Hi! Here is the book: https://books.google.es/books?id=ko6fDwAAQBAJ&pg=PA139&lpg=PA139&dq=%C2%BFQu%C3%A9+n%C3%BAmero+m%C3%A1ximo+de+variables+con+modificador+register+puede+utilizarse+dentro+de+una+funci%C3%B3n?&source=bl&ots=UICtfU0byB&sig=ACfU3U17T8LkZ6YQ93pLfxmkWvcsNhZhpQ&hl=es&sa=X&ved=2ahUKEwilu4Wdq9vpAhUGlxQKHZLkB14Q6AEwAHoECAoQAQ#v=snippet&q=3.8%20Ejercicios&f=false Honestly it is driving me crazy, the book it is not well written in general and if the exercises are also wrong, I do not know what to do ... This is the material that they have given me to study at my institute .. — conjim, Jun 02 '20 at 12:54
@conjim: I'd suggest talking to your instructor. They may not have realized the serious problems with the book, or at least with this exercise in particular. They might start checking / correcting problems before they assign them to your class, and hopefully not use that textbook next year. — Peter Cordes, Jun 14 '20 at 13:32
The C code that your professor gave you does not correspond to the assembly you originally posted, and in fact it's more bug-riddled than the assembly. — Joseph Sible-Reinstate Monica, Jun 23 '20 at 15:54
Thank you all and to you Monica. At least I am happy because I thought that the problem was me but I see that it really is the assembly that they have given me and their answer. Thank you very much — conjim, Jun 23 '20 at 17:08

score 3 · Answer 1 · answered Jun 02 '20 at 00:37

Your assembly is completely insane. This is roughly equivalent C:

int main() {
    int i = 0x80484d0; // in ebp-8
    int *p = &i; // in ebp-4
    p[3] += (short)p[4]; // add argc to the return address(!)
    if((char)*p != 0x4f || (char)*p != 0x4b) // always true because of || instead of &&
        p[1] = p[3] & 0xf0f0f0f; // note that p[1] is p
    return 0;
}

It should be immediately obvious that this is horrifically bad code that almost certainly won't do what the programmer intended.
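
To see where the indices and the `(short)` cast in the listing above come from, here is a minimal sketch. The helper names `frame_offset` and `movzx_word_then_cwde` are made up for illustration. It assumes 4-byte `int` slots, as on 32-bit x86, and that converting an out-of-range value to `int16_t` wraps the way it does on mainstream compilers.

```c
#include <stdint.h>

/* Hypothetical helper: ebp-relative offset of p[index], given that p holds
   the address ebp-8 and each element is a 4-byte int. */
int frame_offset(int index)
{
    return -8 + 4 * index;
}

/* Hypothetical helper modelling `movzx eax, WORD PTR [...]` followed by
   `cwde`: only the low 16 bits of the memory operand survive, and cwde
   then sign-extends AX into EAX. */
int32_t movzx_word_then_cwde(uint32_t memory_dword)
{
    uint16_t ax = (uint16_t)(memory_dword & 0xFFFFu); /* movzx keeps the low word */
    return (int32_t)(int16_t)ax;                      /* cwde: sign-extend AX */
}
```

With these, `frame_offset(3)` is 4, i.e. `[ebp+4]`, which holds the return address in a standard frame; `frame_offset(4)` is 8, i.e. `[ebp+8]`, the first argument slot; and `frame_offset(1)` is -4, which is where `p` itself is stored.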

Joseph Sible-Reinstate Monica

`0x80484d0` is likely an address in the .rodata section, so I was thinking `char *arr[1] = { "something" };`. But then yes, `int *p = (int*)arr;` and UB from accessing outside the bounds of the array. I didn't look at how crazy the rest of it is. – Peter Cordes Jun 02 '20 at 01:01
I asked my teacher but he says it is because it is not optimized ... honestly, if you, who know this, do not understand it, I doubt that I can either .. – conjim Jun 02 '20 at 12:59
@conjim "Not optimized" would mean the code does the right thing, but does it slower than necessary. This code does the wrong thing. – Joseph Sible-Reinstate Monica Jun 02 '20 at 13:47
Thanks for the information, I suppose in that case there is nothing else that can be done with this ... I thought that maybe I was not getting it right but it seems that everyone agrees that it is bad code. – conjim Jun 02 '20 at 19:34

Marcelo Roberto Jimenez · Answer 2 · 2020-06-01T20:07:17.787

The x86 assembly language follows a long legacy and has mostly kept compatibility. We need to go back to the 8086/8088 chips, where that story starts. These were 16-bit processors, which means that their registers had a word size of 16 bits. The general purpose registers were named AX, BX, CX and DX. The 8086 had instructions to manipulate the upper and lower 8-bit parts of these registers, which were named AH, AL, BH, BL, CH, CL, DH and DL. This Wikipedia page describes this, please take a look.

The 32 bit versions of these registers have an E in front: EAX, EBX, ECX, etc.

The particular instruction you mention, e.g., cmp al,0x4f, compares the lower byte of the AX register (AL) with 0x4f. The comparison is effectively a subtraction that discards the result and only sets the flags.
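
As a rough illustration of the register naming and of what `cmp` does, here is a small C sketch. It only models the idea: the function names are invented for this example, and a real `cmp` also updates other flags (carry, sign, overflow) that are not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sub-register views of a 32-bit EAX value. */
uint8_t  reg_al(uint32_t eax) { return (uint8_t)(eax & 0xFFu); }        /* bits 0..7  */
uint8_t  reg_ah(uint32_t eax) { return (uint8_t)((eax >> 8) & 0xFFu); } /* bits 8..15 */
uint16_t reg_ax(uint32_t eax) { return (uint16_t)(eax & 0xFFFFu); }     /* bits 0..15 */

/* Hypothetical model of `cmp al, imm8`: compute al - imm8, throw the
   difference away, and report only whether the zero flag (ZF) is set. */
bool cmp_al_sets_zf(uint32_t eax, uint8_t imm8)
{
    uint8_t difference = (uint8_t)(reg_al(eax) - imm8);
    return difference == 0; /* ZF = 1 means "equal": je jumps, jne falls through */
}
```

For example, with EAX holding 0x1234564F, AL is 0x4F, AH is 0x56 and AX is 0x564F, so `cmp al,0x4f` would set ZF and a following `jne` would not jump.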

For the 8086 instruction set, there is a nice reference here. Your program is 32 bit code, so you will need at least an 80386 instruction reference.

score 1 · Answer 3 · answered Jun 01 '20 at 23:54

You have analyzed the variables, and that's a good place to start. You should try to add type annotations to them: the size, as you started, and, when they are used as pointers (like b), what kind and size of thing they point to.

I might update your variable chart as follows, knowing that [ebp-4] is b:

c = [b + 0xc]
d = [b + 0x10]
e = [b + 0], size = byte
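
One way to picture that chart is as a record that b points to. This is only a sketch under an assumption: the listing does not prove that a struct was involved, and the struct name, field names and the unused gap below are invented. The compile-time checks merely confirm that such a layout would produce the offsets seen in the disassembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record; names and the unused gap are assumptions. */
struct record {
    uint8_t e;        /* [b + 0],    BYTE, compared with 'O' and 'K' */
    int32_t field_4;  /* [b + 0x4],  DWORD, written at main+78       */
    int32_t unused_8; /* [b + 0x8],  never touched by the listing    */
    int32_t c;        /* [b + 0xc],  DWORD                           */
    int16_t d;        /* [b + 0x10], WORD                            */
};

/* Compile-time checks that the layout matches the offsets in the listing. */
_Static_assert(offsetof(struct record, e) == 0x0, "e at offset 0");
_Static_assert(offsetof(struct record, field_4) == 0x4, "field_4 at offset 4");
_Static_assert(offsetof(struct record, c) == 0xc, "c at offset 0xc");
_Static_assert(offsetof(struct record, d) == 0x10, "d at offset 0x10");
```

With a layout like this, `c = [b + 0xc]` reads as `b->c` and `d = [b + 0x10]` as `b->d`.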

Another thing to analyze is the control flow. For most instructions control flow is sequential, but certain instructions purposefully alter it. Broadly speaking, when the pc is moved forward, it skips some code and when the pc is moved backward it repeats some code it already ran. Skipping code is used to construct if-then, if-then-else, and statements that break out of loops. Jumping back is used to continue looping.

Some instructions, called conditional branches, test a dynamic condition: when it is true they jump forward (or backward), and when it is false they simply advance to the next instruction (sometimes called conditional branch fall-through).

The control sequences here:

...
0x8048405 <main+41>: mov        eax,DWORD PTR [ebp-0x4]    b
0x8048408 <main+44>: movzx      eax,BYTE PTR [eax]         b->e

0x804840b <main+47>: cmp        al,0x4f                    b->e <=> 'O'
0x804840d <main+49>: jne        0x8048419 <main+61>        b->e != 'O'  skip to 61

** we know that the letter, b->e, must be 'O' here

0x804840f <main+51>: mov        eax,DWORD PTR [ebp-0x4]    b      
0x8048412 <main+54>: movzx      eax,BYTE PTR [eax]         b->e

0x8048415 <main+57>: cmp        al,0x4b                    b->e <=> 'K'
0x8048417 <main+59>: je         0x804842d <main+81>        b->e == 'K' skip to 81

** we know that the letter, b->e, must not be 'K' here if we fall thru the above je 

** this line can be reached by taken branch jne or by fall thru je
0x8048419 <main+61>: mov        eax,DWORD PTR [ebp-0x4]    ******
...

When the flow of control reaches the last line, tagged ******, we know that the letter is either not 'O' or not 'K'.

The construct where the jne instruction is used to skip another test is a short-circuit || operator. Thus the control construct is:

if ( b->e != 'O' || b->e != 'K' ) {
    then-part
}
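
To make the mapping from branches to the `||` concrete, here is a sketch that writes the same test twice: once with `goto`, mirroring the `jne`/`je` pair instruction by instruction, and once as structured C. The function names are invented for illustration, and the then-part is reduced to returning `true`.

```c
#include <stdbool.h>

/* Hypothetical helper: returns true when the then-part would run.
   The gotos follow the two conditional branches in the listing. */
bool then_part_runs(unsigned char letter)
{
    if (letter != 'O') goto then_part; /* cmp al,0x4f / jne main+61 */
    if (letter == 'K') goto done;      /* cmp al,0x4b / je  main+81 */
then_part:                             /* main+61: reached by jne or by fall-through */
    return true;
done:                                  /* main+81: then-part skipped */
    return false;
}

/* The same control flow written as a short-circuit || expression. */
bool then_part_runs_structured(unsigned char letter)
{
    return letter != 'O' || letter != 'K';
}
```

Both functions return the same result for every input, which is the sense in which the `jne` that skips the second test implements a short-circuit `||`.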

Since these two conditional branches are the only flow control modifications in the function, there is no else part of the if, and there are no loops or other if's.

This code appears to have a slight problem.

If the value is not 'O', the then-part will fire from the first test. However, if we reach the 2nd test, we already know the letter is 'O', so testing it against 'K' is pointless: the "not 'K'" test will always be true, because 'O' is not 'K'.

Thus, this if-then will always fire.
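
This can be checked by brute force, since a byte has only 256 possible values. The sketch below counts the values for which the then-part would be skipped. The function names are invented, and the `&&` variant is included only for contrast, as one possible form of what the programmer may have meant.

```c
/* Hypothetical helper: counts byte values for which the condition as
   compiled (with ||) is false, i.e. the then-part would be skipped. */
int count_skipped_with_or(void)
{
    int skipped = 0;
    for (int value = 0; value <= 0xFF; ++value) {
        unsigned char letter = (unsigned char)value;
        if (!(letter != 'O' || letter != 'K'))
            ++skipped;
    }
    return skipped;
}

/* For contrast only: the same count if && had been used instead. */
int count_skipped_with_and(void)
{
    int skipped = 0;
    for (int value = 0; value <= 0xFF; ++value) {
        unsigned char letter = (unsigned char)value;
        if (!(letter != 'O' && letter != 'K'))
            ++skipped;
    }
    return skipped;
}
```

`count_skipped_with_or()` returns 0, because no byte can equal both 'O' and 'K' at once, while `count_skipped_with_and()` returns 2, one for 'O' and one for 'K'.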

It is either very inefficient or a bug: perhaps the next letter along in the (presumably) string should have been tested for 'K', rather than the same letter again.

*`[ebp-4]` is `b`* - isn't `b` a pointer variable, holding the address `ebp-8`? So offsetting from that `[eax + 0xc]` is indexing into the function's own stack frame, with `mov edx, [eax + 0xc]` loading the function's return address. (Joseph Sible and I [discussed this in comments](https://stackoverflow.com/questions/62139013/reversing-object-to-source-code-help-to-understand-piece-of-code?noredirect=1#comment109906230_62139013); it looks insane to us. I wonder if the original code is indexing outside the bounds of a local array like `char *arr[1] = { "hello" };`?) — Peter Cordes, Jun 02 '20 at 00:19
I am still very confused ... but thank you very much for the information. Is there anyway to automatically transform object code into source code, even if it is remotely similar... but not quite? Thanks — conjim, Jun 02 '20 at 13:03
