
This is a program that gets passed a String as input.

I'm confused by the assembly code shown below, specifically line 6 (add rax, rdx). This is what I understood from my research:

  • rbp-48 is a pointer that points to the stack address where argv is stored. (argv itself is the address pointing to the start of the argv array.)
  • Now the rax register stores the argv array address.
  • We then add 8 bytes to rax. This means rax now points to the address of argv[1]. (I understand there is another address stored inside argv[1] that points to a string.)
  • We then access the value stored in argv[1] and store it in the rdx register. This means rdx now points to the address where the string begins.
  • We then move the counter variable i, stored at [rbp-24], to the eax register.
  • We then have a cdqe instruction, which I believe is not relevant.

And now is where I get confused: If I wanted to access the first character in argv[1] and store it in the eax register, I would expect the compiler to do something like:

mov   eax, BYTE PTR [rdx]

And if I needed to access the second character stored in argv[1] and store it in the eax register, I would expect the compiler to do something like:

mov   eax, BYTE PTR [rdx+1]

But instead, I see the compiler does the following:

add     rax, rdx
  • Adds the address in memory where the string begins to the address in memory where the address that points to the start of the string is stored, and saves this result in rax.

I cannot understand how this instruction makes rax point to any character in argv[1].

Below is the C code and the assembler code corresponding to the loop's instructions:

#include <string.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int sum = 0;
    for (int i = 0; i < strlen(argv[1]); i++) {
        sum += (int)argv[1][i];
    }
    return 0;
}

Assembly

mov     rax, QWORD PTR [rbp-48]
add     rax, 8
mov     rdx, QWORD PTR [rax]
mov     eax, DWORD PTR [rbp-24]
cdqe
add     rax, rdx
movzx   eax, BYTE PTR [rax]
movsx   eax, al
add     DWORD PTR [rbp-20], eax
add     DWORD PTR [rbp-24], 1
phuclv
IdkHowToRc
  • [tag:assembly] must also go with the architecture. And the assembler doesn't give you `add rax, rdx`, it's the compiler that emits that instruction. The assembler just assembles the human-readable instructions to machine code – phuclv Sep 26 '18 at 03:13
  • You don't understand how adding i to argv[1] gives &argv[1][i]? – user253751 Sep 26 '18 at 03:17
  • I don't understand the confusion. You know that rdx+1 is the address of argv[1][1], and you know that rax contains i, so it should be clear that rdx+rax is the address of argv[1][i]. The next statement `movzx eax, byte ptr [rax]` loads the character at that address. – prl Sep 26 '18 at 03:30
  • Compile with optimization enabled to get easier-to-understand code. Like `gcc -Og` that keeps variables in registers and might do mild optimization, or `gcc -O3 -fno-tree-vectorize`, or `gcc -Os`. See [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116). This code is *not* adding `1` to a pointer, it's adding it to the dword `i`, and using `i` as an index. It's weird that your compiler isn't using an indexed addressing mode like `movzx eax, byte ptr [rax+rdx]`. Is this from MSVC? It doesn't look like what anti-optimized `gcc -O0` would do. – Peter Cordes Sep 26 '18 at 03:32
  • The code isn't complete, there should be at least one `jmp`-family instruction, probably 2. – o11c Sep 26 '18 at 03:35
  • I want to comment on your terminology. It may be related to your confusion. Several times you say that a register "points to the address of" something. It is more precise to say that a register "points to" the thing or "contains the address" of the thing, except when you really do mean that the register points to the address of the thing. – prl Sep 26 '18 at 03:44

1 Answer


Oh, I finally figured out your confusion. At the point of the instruction in question, rax no longer contains argv; it was reloaded with the value of i. The compiler is using an add instruction instead of an indexed addressing mode.
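To make that concrete, here is a rough C sketch that walks through the same steps as the unoptimized instruction sequence, one statement per instruction or pair of instructions. The function name `char_at_like_asm` and the local names `slot`, `rdx`, `rax` and `al` are made up for illustration, and the 8-byte offset assumes x86-64, where `sizeof(char *)` is 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the question's program): computes
   argv[1][i] the same way the unoptimized assembly does. */
int char_at_like_asm(char **argv, int i)
{
    assert(argv != NULL && argv[1] != NULL);

    /* mov rax, [rbp-48] ; add rax, 8
       Address of the argv[1] slot; sizeof(char *) is 8 on x86-64. */
    char **slot = (char **)((char *)argv + sizeof(char *));

    /* mov rdx, [rax] : rdx now holds the address of the string's first byte. */
    char *rdx = *slot;

    /* mov eax, [rbp-24] ; cdqe : rax is reused and now holds i, widened to 64 bits. */
    int64_t rax = (int64_t)i;

    /* add rax, rdx : string start plus index, i.e. &argv[1][i]. */
    const char *addr = rdx + rax;

    /* movzx eax, byte ptr [rax] ; movsx eax, al
       Net effect: load one byte and sign-extend it to an int. */
    signed char al = *(const signed char *)addr;
    return (int)al;
}
```

Called with an argument vector whose second entry is "abc", `char_at_like_asm(args, 1)` yields the same value as `args[1][1]`. The key point is that `rax` holds two different things over the course of the sequence: first a pointer into the argv array, then the index `i`.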

eax is the lower 32 bits of rax. When eax is loaded, the value is zero-extended to 64 bits.

And then cdqe sign-extends EAX into RAX, because i is a signed 32-bit integer that you're using to index a pointer. The compiler could have simplified this by loading i with a single sign-extending load:
movsx rax, dword ptr [rbp-24] (Intel's manuals name this 32-to-64-bit form movsxd).
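The difference between the two kinds of extension only shows up for negative values of i. A small C sketch of both, assuming the usual two's-complement representation on x86-64; the function names `zero_extend32`, `sign_extend32` and `check_extensions` are invented for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Effect on rax of `mov eax, dword ptr [mem]` alone:
   the upper 32 bits of rax become zero. */
uint64_t zero_extend32(int32_t i)
{
    return (uint64_t)(uint32_t)i;
}

/* Effect on rax after the following `cdqe` (or of one sign-extending load):
   the sign bit of eax is copied into the upper 32 bits. */
int64_t sign_extend32(int32_t i)
{
    return (int64_t)i;
}

/* Non-negative values come out the same either way; negative ones do not. */
void check_extensions(void)
{
    assert(zero_extend32(5) == 5 && sign_extend32(5) == 5);
    assert(zero_extend32(-1) == UINT64_C(0xFFFFFFFF));
    assert(sign_extend32(-1) == -1);
}
```

For a signed index, only the sign-extended value gives the right address when added to a 64-bit pointer, which is why the compiler cannot rely on the implicit zero-extension here.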

Peter Cordes
prl