C Function to Step through its own Assembly

Question

I am wondering if there is a way to write a function in a C program that walks through the machine code of another function to locate the address of a particular instruction.

For example, I want to find the address of the ret instruction in the main function.

My first thought is to write a while loop that starts at the address of main ("&main") and increments the address by 1 on each iteration until the instruction at the current address is "ret", then returns that address.

Different x86 instructions have different lengths, and may include _immediate_ data values that could be mistaken for a 'ret' instruction. Also, a function could have more than one 'ret' instruction, or have variants of the 'ret' instruction. – Ian Abbott Nov 27 '17 at 19:03
@OliverCharlesworth I don't know how I would interpret what instruction is at the current address as I increment through main(). – Ethan Nov 27 '17 at 19:03
@IanAbbott Assuming that the functions are extremely simplistic and essentially useless, such that all they contain is a nop and a ret, is there a way to interpret the instruction given an address? – Ethan Nov 27 '17 at 19:05
What is the purpose of this? What are you actually trying to accomplish by doing this? – dbush Nov 27 '17 at 19:10
@dbush I'm trying to write a function that back traces and prints a count of how many functions deep it is and stops when it reaches the main function. Example: main() calls foo() calls bar() calls print(). print looks back and says, print 1, 2, 3...until it recognizes that it's back within the bounds of main(). – Ethan Nov 27 '17 at 19:14
GNU libc has some backtrace functions that a program can call to do this sort of thing as long as the program has not been compiled to omit frame pointers from the stack frames. – Ian Abbott Nov 27 '17 at 19:20
@ethanelle: Tracing function calls back to `main` is different from disassembling instructions. – Eric Postpischil Nov 27 '17 at 19:23
@EricPostpischil I say disassembling instructions because in this specific instance within the big picture, I only want to find the bounds of main() so the program can recognize whether an address is inside or outside of the main function. Is there an easier way to trace functions that does not consider the base case of being inside main()? – Ethan Nov 27 '17 at 19:28
@ethanelle: Finding the current function call chain is usually done by tracing the stack. Each function leaves information on the stack. Most especially, the return address for the prior function must be saved. (Although the most recent return address might be held in a register and not saved to the stack unless and until another call is about to be performed.) Tracing back on the stack used to be “easy,” as there were clear links from frame to frame, so you just had to follow the links to the last one. Unfortunately, as Ian Abbott mentions, programs may be compiled not to use frame pointers,… – Eric Postpischil Nov 27 '17 at 19:52
… in which case the desired information is not directly available, and tracing back can be hard. In any case, you should probably enter a new question, asking something like “How can I write a program that traces back its own stack frames?” (But search for existing questions first.) – Eric Postpischil Nov 27 '17 at 19:53
OP: you should certainly look into forcing the compiler to provide stack frames ("-fno-omit-frame-pointer" in gcc) and look into the classic methods of stack backtracing. If this is viable for you, it will work much better than any ad-hoc heuristic based on something else, plus it's largely supported by tools, maybe even some runtime functions in clib? (never used that, so I'm not sure). – Ped7g Nov 27 '17 at 21:08
Another option (related to this question) is to change `main` to store its own address (near the `call` into the app body) in some global variable, so you can check stack content against that address plus or minus a few bytes. But I don't see how to proceed further, or how to find the call depth even if you know which stack value is the return into `main`: the functions in between may allocate different amounts of memory per call, and you can't tell a return address value apart from an ordinary value. – Ped7g Nov 27 '17 at 21:14
@EricPostpischil yes, but that may be quite some distance from the actual `body` call, and the OP wants to use that address to decide which return address on the stack is going into `main`. As the C compiler is not obliged to even produce `main` code in a single assembly block (it may interleave it with other function code and jump around, if it wishes to do so), using a value very near to the expected return address would give a better probability of finding the true address into main. It's still a very silly and error-prone way to achieve what the OP is looking for, and may stop working with the next build. – Ped7g Nov 28 '17 at 13:01

score 4 · Answer 1 · answered Nov 27 '17 at 19:19

It is certainly possible to write a program that disassembles machine code. (Obviously, this is architecture-specific. A program like this works only for the architectures it is designed for.) And such a program could take the address of its main routine and examine it. (In some C implementations, a pointer to a function is not actually the address of the code of the function. However, a program designed to disassemble code would take this into account.)

This would be a task of considerable difficulty for a novice.

Your program would not increment the address by one byte between instructions. Many architectures have a fixed instruction size of four bytes, although other sizes are possible. The x86-64 architecture (known by various names) has variable instruction sizes. Disassembling it is fairly complicated. As part of the process of disassembling an instruction, you have to figure out how big it is, so you know where the next instruction is.

In general, though, it is not always feasible to determine which return instruction is the one executed by main when it is done. Although functions are often written in a straightforward way, they may jump around. A function may have multiple return statements. Its code may be in multiple non-contiguous places, and it might even share code with other functions. (I do not know if this is common practice in common compilers, but it could be.) And, of course, main might never return (and, if the compiler detects this, it might not bother writing a return instruction at all).

(Incidentally, there is a mathematical proof that it is impossible to write a program that always determines whether a program terminates or not. This is called the Halting Problem.)

Eric Postpischil

Thanks for such an in-depth response! I understand that a lot of it is architecture specific. But if you could assume these variables are ideal constants (main is really simple, calls one function, then has a return 0 statement and that's it), can a program read bytes until it recognizes a byte as a return instruction? – Ethan Nov 27 '17 at 19:25
No, there is no guarantee that part of the address that the call instruction jumps to is not the same as a ret instruction. As stated this is a task of considerable complexity even at its most very basic. – SoronelHaetir Nov 27 '17 at 19:39
The first thing you need is to be able to find instruction boundaries, which means you need their *length*. Here is my code to compute the length of instructions for x86-32: https://stackoverflow.com/a/23843450/120163 With that, you can identify the opcode bytes reliably, and thus decode the instructions. You can use that to step through the instructions in a procedure hunting for whatever you like. – Ira Baxter Nov 27 '17 at 19:41
gcc/clang don't merge common tails of functions. They're often bad at merging return paths even within the same function, e.g. it's common for `jle .L2` to go to a `ret` in its own block, instead of jumping to the `ret` at the end of another block. Also, code for a function isn't interleaved with the code for other functions, unless maybe a block is marked `cold` and put in its own section. I think this does happen for some exception-taking paths. – Peter Cordes Nov 28 '17 at 12:13
@ethanelle: What you propose would work fine on a fixed-width ISA like MIPS or ARM (without Thumb2 or MIPS16). Although it's complicated on 32-bit ARM because there are multiple ways to return. (`bx lr`, or pop into `PC` because the program counter is one of the 16 registers.) MIPS should be pretty consistent with `jr $ra`, but I think code could reload the return address into any other register... Anyway, you won't find the 4-byte return instruction as data in another instruction, because all instructions are 4 bytes. – Peter Cordes Nov 28 '17 at 12:21
@ethanelle: If you restrict all the functions to simple code, then, yes, it is possible, in many C implementations, to examine your program’s own code and trace back to `main`. (Note that simply converting the address of a function to a pointer to an object, even a `char`, is not defined by the C standard. Any results depend on the C implementation you use.) With these restrictions, there is little practical use for the task; it is merely an academic exercise. – Eric Postpischil Nov 28 '17 at 13:19
