I'm writing a tool for program analysis. I want to find all code regions (i.e. regions containing machine instructions) inside the program, including dynamically linked code. So far, I have a main process that forks and attaches as a tracer to a sub-process. I read /proc/<pid>/maps from procfs to find all executable mmap-ed regions (and redo this every time an mmap syscall is performed, to support dynamic linking), but then I am stuck. I believe all virtual-memory regions that are marked executable in the memory map are in ELF format.

I have tried looking at all in-memory section headers with executable permissions, but sometimes a section's offset in the ELF falls outside the mapped space, and as such is not inside the executable mapping. I believe I might need to look at the program headers, but I can't get it right.

Any help would be much appreciated!

Olle

1 Answer

> I have tried looking at all section headers

Section headers are not required to be present in the binary at all (they are only used during static linking), and are certainly not required to be present in memory.

> I believe I might need to look at the program headers, but I can't get it right.

Looking at program headers should give you the segments, yes. But executable segments may contain read-only data and many other things, so they will give you a superset of all static "code regions".
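
As a rough illustration of reading the program headers, here is a minimal Python sketch that lists the executable `PT_LOAD` segments of an ELF image. The function name `executable_segments` is made up for this example, and the sketch assumes a 64-bit little-endian ELF with an ordinary `e_phnum` (the `PN_XNUM` extension is not handled). The bytes could come from the file on disk or from the start of the mapping read out of the tracee.

```python
import struct

PT_LOAD = 1   # loadable segment type
PF_X = 0x1    # execute permission bit in p_flags

def executable_segments(elf: bytes):
    """Return (p_offset, p_vaddr, p_filesz, p_memsz) per executable PT_LOAD segment.

    Hypothetical helper; assumes a 64-bit little-endian ELF image.
    """
    if elf[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if elf[4] != 2 or elf[5] != 1:  # EI_CLASS == ELFCLASS64, EI_DATA == ELFDATA2LSB
        raise ValueError("this sketch only handles 64-bit little-endian ELF")

    # Offsets of the fields within Elf64_Ehdr.
    (e_phoff,) = struct.unpack_from("<Q", elf, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf, 0x36)

    segments = []
    for i in range(e_phnum):
        base = e_phoff + i * e_phentsize
        # Elf64_Phdr: p_type, p_flags, p_offset, p_vaddr, p_paddr,
        #             p_filesz, p_memsz, p_align
        (p_type, p_flags, p_offset, p_vaddr,
         _p_paddr, p_filesz, p_memsz, _p_align) = struct.unpack_from(
            "<IIQQQQQQ", elf, base)
        if p_type == PT_LOAD and (p_flags & PF_X):
            segments.append((p_offset, p_vaddr, p_filesz, p_memsz))
    return segments
```

For a position-independent executable or a shared library, `p_vaddr` is relative to the load base, so it has to be added to the start address of the object's mapping before it can be compared with addresses in the tracee. The ranges found this way are still only an upper bound on where code can be, for the reason given above.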

In addition, new "code regions" could be created without any associated ELF files (think mmap + just-in-time compilation / code generation).
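
Such regions still show up in /proc/<pid>/maps, just without a backing file. A minimal Python sketch of how they could be told apart from file-backed mappings follows; the function name `executable_mappings` and the "path starts with `/`" heuristic are assumptions of this example, not something the kernel guarantees to be sufficient.

```python
def executable_mappings(maps_text: str):
    """Return (start, end, path, file_backed) for each executable mapping.

    Hypothetical helper; maps_text is the content of /proc/<pid>/maps.
    """
    regions = []
    for line in maps_text.splitlines():
        # Fields: address range, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip malformed or empty lines
        addr_range, perms = fields[0], fields[1]
        if "x" not in perms:
            continue
        start, end = (int(part, 16) for part in addr_range.split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        # Heuristic: real files have an absolute path; anonymous mappings
        # have no path, and pseudo-regions look like "[vdso]".
        file_backed = path.startswith("/")
        regions.append((start, end, path, file_backed))
    return regions
```

In a tracer this would be called as `executable_mappings(open(f"/proc/{pid}/maps").read())`. Entries with `file_backed` set to `False` have no ELF file to parse, so their contents can only be read from the tracee's memory.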

It's unclear why you want to find all code regions in the first place (see http://xyproblem.info). Usually binary analysis tools discover them by tracing basic blocks (i.e. by intercepting all control transfer instructions).

Employed Russian
  • Yeah, I get that. As I said, I also watch for mmap calls to potentially handle code generation. The task at hand is to find and mark all branching instructions for a given program (and its loaded libraries). I am aware I could do it by tracing, but thought this would be more interesting. So if I get you right: the only way to find ALL code (including generated code) is by tracing the program (potentially from the entry point)? – Olle Jan 12 '23 at 13:40
  • @Olle I wouldn't say it's the _only_ way. You could in theory statically trace all reachable code by analyzing basic blocks, starting from `_start`. There is existing literature suggesting people have done this. But AFAIU this is infeasibly slow for non-toy binaries. – Employed Russian Jan 12 '23 at 15:29