Overload symbols of running process (LD_PRELOAD attachment)

Question

I'm working on a heap profiler for Linux, called heaptrack. Currently, I rely on LD_PRELOAD to overload various (de-)allocation functions, and that works extremely well.

Now I would like to extend the tool to allow runtime attaching to an existing process that was started without LD_PRELOADing my tool. I can dlopen my library via GDB just fine, but that won't overwrite malloc etc. I think this is because, at that point, the linker has already resolved the position dependent code of the running process - correct?

So what do I do instead to overload malloc and friends?

I am not proficient with assembler code. From what I've read so far, I guess I'll somehow have to patch malloc and the other functions, such that they first call back to my trace function and then continue with their actual implementation? Is that correct? How do I do that?

I hope there are existing tools out there, or that I can leverage GDB/ptrace for that.

I just stumbled upon ltrace, which is supposed to support runtime attachment, but the malloc filter won't work then. So I have the feeling that a simple ptrace approach won't work? — milianw, Nov 25 '14 at 22:14
I'm not sure what you mean by "the malloc filter won't work". `ltrace -e 'malloc+free' -p xxxxx` seems to work just fine here (ltrace 0.7.3 running on linux 3.13.0 / x86_64). — xbug, Nov 29 '14 at 01:42
@xbug: Odd, this is exactly what I tried and it does _not_ work for me. I use the same ltrace version, but Linux 3.17.4-1-ARCH, i.e. from ArchLinux. If I runtime-attach ltrace to any application, it stays silent. If I otoh start the application with ltrace, it works. Any idea what might be going on? — milianw, Nov 30 '14 at 13:20
@xbug: I just built ltrace from sources, and with that version, runtime attachment seems to work. It seems to be extremely slow though, which makes it essentially useless for me. — milianw, Nov 30 '14 at 13:43
@milianw: I do believe I've described a ptrace-based solution [here](http://stackoverflow.com/a/24356162/1475978); are you aware of it? The latter example in that answer replaces an address with a write syscall, in your case you'd replace the initial parts of the target functions with jumps to the interposed functions. The technique is not simple (the hard part is finding the addresses in the target binary to overwrite), and it's very architecture-specific, but after the interposing, there is no extra overhead or speed penalty at all. — Nominal Animal, Dec 02 '14 at 17:03
@NominalAnimal: Nope, I wasn't aware of that. Very interesting. I'll see if I eventually figure out how to code this up to call my own function in malloc, and getting access to both, input arguments and return value... — milianw, Dec 02 '14 at 18:35
I've taken some interest in heaptrack and its statistics-gathering process. What information, exactly, must it record? Is it enough to record the arguments, return value and immediate caller of `malloc`/`free`? Or must the entire backtrace be examined? It is my view that your tool will be fastest (and the strategy different) if it gathers on the fly only the minimum amount of data required to reconstitute the events of interest. Currently I envision patching `malloc`'s first instruction as a `jmp` to an injected page of code, accompanied by a large buffer for call records. — Iwillnotexist Idonotexist, Dec 03 '14 at 15:01
@IwillnotexistIdonotexist: I also see patching `malloc()`, `memalign()`, `posix_memalign()`, `free()` et al. as the way to go. Using ptrace to attach to the target process, and anonymously mapping writable pages, then copying position-independent executable code to that page, is not hard at all. The attaching process can use elf tools and `/proc/PID/maps` to locate the target addresses. This should work for even static binaries (no libdl). Difficult part is to disassemble/duplicate the asm op(s) under the jump instruction -- unless it is a jump instruction itself, of course. — Nominal Animal, Dec 03 '14 at 16:46
@NominalAnimal Precisely; However, I was more concerned with the _correctness_ of the injection since, strictly speaking, it is possible for the attach to occur while a thread is in the prologue of these functions. Worse, the compiled form of these fn's may include a branch backwards to somewhere within this prologue. The former can be solved by having `ptrace()` single-step all threads until they leave the prologues of all functions being injected (this should take no time at all), and then the process is patched. For the second, some primitive binary analysis and relocation will be required. — Iwillnotexist Idonotexist, Dec 03 '14 at 17:08
@IwillnotexistIdonotexist: I've explored ptracing multithreaded processes [in this answer](http://stackoverflow.com/a/18603766), including single-stepping individual threads; it seems robust and straightforward. On x86-64, the prologue (replaced part) is 5 to 13 bytes -- 5 bytes if replacement code is within a 32-bit offset to `%rip`, 13 bytes if an arbitrary 64-bit `pushq %rax; movabs $constant, %rax ; jmp *%rax` sequence is needed. Instruction analysis (those 5-13 bytes) is *nasty*. I'd prefer to mmap complete replacement functions instead. Would that be an acceptable option? — Nominal Animal, Dec 04 '14 at 02:04
@NominalAnimal But what if, hypothetically, the code generated by the compiler branches backwards into the replaced part? Then you'd have a process jump from some branch in `malloc` to where it expected certain instructions, but there it will find either the unexpected, or no instruction at all, and crash. More broadly, how can I be sure that the program will never attempt to execute anything in the replaced prologue ever again? To solve this problem in full generality would require solving the Halting Problem, and is indeed very nasty in its full generality. — Iwillnotexist Idonotexist, Dec 04 '14 at 02:46
@NominalAnimal But: I figure that the first instruction of the prologue is likely to be >=2 bytes. A 2-byte rel8 JMP (stage 1) at prologue could trampoline you to a place with, say, 5 bytes free between two functions or within one. You'd then use a 5-byte rel32 JMP (stage 2) to jump to your true injected code, or to your 13-byte sequence (stage 3) that jumps to anywhere in the 64-bit address space. As for `mmap`-ing complete replacements, I must be sure that a thread has exited the replaced function and will not come back into it other than through the entry point. — Iwillnotexist Idonotexist, Dec 04 '14 at 02:55
@IwillnotexistIdonotexist: Exactly! If the code uses functions from a known C library version, then we can tell the function address ranges (by compiling test binaries against the same library versions); and glibc et al. have public linkage only to the functions themselves, not within them. For robustness, one could single-step each thread until it is out of C library code altogether. However, this would lead to requiring helper code to be compiled against each c library version used... on the other hand, no instruction analysis! — Nominal Animal, Dec 04 '14 at 09:59
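
As an illustration of the byte counts discussed in the comments above, here is a minimal sketch of how the 5-byte and 13-byte x86-64 jumps could be encoded. The function names `encode_jmp_rel32` and `encode_jmp_abs64` are hypothetical, and the sketch only fills a buffer; writing the bytes into the target process (for example via ptrace) and relocating the overwritten instructions are not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode `jmp rel32` (opcode E9) for an instruction located at `from`
 * that should land at `to`. Returns 5 on success, or 0 if the target is
 * not reachable with a signed 32-bit displacement. */
size_t encode_jmp_rel32(uint8_t out[5], uint64_t from, uint64_t to)
{
    /* The displacement is relative to the end of the 5-byte instruction. */
    int64_t disp = (int64_t)(to - (from + 5));
    uint32_t d;

    assert(out != NULL);
    if (disp < INT32_MIN || disp > INT32_MAX)
        return 0;

    d = (uint32_t)(int32_t)disp;
    out[0] = 0xE9;
    for (int i = 0; i < 4; i++)
        out[1 + i] = (uint8_t)(d >> (8 * i));   /* little-endian */
    return 5;
}

/* Encode the 13-byte form that reaches any 64-bit address:
 *   pushq %rax ; movabs $to, %rax ; jmp *%rax
 * The code at `to` has to pop the saved %rax itself. */
size_t encode_jmp_abs64(uint8_t out[13], uint64_t to)
{
    assert(out != NULL);
    out[0] = 0x50;                              /* pushq %rax */
    out[1] = 0x48;                              /* REX.W */
    out[2] = 0xB8;                              /* movabs $imm64, %rax */
    for (int i = 0; i < 8; i++)
        out[3 + i] = (uint8_t)(to >> (8 * i));  /* little-endian */
    out[11] = 0xFF;                             /* jmp *%rax */
    out[12] = 0xE0;
    return 13;
}
```

A patcher would presumably try `encode_jmp_rel32` first and fall back to `encode_jmp_abs64` when it returns 0.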

Celelibi · Accepted Answer · 2017-04-07T17:05:16.137

Just for the lulz, another solution without ptracing your own process or touching a single line of assembly or playing around with /proc. You only have to load the library in the context of the process and let the magic happen.

The solution I propose is to use the constructor feature (brought from C++ to C by gcc) to run some code when a library is loaded. This library then just patches the GOT (Global Offset Table) entry for malloc. The GOT stores the real addresses of the library functions so that name resolution happens only once. To patch the GOT you have to play around with the ELF structures (see man 5 elf). Linux is kind enough to give you the aux vector (see man 3 getauxval), which tells you where to find the program headers of the current program in memory. However, a better interface is provided by dl_iterate_phdr, which is used below.
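
To make the aux vector route concrete, here is a minimal sketch that walks the main program's headers using only `getauxval`. The helper name `main_program_segment` is hypothetical. Note that this only sees the executable itself, not the shared objects it loaded, which is one reason dl_iterate_phdr is the more convenient interface.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <elf.h>
#include <link.h>
#include <stddef.h>
#include <sys/auxv.h>

/* Return the first program header of the main executable whose p_type
 * matches `type`, or NULL if there is none. */
const ElfW(Phdr) *main_program_segment(ElfW(Word) type)
{
    /* The kernel passes the location, count and entry size of the
     * program headers in the auxiliary vector. */
    const char *p = (const char *)getauxval(AT_PHDR);
    unsigned long phnum = getauxval(AT_PHNUM);
    unsigned long phent = getauxval(AT_PHENT);

    if (p == NULL)
        return NULL;
    assert(phent >= sizeof(ElfW(Phdr)));

    for (unsigned long i = 0; i < phnum; i++, p += phent) {
        const ElfW(Phdr) *ph = (const ElfW(Phdr) *)p;
        if (ph->p_type == type)
            return ph;
    }
    return NULL;
}
```

Calling `main_program_segment(PT_DYNAMIC)` would give the segment that holds the dynamic section of a dynamically linked executable. For a position-independent executable, its `p_vaddr` still has to be offset by the load base.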

Here is example code for a library that patches the GOT in exactly this way when its init function is called. The same could probably be achieved with a gdb script.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dlfcn.h>
#include <sys/auxv.h>
#include <elf.h>
#include <link.h>
#include <sys/mman.h>


struct strtab {
    char *tab;
    ElfW(Xword) size;
};


struct jmpreltab {
    ElfW(Rela) *tab;
    ElfW(Xword) size;
};


struct symtab {
    ElfW(Sym) *tab;
    ElfW(Xword) entsz;
};



/* Backup of the real malloc function */
static void *(*realmalloc)(size_t) = NULL;


/* My local versions of the malloc functions */
static void *mymalloc(size_t size);


/*************/
/* ELF stuff */
/*************/
static const ElfW(Phdr) *get_phdr_dynamic(const ElfW(Phdr) *phdr,
        uint16_t phnum, uint16_t phentsize) {
    int i;

    for (i = 0; i < phnum; i++) {
        if (phdr->p_type == PT_DYNAMIC)
            return phdr;
        phdr = (ElfW(Phdr) *)((char *)phdr + phentsize);
    }

    return NULL;
}



static const ElfW(Dyn) *get_dynentry(ElfW(Addr) base, const ElfW(Phdr) *pdyn,
        uint32_t type) {
    ElfW(Dyn) *dyn;

    for (dyn = (ElfW(Dyn) *)(base + pdyn->p_vaddr); dyn->d_tag; dyn++) {
        if (dyn->d_tag == type)
            return dyn;
    }

    return NULL;
}



static struct jmpreltab get_jmprel(ElfW(Addr) base, const ElfW(Phdr) *pdyn) {
    struct jmpreltab table;
    const ElfW(Dyn) *dyn;

    dyn = get_dynentry(base, pdyn, DT_JMPREL);
    table.tab = (dyn == NULL) ? NULL : (ElfW(Rela) *)dyn->d_un.d_ptr;

    dyn = get_dynentry(base, pdyn, DT_PLTRELSZ);
    table.size = (dyn == NULL) ? 0 : dyn->d_un.d_val;
    return table;
}



static struct symtab get_symtab(ElfW(Addr) base, const ElfW(Phdr) *pdyn) {
    struct symtab table;
    const ElfW(Dyn) *dyn;

    dyn = get_dynentry(base, pdyn, DT_SYMTAB);
    table.tab = (dyn == NULL) ? NULL : (ElfW(Sym) *)dyn->d_un.d_ptr;
    dyn = get_dynentry(base, pdyn, DT_SYMENT);
    table.entsz = (dyn == NULL) ? 0 : dyn->d_un.d_val;
    return table;
}



static struct strtab get_strtab(ElfW(Addr) base, const ElfW(Phdr) *pdyn) {
    struct strtab table;
    const ElfW(Dyn) *dyn;

    dyn = get_dynentry(base, pdyn, DT_STRTAB);
    table.tab = (dyn == NULL) ? NULL : (char *)dyn->d_un.d_ptr;
    dyn = get_dynentry(base, pdyn, DT_STRSZ);
    table.size = (dyn == NULL) ? 0 : dyn->d_un.d_val;
    return table;
}



static void *get_got_entry(ElfW(Addr) base, struct jmpreltab jmprel,
        struct symtab symtab, struct strtab strtab, const char *symname) {

    ElfW(Rela) *rela;
    ElfW(Rela) *relaend;

    relaend = (ElfW(Rela) *)((char *)jmprel.tab + jmprel.size);
    for (rela = jmprel.tab; rela < relaend; rela++) {
        uint32_t relsymidx;
        char *relsymname;
        relsymidx = ELF64_R_SYM(rela->r_info);
        relsymname = strtab.tab + symtab.tab[relsymidx].st_name;

        if (strcmp(symname, relsymname) == 0)
            return (void *)(base + rela->r_offset);
    }

    return NULL;
}



static void patch_got(ElfW(Addr) base, const ElfW(Phdr) *phdr, uint16_t phnum,
        uint16_t phentsize) {

    const ElfW(Phdr) *dphdr;
    struct jmpreltab jmprel;
    struct symtab symtab;
    struct strtab strtab;
    void *(**mallocgot)(size_t);

    dphdr = get_phdr_dynamic(phdr, phnum, phentsize);
    if (dphdr == NULL)
        return; /* No PT_DYNAMIC segment in this object: nothing to patch. */
    jmprel = get_jmprel(base, dphdr);
    symtab = get_symtab(base, dphdr);
    strtab = get_strtab(base, dphdr);
    mallocgot = get_got_entry(base, jmprel, symtab, strtab, "malloc");

    /* Replace the pointer with our version. */
    if (mallocgot != NULL) {
        /* Some programs map the GOT read-only (e.g. with full RELRO), */
        /* so make the page writable first. Assumes 4 KiB pages. */
        void *page = (void *)((intptr_t)mallocgot & ~(0x1000 - 1));
        if (mprotect(page, 0x1000, PROT_READ | PROT_WRITE) == 0)
            *mallocgot = mymalloc;
        else
            perror("mprotect");
    }
}



static int callback(struct dl_phdr_info *info, size_t size, void *data) {
    uint16_t phentsize;
    (void)data; /* unused */
    (void)size; /* unused */

    printf("Patching GOT entry of \"%s\"\n", info->dlpi_name);
    phentsize = getauxval(AT_PHENT);
    patch_got(info->dlpi_addr, info->dlpi_phdr, info->dlpi_phnum, phentsize);

    return 0;
}



/*****************/
/* Init function */
/*****************/
__attribute__((constructor)) static void init(void) {
    realmalloc = malloc;
    dl_iterate_phdr(callback, NULL);
}



/*********************************************/
/* Here come the malloc function and sisters */
/*********************************************/
static void *mymalloc(size_t size) {
    printf("hello from my malloc\n");
    return realmalloc(size);
}

And an example program that just loads the library between two malloc calls.

#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>



void loadmymalloc(void) {
    /* Report a failed load instead of silently ignoring it. */
    if (dlopen("./mymalloc.so", RTLD_LAZY) == NULL)
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
}



int main(void) {
    void *ptr;

    ptr = malloc(42);
    printf("malloc returned: %p\n", ptr);

    loadmymalloc();

    ptr = malloc(42);
    printf("malloc returned: %p\n", ptr);

    return EXIT_SUCCESS;
}

The call to mprotect is usually unnecessary. However, I found that gvim (which is compiled as a shared object) needs it. If you also want to catch references to malloc taken as pointers (which could otherwise be used later to call the real function and bypass yours), you can apply the very same process to the relocation table pointed to by the DT_RELA dynamic entry.
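
As a rough sketch of that DT_RELA pass, the following only counts the matching relocation entries rather than patching them. The names `count_rela_refs`, `scan_rela` and `dyn_ptr` are hypothetical. It assumes a platform that uses RELA relocations (such as x86-64), and it uses the heuristic that a `d_ptr` below the load base is an unrelocated offset.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <elf.h>
#include <link.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if UINTPTR_MAX > 0xffffffffu
#define R_SYM(info) ELF64_R_SYM(info)
#else
#define R_SYM(info) ELF32_R_SYM(info)
#endif

struct rela_query { const char *name; long hits; };

/* Assumption: values below the load base have not been relocated yet. */
static const void *dyn_ptr(ElfW(Addr) base, ElfW(Addr) p)
{
    return (const void *)(p < base ? base + p : p);
}

static int scan_rela(struct dl_phdr_info *info, size_t size, void *data)
{
    struct rela_query *q = data;
    const ElfW(Dyn) *dyn = NULL;
    const ElfW(Rela) *rela = NULL;
    const ElfW(Sym) *sym = NULL;
    const char *str = NULL;
    size_t relasz = 0;
    (void)size;

    /* Locate the dynamic section of this object. */
    for (int i = 0; i < info->dlpi_phnum; i++)
        if (info->dlpi_phdr[i].p_type == PT_DYNAMIC)
            dyn = (const ElfW(Dyn) *)(info->dlpi_addr
                                      + info->dlpi_phdr[i].p_vaddr);
    if (dyn == NULL)
        return 0;

    for (; dyn->d_tag != DT_NULL; dyn++) {
        switch (dyn->d_tag) {
        case DT_RELA:   rela = dyn_ptr(info->dlpi_addr, dyn->d_un.d_ptr); break;
        case DT_RELASZ: relasz = dyn->d_un.d_val; break;
        case DT_SYMTAB: sym = dyn_ptr(info->dlpi_addr, dyn->d_un.d_ptr); break;
        case DT_STRTAB: str = dyn_ptr(info->dlpi_addr, dyn->d_un.d_ptr); break;
        }
    }
    if (rela == NULL || sym == NULL || str == NULL)
        return 0;

    for (size_t i = 0; i < relasz / sizeof(*rela); i++) {
        const char *name = str + sym[R_SYM(rela[i].r_info)].st_name;
        /* The slot to patch would be dlpi_addr + rela[i].r_offset. */
        if (strcmp(name, q->name) == 0)
            q->hits++;
    }
    return 0; /* keep iterating over the remaining objects */
}

/* Count DT_RELA entries in all loaded objects that refer to `symname`. */
long count_rela_refs(const char *symname)
{
    struct rela_query q = { symname, 0 };
    assert(symname != NULL);
    dl_iterate_phdr(scan_rela, &q);
    return q.hits;
}
```

Turning this into a patcher would mean storing the replacement pointer at `dlpi_addr + r_offset` for each hit, with the same mprotect caveat as for the PLT entries.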

If the constructor feature is not available for you, all you have to do is resolve the init symbol from the newly loaded library and call it.

Note that you may also want to replace dlopen so that libraries loaded after yours get patched as well. This can happen if you load your library quite early or if the application dynamically loads plugins.
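
One building block for this, sketched here under assumptions, is a cheap check for whether anything new has been loaded since the last patching pass. This is a different technique from replacing dlopen itself: it relies on the `dlpi_adds` counter that glibc reports through dl_iterate_phdr, and the helper name `objects_added_since` is hypothetical. A replaced dlopen (or any other hook) could call it and rerun the patching only when it returns 1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stddef.h>

/* Callback that copies the global "objects added" counter. */
static int read_adds(struct dl_phdr_info *info, size_t size, void *data)
{
    unsigned long long *adds = data;

    /* dlpi_adds exists only if the runtime's structure is large enough. */
    if (size >= offsetof(struct dl_phdr_info, dlpi_adds)
                + sizeof(info->dlpi_adds))
        *adds = info->dlpi_adds;
    return 1; /* the counter is global, so the first object is enough */
}

/* Return 1 and update *last_seen if objects were loaded since the value
 * in *last_seen was recorded, otherwise return 0. */
int objects_added_since(unsigned long long *last_seen)
{
    unsigned long long now;

    assert(last_seen != NULL);
    now = *last_seen;
    dl_iterate_phdr(read_adds, &now);
    if (now == *last_seen)
        return 0;
    *last_seen = now;
    return 1;
}
```

Initialising the remembered value to something the counter cannot be, such as `(unsigned long long)-1`, forces the first call to report a change.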

This does look promising. And indeed, it works for the simple test you added. But in more complicated scenarios, e.g. when I attach to a bigger application via gdb and then `call (void) dlopen("/tmp/libinject.so", 0x0001)` there, I see that the lib gets initialized, but fails to find the malloc address. When I try it with `kwrite` e.g., the symbols it finds are `__libc_start_main`, `__gmon_start__`, `kdemain`. — milianw, Dec 05 '14 at 18:09
BTW, if I have time, I definitely plan to look more into your code. This looks extremely promising. Can I still "donate" bounty points to you if this works out in the end? If so, I'd be more than willing to. — milianw, Dec 05 '14 at 18:11
Is there a reason you use `getauxval` instead of `dl_iterate_phdr` from `link.h`? — milianw, Dec 05 '14 at 21:43
I've tried to iterate over all dynamic sections with `dl_iterate_phdr`, in the hope to make this overloading work even in apps that load in shared libraries, but can't get it to work... My code lives here (note: C++11 syntax) https://paste.kde.org/ptobkcije <-- it crashes when trying to overwrite the found malloc address, even though I check for readable and writable dynamic sections via `p_flags`... Any idea what I'm doing wrong? Do I need to call `mprotect` somewhere? — milianw, Dec 05 '14 at 22:54
Woha! I got it: for shared libraries, I need to take the `dlpi_addr` offset into account! Both when casting the `ElfW(Dyn)` from `p_vaddr` and when writing the symbol at `rela->r_offset`, I need to add the `dlpi_addr` offset, and magically it works. So many thanks Celelibi, without your help, I would never have found a way to write this up! How can I show you my gratitude? I've now accepted your answer, but the original bounty already timed out. Can I give you another bounty? Or anything else? Many thanks, really! Here's a link to my latest code: http://bit.ly/1Axmk4Y — milianw, Dec 05 '14 at 23:53
The main reason I didn't use `dl_iterate_phdr` is because I didn't know this function. :) I modified the code in my answer to use it and to apply the patch to all the loaded shared objects. You usually don't need to use `mprotect`, but some programs (like vim, compiled as a shared object itself) may require it. — Celelibi, Dec 06 '14 at 10:51
I read your C++ code, and I don't know why you need to skip `ld-linux-x86-64.so` and your own library. It works for me. I don't know either why you search for a `PT_DYNAMIC` segment with read and write permission. Normally, all that matters are the `PT_LOAD` segments. And I would suggest to put all your functions `static`. — Celelibi, Dec 06 '14 at 11:26
Without the mprotect, I get the crashes on ld-linux etc. pp. That helps, thanks! I could also get rid of the flag checks then. Note that all my functions are static, as they are in an anonymous namespace. I'll reward you with the additional bounty tomorrow - thanks again Celelibi! — milianw, Dec 07 '14 at 15:08
@milianw I completed my answer to also tell to overload `dlopen` if needed. Because the program may load libraries afterward. — Celelibi, Dec 08 '14 at 03:34
Yep, I also think I'll need to add some update code for dlopen. Many thanks again Celelibi. I rewarded you with a bounty, as you might have seen. Have fun with the imaginary reputation ;-) I hope you'll help more people the way you did help me. Really awesome! — milianw, Dec 09 '14 at 01:20
Just in case getauxval() isn't in your glibc (it was added in glibc 2.16), here's an alternative that gets the AT_PHENT that should work on older platforms: ` struct PVTM_AUXV{ unsigned long type; unsigned long val; }; unsigned long int pvtmGetProgHeaderEnt(){ struct PVTM_AUXV auxv; int fd = open("/proc/self/auxv", O_RDONLY); if (fd != -1){ do{ if (read(fd, &auxv, sizeof(auxv)) == sizeof(auxv)){ if (auxv.type == 4 /* AT_PHENT */) { close (fd); return auxv.val; }} else{ close(fd); return 0; } } while (1); } return 0; }` — Leo, May 03 '16 at 11:13

score 4 · Answer 2 · answered Nov 28 '14 at 12:19

This cannot be done without tinkering with assembler a bit. Basically, you will have to do what gdb and ltrace do: find the virtual addresses of malloc and friends in the process image and put breakpoints at their entry points. This usually involves temporarily rewriting the executable code, as you need to replace normal instructions with "trap" ones (such as int 3 on x86).
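
To illustrate the rewriting step only, here is a minimal sketch of the word arithmetic a tracer might apply to a word read from the function's entry (for instance with `PTRACE_PEEKTEXT`) before writing it back. The helper names are hypothetical, the ptrace calls themselves are omitted, and a little-endian target such as x86 is assumed.

```c
#include <assert.h>

/* Replace the lowest-addressed byte of `word` with int3 (0xCC). */
long with_breakpoint(long word)
{
    return (word & ~0xffL) | 0xccL;
}

/* Put the original first byte back into a word that holds a breakpoint,
 * as a tracer would do before letting the function run. */
long without_breakpoint(long patched, long original)
{
    assert((patched & 0xffL) == 0xccL);
    return (patched & ~0xffL) | (original & 0xffL);
}
```

The tracer has to keep the original word around for as long as the breakpoint is installed. Stopping at the trap and restoring the byte on every call is one reason this approach is slow.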

If you want to avoid doing this yourself, there exists a linkable wrapper around gdb (libgdb), or you can build ltrace as a library (libltrace). As ltrace is much smaller, and its library variant is available out of the box, it will probably let you do what you want with less effort.

For example, here's the best part of the "main.c" file from the ltrace package:

int
main(int argc, char *argv[]) {
    ltrace_init(argc, argv);

 /*
    ltrace_add_callback(callback_call, EVENT_SYSCALL);
    ltrace_add_callback(callback_ret, EVENT_SYSRET);
    ltrace_add_callback(endcallback, EVENT_EXIT);

    But you would probably need EVENT_LIBCALL and EVENT_LIBRET
 */

    ltrace_main();
    return 0;
}

http://anonscm.debian.org/cgit/collab-maint/ltrace.git/tree/?id=0.7.3

oakad


Thanks for the hints. LTrace seems to have an extremely high overhead though. So high that it becomes impractical for me to use it. I may need to wait for the perf subsystem to support native "scripts" which I could then use to hook up to a custom userspace breakpoint... – milianw Nov 30 '14 at 13:47
You will end up in the same place. Execution tracing of any kind is rather slow, and even hardware breakpoints can slow things down very considerably. In my opinion, the only reasonably fast approach will be to scan all the modules loaded for the process and then, using their disk images as references, redo the dynamic linking process for symbols of interest (so instead of linking to malloc, the process image would now link to an accounting stub that forwards to malloc). This is not difficult per se, but the effort to get it right may be considerable. – oakad Nov 30 '14 at 14:21
So ltrace, or similarly GDB, cannot just do the rewrite for me once and then "detach"? I mean, after malloc/free were rewritten in libc, I'd expect to have no further overhead besides the additional jump and what I add in my own tool. Why is that not the case? – milianw Dec 01 '14 at 09:57
The issue of "hot" dll injection is mostly of interest to people developing exploits, so this stuff is not very visible publicly. Here's an example of "hot" symbol injector: https://github.com/ice799/injectso64 – oakad Dec 01 '14 at 10:52
Correct me if I'm wrong, but doesn't injectso64 "just" inject a shared library? That can be (trivially) accomplished using a small GDB script as well, by calling dlopen manually. Or does injectso64 also rewrite functions? That's what I'm really interested about. Maybe LTTng is what I'm looking for? – milianw Dec 01 '14 at 13:47
Hm, no LTTng seems to be about predefined trace points. I think the holy grail would be a userspace API to get access to UProbes... – milianw Dec 01 '14 at 13:58
"injectso" will rewrite the relocation entry in the running process so it will call your stuff instead of the shared library it was calling before. That's how exploits operate; gdb is not going to do anything like that. – oakad Dec 01 '14 at 15:46
I fail to see how it is doing that - can you shed some light on that? The examples all seem to rely on the `_init()` function being called when the shared library is loaded. So I think what I need to understand is how `inject_code()` in `inject.c` can be rewritten to overload `malloc` or similar to call a custom function of mine instead. – milianw Dec 01 '14 at 16:05
@milianw `inject_code()` does just enough to load your `.so` into the target process address space. Once you're in there, you're basically done since you can have the target do anything on your behalf. "injectso" has its shortcomings, but that's probably the closest you can get to having "an alternative to `LD_PRELOAD` [...] attach to a running process". Provided you also inject `ld.so` into the target if it's not already there, you may even be able to use the same `.so` file both with `LD_PRELOAD` and with "injectso". – xbug Dec 01 '14 at 16:21
(correction: "provided that you inject `libdl.so` ...") – xbug Dec 01 '14 at 16:29
@xbug: See my original question, getting into the process is easy (GDB attach and call dlopen on your .so, done). What I need to do though is overwrite the symbols of e.g. `malloc` to call my function and then delegate to the original function. With `LD_PRELOAD` the linker handles that for me. Now I'm looking for a way to do this manually when attaching to a process. If `injectso` can also do this, somehow, I'd be very interested. My knowledge of assembly is apparently not enough to understand what it is doing. – milianw Dec 01 '14 at 16:30
@milianw You're right, the injectso guys clearly omitted the bit of the code which will walk the PLT table and replace arbitrary symbols at will. They probably assumed that a worthy hacker will be able to do it on his own, like this guy here: http://shadowwhowalks.blogspot.com.au/2013/01/android-hacking-hooking-system.html – oakad Dec 01 '14 at 18:07
Haha, yeah apparently I need to learn quite a bit more before I can implement this. Many thanks so far oakad, I've upvoted you already. Should no one give me a "better" answer in the next three days, you'll get the bounty and I'll accept your answer as well. Thanks. – milianw Dec 02 '14 at 12:02
