You can do this before runtime by static analysis, or at runtime by dynamic analysis.
You have only two ways to do this at runtime, and they both amount to instrumenting code:
Instrument the source code
Modify ("instrument") the source code so it keeps track of what you want, e.g., modify
every call site to pass an additional argument effectively containing the calling function name, and every function entry to record the entry to that function (e.g., effectively the called function name) and the calling function name. The instrumentation process needs to build a cross-reference from points that are instrumented, back to source file and line to enable later reporting.
You can do this in a completely automated way by using a Program Transformation System (PTS), which can parse source code to make ASTs, modify the ASTs, and then regenerate source. Often a PTS will let you write the changes as source-level patterns of the form, "if you see this, replace it by that". (No, regexes can't do this).
An extended example of how do this can be found in my technical paper, Branch Coverage for Arbitrary Languages. The paper shows discusses what is in a PTS, and then shows typical instrumentation transformations.
Getting this done "right" for a language like C can be difficult, because you need as a foundation, a full C parser, which is a major technical feat in its own right. Best to get such a beast in completed form (some of the PTS have this, consider Concinnelle or DMS) or you'll never get around to doing the instrumentation part. (Another answer suggests that GCC has much of this instrumentation functionality built-in).
Done reasonably, the execution cost of the additional instrumentation is pretty modest (10-30%). This concept has been used to implement test coverage tools and timing profilers, which operate by tracking the dynamic set of calls, which seems to be exactly what you want.
Instrument the object code
Modify the execution engine that runs the code to keep track of the details you want. Given that you want to do this with C programs, the execution engine is essentially the underlying instruction set.
Since you can't reasonably modify the processor chip itself, you'll have to either:
modify the object code (using object-code equivalents of program transformations, see preceding paragraph for concepts, or see tools like PIN and Valgrind for details of implementation),
or write some kind of clever instruction set interpreter that collects the information it needs as it encounters call-instructions (this is likely to be rather slow, because of its interpretive nature).
Implementing these for real instruction sets is complicated by the instruction sets themselves (Intel's X86 instruction set is huge), object file formats, etc. Don't expect this path to be easier than the source instrumentation path; just expect it to have a different set of problems.
One standard problem is knowing what to instrument; the object code space is full of code (libraries, system routines, ...) that aren't part of the application of interest, and thus aren't interesting to collect test coverage data for. For this, you need a symbol table ( pairs) to tell what is application code, and what is not. You also need to be able to trace from the name to a point in the original source file(s),
in order to report the results.
One other difficulty with object code instrumentation is that it is difficult to tell if a function foo, which has been inlined, has been executed ("covered"). The source instrumentation scheme does not have this problem, because foo gets instrumented before it gets compiled/inlined, so the instrumentation gets inlined too.
Threads
Handling threads is an orthogonal issue. At each place where you record a fact X-calls-Y, you simply attach the thread ID (or its equivalent) T, obtained from the OS, to produce X-calls-Y-in-T. It isn't clear from your question why you want to do this.
Static Analysis
You could decide to skip all the hassles of runtime implementation, and build/get a static analyzer that can produce a call graph. Getting of these for C might be possible using Clang or DMS. Building it yourself is likely difficult; now you need a full C parser, complete data flow and points-to analysis, and call-graph construction. The details won't fit in this paragraph.