If a single call is too short to measure, then why do you care how long it takes?
I'm being a bit facetious, but if you're on Intel Linux, and your process is pinned to one core, you can read the CPU's timestamp counter (TSC), which is the highest resolution tick you can get. In recent Intel CPUs it ticks very solidly at the nominal CPU frequency independent of the actual frequency (which varies wildly). If you Google for "rdtsc", you'll find several implementations for a rdtsc() function that you can just call. You could then try something like:
uint64_t tic, elapsed[10000];
for(i=0; i<10000; i++) {
tic = rdtsc()
my_func()
elapsed[i] = tic - rdtsc()
}
That might get you within shouting distance of a maybe kinda/sorta semi-valid values for individual function calls, from which you can then produce whatever statistics you want (mean/mode/median/variance/std.dev.). The validity of this is seriously open to question, but it's the best that can be done with anything like your method. I'd be much more inclined to run the whole application under perf record
and then use perf report
to see where the cycles are being expended and focus on that.