Readability should almost always come first. Treat this as a "last resort" kind of optimisation that is unlikely to buy you a significant performance gain.
Today's CPUs cache instructions as well as data. In general, you should optimise the layout of the data and the memory access patterns, but the way in which instructions are arranged also matters for the utilisation of the instruction cache.
Calling a non-inlined function is in effect an unconditional jump, much like a jmp instruction, except that the return address is saved first. This jump makes the CPU start fetching instructions from another (possibly far) location in memory. If this new location isn't found in the instruction cache, the CPU will stall until the corresponding memory is brought there. In theory, if the code contains no jumps and branches, the CPU could prefetch instructions as aggressively as possible.
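To make the difference concrete, here is a minimal sketch of the same loop written against an out-of-line function and against an inlinable one. The names (NEVER_INLINE, square_out_of_line, square_inline and the two sum functions) are made up for illustration, and the noinline attributes are compiler extensions (GCC/Clang and MSVC), not standard C++.

```cpp
#include <cstdint>
#include <vector>

// Compiler-specific spelling of "never inline this function".
#if defined(_MSC_VER)
#define NEVER_INLINE __declspec(noinline)
#else
#define NEVER_INLINE __attribute__((noinline))
#endif

// Hypothetical helper forced out of line: every use is a real call and return.
NEVER_INLINE std::uint64_t square_out_of_line(std::uint64_t x) { return x * x; }

// Same body, but the optimiser is allowed to paste it into the caller.
inline std::uint64_t square_inline(std::uint64_t x) { return x * x; }

std::uint64_t sum_squares_call(const std::vector<std::uint64_t>& v) {
    std::uint64_t total = 0;
    for (std::uint64_t x : v) {
        total += square_out_of_line(x);  // control jumps to the helper and back
    }
    return total;
}

std::uint64_t sum_squares_inline(const std::vector<std::uint64_t>& v) {
    std::uint64_t total = 0;
    for (std::uint64_t x : v) {
        total += square_inline(x);  // straight-line code if the compiler inlines it
    }
    return total;
}
```

Both functions return the same result. If you compile with optimisation and look at the generated assembly (for example with -S), I would expect a call instruction inside the first loop and none in the second, although the compiler is free to decide otherwise for the inline version.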
Also, you never really know how far is "too far". Jumping a few kilobytes forwards or backwards might well be a cache hit, since a typical L1 instruction cache today is about 32 kilobytes per core.
It's a very tricky optimisation to do right, and I would advise you to look at your data layout and memory access patterns first.
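As a rough illustration of what I mean by memory access patterns, the sketch below sums the same matrix twice: once walking memory in order and once jumping a whole row between accesses. The function names and the flat row-by-row storage are my own assumptions for the example, not anything from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The matrix is assumed to be stored row by row in one contiguous buffer.
long long sum_row_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            total += m[r * cols + c];  // consecutive addresses, cache friendly
        }
    }
    return total;
}

long long sum_column_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            total += m[r * cols + c];  // each access is cols elements away from the last
        }
    }
    return total;
}
```

The two functions compute the same sum. On a matrix much larger than the data cache, the second order tends to miss far more often, which is usually a bigger effect than anything you gain from avoiding a function call.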
The other concern is the overhead of passing the arguments on the stack or in registers. With today's CPUs this is less of a problem, since the active part of the stack is usually "hot" in the data cache, and register renaming can even reduce register-to-register moves to a no-op.
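For a feel of where argument passing does and doesn't cost anything, here is a small sketch. The types Point and Big and the three functions are hypothetical, and exactly which arguments travel in registers depends on the platform's calling convention, so treat the comments as typical behaviour rather than a guarantee.

```cpp
#include <array>
#include <cstddef>

struct Point { double x; double y; };            // 16 bytes
struct Big { std::array<double, 128> values; };  // 1 KiB

// Small arguments like these are cheap to pass; many 64-bit calling
// conventions can keep them in registers.
double dot(Point a, Point b) { return a.x * b.x + a.y * b.y; }

// Too large for registers: passing by value copies the whole 1 KiB object.
double sum_by_value(Big b) {
    double total = 0.0;
    for (double v : b.values) total += v;
    return total;
}

// Passing by const reference hands over only an address.
double sum_by_reference(const Big& b) {
    double total = 0.0;
    for (double v : b.values) total += v;
    return total;
}
```

Even the by-value copy usually lands in memory that is already in the data cache, which is why I would not worry about this unless a profiler points at it.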