
I want to explain my question through a practical problem I encountered in my project.

I am writing a C library (which behaves like a programmable vi editor), and I plan to provide a series of APIs (more than 20 in total):

void vi_dw(struct vi *vi);     /* dw: delete word */
void vi_de(struct vi *vi);     /* de: delete to end of word */
void vi_d0(struct vi *vi);     /* d0: delete to start of line */
void vi_d_eol(struct vi *vi);  /* d$: '$' is not valid in a C identifier */
...
void vi_df(struct vi *vi, char target);  /* df<target>: delete up to target */
void vi_dd(struct vi *vi);     /* dd: delete line */

These APIs do not perform core operations; they are just wrappers. For example, I can implement vi_de() like this:

void vi_de(struct vi *vi){
    vi_v(vi);  // enter visual mode
    vi_e(vi);  // press key 'e'
    vi_d(vi);  // press key 'd'
}

However, if the wrappers are as simple as this, I have to write more than 20 similar wrapper functions.
So I am considering implementing more complex wrappers to reduce their number:

void vi_d_move(struct vi *vi, vi_move_func_t move){
    vi_v(vi);
    move(vi);
    vi_d(vi);
}

static inline void vi_dw(struct vi *vi){
    vi_d_move(vi, vi_w);
}

static inline void vi_de(struct vi *vi){
    vi_d_move(vi, vi_e);
}
...

The function vi_d_move() is a better wrapper: it can turn a whole group of similar move operations into APIs, but not all of them. vi_f(), for example, needs another wrapper that takes a third argument, char target.

That ends the example picked from my project.
The pseudocode above is simpler than the real case, but it is enough to show that:
The more complex the wrapper is, the fewer wrappers we need, and the slower they will be (they become more indirect or have to handle more conditions).

There are two extremes:

  1. use only one wrapper, complex enough to accommodate all move operations and convert them into the corresponding APIs.

  2. use more than twenty small and simple wrappers: one wrapper per API.

For case 1, the wrapper itself is slow, but it has a better chance of staying resident in the cache, because it is executed often (all APIs share it). It is a slow but hot path.

For case 2, the wrappers are simple and fast, but each has less chance of being cache resident. At the very least, the first call to any API will cause a cache miss (the CPU has to fetch its instructions from memory, not from L1 or L2).

Currently I have implemented five wrappers, each relatively simple and fast. This seems to be a balance, but only seems: I chose five simply because I felt the move operations divide naturally into five groups. I have no idea how to evaluate this. I don't mean with a profiler; I mean, in theory, what main factors should be considered in such a case?

At the end of the post, I want to add more details about these APIs:

  1. These APIs need to be fast, because the library is designed as a high-performance virtual editor. The delete/copy/paste operations are designed to approach the speed of bare C code.

  2. A user program based on this library seldom calls all of these APIs, only some of them, and usually no more than 10 times each.

  3. In the real code, each of these simple wrappers is about 80 bytes, and even merged into a single complex one they would be no more than 160 bytes (but would introduce more if-else branches).

  4. As for the situations in which the library is used, I will take lua-shell as an example (a little off-topic, but some people wanted to know why I care so much about its performance):

lua-shell is a *nix shell that uses Lua as its script language. Its command-execution unit (which does fork(), exec(), ...) is just a C module registered into the Lua state machine.

Lua-shell treats everything as lua .

So, when the user inputs:

local files = `ls -la`

and presses Enter, the input string is first sent to lua-shell's preprocessor, which converts the mixed syntax into pure Lua code:

local file = run_command("ls -la")

run_command() is the entry point of lua-shell's command-execution unit, which, as I said before, is a C module.

We can talk about libvi now. lua-shell's preprocessor is the first user of the library I am writing. Here is the relevant code (pseudo):

#include "vi.h"

vi_loadstr(vi, "local files = `ls -la`");
vi_f(vi, '`');               /* find the opening backtick */
vi_x(vi);                    /* delete it */
vi_i(vi, "run_command(\"");  /* insert the call's opening text */
vi_f(vi, '`');               /* find the closing backtick */
vi_x(vi);                    /* delete it */
vi_a(vi, "\")");             /* append the call's closing text */

The code above is part of lua-shell's preprocessor implementation. After generating the pure Lua code, it feeds the code to the Lua state machine and runs it.

A shell user is sensitive to the time between pressing Enter and seeing a new prompt, and in most cases lua-shell has to preprocess scripts that are larger and have more complicated mixed syntax than this.

This is a typical situation where libvi is used.

weiweishuo
    Are you coding a real [source code editor](https://en.wikipedia.org/wiki/Source_code_editor) (then performance don't matter as much) or what is your library really doing? Why is it so performance critical? Did you *benchmark it* (with optimizations enabled at compile time) and *profile it* ? Please **edit your question** to explain more the real motivation and what application do you have in mind. – Basile Starynkevitch Jul 05 '17 at 09:51
    Organise your code in the way that makes it most understandable and easy to maintain. Unless your code is executed very frequently e.g. 60 times a second because it has to be done for every graphics frame, the time you save over the whole life of the program by optimising for the instruction cache is likely to be less *in total* than the time you spend doing the optimising and subsequent extra maintenance time. For example, if you shave a microsecond off by spending an hour optimising, your code has to execute 3.6 billion times to get that hour back. – JeremyP Jul 05 '17 at 10:13
    What @JeremyP says, plus it's actually worse than that. In order to avoid an optimization attempt having a negative effect in future, you may well have to retest with every new version of the compiler, OS, hardware environment etc, so Jeremy's example 'standard hour' of design/test/debug/measure has to be continually repeated :( – ThingyWotsit Jul 05 '17 at 10:41
    Why is this tagged C++? `void vi_dw(struct vi *vi);` is clearly C, in C++ it would have been `void vi::dw( )`. – MSalters Jul 05 '17 at 11:52
  • Don't spam tags! C++ is not C. – too honest for this site Jul 05 '17 at 13:49

1 Answer


I wouldn't care that much about cache misses (especially in your case), unless your benchmarks (with compiler optimizations enabled, i.e. compiled with gcc -O2 -mtune=native if using GCC...) indicate that they matter.

If performance matters that much, enable more optimizations (perhaps compiling and linking your entire application or library with gcc -flto -O2 -mtune=native, that is, with link-time optimization), and hand-optimize only what is critical. You should trust your optimizing compiler.

If you are in the design phase, consider perhaps making your application multi-threaded or somehow concurrent and parallel. With care, this could speed it up more than cache optimizations.

It is unclear what your library is about and what your design goals are. One way to add flexibility might be to embed some interpreter (like Lua or Guile or Python, etc.) in your application, hence configuring it through scripts. In many cases such an embedding can be fast enough (especially when the application-specific primitives are of a high enough level). Another (more complex) possibility is to provide metaprogramming abilities, perhaps through some JIT-compiling library like libjit or libgccjit (so you would sort-of "compile" user scripts into dynamically produced machine code).

BTW, your question focuses on instruction cache misses. I would believe that data cache misses are more important (and less optimizable by the compiler), which is why you would prefer e.g. vectors to linked lists (and, more generally, care about low-level data structures, favoring sequential, cache-friendly accesses).

(you could find a good video by Herb Sutter which explains that last point; I forgot the reference)

In some very specific cases, with recent GCC or Clang, adding a few __builtin_prefetch calls might slightly improve performance (by decreasing cache misses), but it can also harm performance significantly, so I don't recommend using it in general; but see this.

Basile Starynkevitch
  • Thanks. Maybe what I want to know is not how to optimize it, but how others will handle such case. Your " I won't care" is also a good reply for me. PS: I updated this library's application situations in *item 4* in the post end. – weiweishuo Jul 05 '17 at 10:57