17
Struct Node {
    Node *N[SIZE];
    int value;
};

struct Trie {
    // Was `Node *root`: findNode takes `&root` and walks it as a Node*,
    // so root must be a Node object, not a pointer (matches the full program below).
    Node root;

    // Walk the trie one symbol at a time; returns the node reached when the
    // key is exhausted, or 0 if a symbol has no child.
    Node* findNode(Key *key) {
        Node *C = &root;
        // Was `char u`: next() signals end-of-key with a negative value, but
        // plain char is unsigned on some ABIs (e.g. ARM), so `u < 0` could
        // never fire — use int to keep the sentinel intact.
        int u;
        while (1) {
            u = key->next();
            if (u < 0) return C;
         // if (C->N[0] == C->N[0]); // this line will speed up execution significantly
            C = C->N[u];             // dependent load — the hot spot discussed below
            if (C == 0) return 0;
        }
    }
    void addNode(Key *key, int value){...};
};

In this implementation of Prefix Tree (aka Trie) I found out that 90% of findNode() execution time is taken by a single operation C=C->N[u];

In my attempt to speed up this code, I randomly added the line that is commented out in the snippet above, and the code became 30% faster! Why is that?

UPDATE

Here is complete program.

#include "stdio.h"
#include "sys/time.h"

// Coarse millisecond timestamp for benchmarking.
// Only the low 16 bits of the seconds are kept so the product stays small
// enough to print/diff comfortably; the value therefore wraps every ~18 hours.
long time1000() {
  timeval now;
  gettimeofday(&now, 0);
  long secs = now.tv_sec & 0xffff;
  long msecs = now.tv_usec / 1000;
  return secs * 1000 + msecs;
}

// Streams a byte buffer as a sequence of 4-bit nibbles,
// low nibble of each byte first. Returns -1 when the buffer is exhausted.
struct BitScanner {
    void *p;            // start of the buffer (not owned)
    int count, pos;     // count = bytes available, pos = nibbles consumed

    BitScanner(void *p, int count) : p(p), count(count), pos(0) {}

    // Next nibble in [0, 15], or -1 at end of input.
    int next() {
        int byteIndex = pos >> 1;            // two nibbles per byte
        if (byteIndex >= count) return -1;
        unsigned char byte = ((unsigned char *)p)[byteIndex];
        bool highHalf = (pos++ & 1) != 0;    // odd positions read the high nibble
        return highHalf ? (byte >> 4) : (byte & 0xf);
    }
};

// 16-way trie node: one child per nibble value.
struct Node {
    Node *N[16] = {};       // child per nibble; null when that branch is absent
    __int64_t value = -1;   // stored payload; -1 marks "no value at this node"
};

struct Trie16 {
    Node root;

    bool add(void *key, int count, __int64_t value) {
        Node *C = &root;
        BitScanner B(key, count);
        while (true) {
            int u = B.next();
            if (u < 0) {
                if (C->value == -1) {
                    C->value = value;
                    return true; // value added
                }
                C->value = value;
                return false; // value replaced
            }
            Node *Q = C->N[u];
            if (Q) {
                C = Q;
            } else {
                C = C->N[u] = new Node;
            }
        }
    }

    Node* findNode(void *key, int count) {
        Node *C = &root;
        BitScanner B(key, count);
        while (true) {
            char u = B.next();
            if (u < 0) return C;
//          if (C->N[0] == C->N[1]);
            C = C->N[0+u];
            if (C == 0) return 0;
        }
    }
};

int main() {
    // BUGFIX: T was `int`, truncating the `long` returned by time1000(),
    // and the elapsed times were printed with %i (format/argument mismatch,
    // undefined behavior for a long argument) — now long + %ld throughout.
    long T = time1000();
    Trie16 trie;
    __int64_t STEPS = 100000, STEP = 500000000, key = 0;

    // Insert STEPS evenly spaced 8-byte keys.
    for (int i = 0; i < STEPS; i++) {
        key += STEP;
        trie.add(&key, 8, key + 222);   // return value intentionally ignored
    }
    printf("insert time:%ld\n", time1000() - T); T = time1000();

    // Look every key back up and verify its stored value.
    int err = 0;
    key = 0;
    for (int i = 0; i < STEPS; i++) {
        key += STEP;
        Node *N = trie.findNode(&key, 8);
        if (N == 0 || N->value != key + 222) err++;
    }
    printf("find time:%ld\n", time1000() - T);
    printf("errors:%i\n", err);
    return 0;
}
exebook
  • 32,014
  • 33
  • 141
  • 226
  • 2
    What compile flags did you use? Also did you make multiple tests, or just one? –  Jul 16 '15 at 08:52
  • 5
    Memory access speed is a common bottleneck these days where everything else is fast. Beware of those `->`, they can be very expensive. – Kerrek SB Jul 16 '15 at 08:57
  • @Aleksandar, I did multiple tests, hundreds in fact, this amused me and captured my attention for hours. I used both clang and gcc with both -O0 and -O3. – exebook Jul 16 '15 at 08:57
  • 5
    Did you try to look at assembly? – LPs Jul 16 '15 at 09:00
  • @exebook Where did you get the idea to add code in your attempt to gain speed? – JorenHeit Jul 16 '15 at 09:01
  • 2
    You should post an [MCVE](http://stackoverflow.com/help/mcve). – juanchopanza Jul 16 '15 at 09:04
  • @JorenHeit, I was debugging and wanted to count number of pointer access, and I found that the counter added for debugging improves performance significantly. – exebook Jul 16 '15 at 09:04
  • @exebook That's weird indeed! – JorenHeit Jul 16 '15 at 09:06
  • 3
    I think that improvement is not related to `dummy++`, but to `if (C->N[0] == C->N[1])`. I guess that this check causes the caches are used and data for `C = C->N[u]`are read in the cache immediately. – LPs Jul 16 '15 at 09:11
  • 2
    With small programs this is common behavior because of the instruction cache having different content. My suggestion is to compile and save both binaries and run them one after the other and then reverse the order, and compare the times. – Cobusve Jul 16 '15 at 09:12
  • What's the value/type of `SIZE`? – edmz Jul 16 '15 at 09:12
  • 1
    Maybe the CPU starts doing prefetch of `N` when it sees it being accessed multiple times? – Alexander Balabin Jul 16 '15 at 09:12
  • I created complete example, should I add it to a post? – exebook Jul 16 '15 at 09:22
  • @exebook, yes, this will give possibility to reproduce it – Petr Jul 16 '15 at 09:22
  • @Cobusve, tried that, the effect persists. – exebook Jul 16 '15 at 09:28
  • I just found out that `dummy++` part is redundant, `if (C->N[0] == C->N[1]);` alone is enough to cause the effect. – exebook Jul 16 '15 at 09:32
  • 2
    Could you try replacing the "random code" by `__builtin_prefetch` on gcc? https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html – erenon Jul 16 '15 at 09:53
  • @erenon, I replaced "random code" with `__builtin_prefetch(C->N[u]);`, and it does the same effect. In fact even 5% faster. – exebook Jul 16 '15 at 10:05
  • @exebook: Problem solved, then? – erenon Jul 16 '15 at 10:35
  • @erenon, right, but I am not sure should I accept Alexander Balabin's answer. – exebook Jul 16 '15 at 11:38
  • The key do-nothing line is different between the original post and the update. In the original post, you wrote `if (C->N[0] == C->N[0]);`, which tests a tautology and then does nothing. I'd expect the compiler to optimize that out (assuming no side effects from operator overloads). On the other hand, `if (C->N[0] == C->N[1]);` is not a tautology and could affect caching, branch prediction, speculative execution, etc. – Adrian McCarthy Jul 16 '15 at 17:35
  • @AdrianMcCarthy, but both have the same effect. – exebook Jul 16 '15 at 17:41

3 Answers3

7

This is largely a guess, but from what I have read about the CPU data prefetcher, it will only prefetch if it sees multiple accesses to the same memory region and those accesses match its prefetch triggers — for example, they look like a scan. In your case, if there is only a single access to C->N the prefetcher is not interested; however, if there are multiple accesses and it can predict that the later access lies further into the same region of memory, that can cause it to prefetch more than one cache line.

If the above was happening then C->N[u] would not have to wait for memory to arrive from RAM therefore would be faster.

Alexander Balabin
  • 2,055
  • 11
  • 13
  • Correct! As I commented the magic is made by `if (C->N[0] == C->N[1])` not by `dummy++;` – LPs Jul 16 '15 at 09:38
1

It looks like what you are doing is preventing processor stalls by delaying the execution of code until the data is available locally.

Doing it this way is very error prone and unlikely to keep working consistently. The better way is to get the compiler to do this for you. By default, most compilers generate code for a generic processor family; but if you look through the available flags you can usually find one for specifying your exact processor, so the compiler can generate more specific code (such as prefetches and stall-avoiding instruction scheduling).

See: GCC: how is march different from mtune? the second answer goes into some detail: https://stackoverflow.com/a/23267520/14065

Community
  • 1
  • 1
Martin York
  • 257,169
  • 86
  • 333
  • 562
-2

Each write operation is more costly than a read. If you look at `C = C->N[u];`, the CPU executes a write to the variable `C` in every iteration. But when you perform `if (C->N[0] == C->N[1]) dummy++;`, the write to `dummy` is executed only when `C->N[0] == C->N[1]`. So, by using the `if` condition, you have saved the CPU many write instructions.

userNishant
  • 132
  • 9