
Background

Consider a hashtable with closed addressing (separate chaining):

// After inserting 16

|*  0 *| --> {499996, 4}
|*  1 *| --> {499994, 6}
|*  2 *|
|*  3 *| --> {499987, 13} --> {499998, 2}
|*  4 *|
|*  5 *| --> {499984, 16} --> {499986, 14}
|*  6 *|
|*  7 *| --> {499999, 1}
|*  8 *| --> {499985, 15} --> {499997, 3}
|*  9 *| --> {499989, 11} --> {499995, 5}
|* 10 *| --> {499988, 12} --> {500000, 0}
|* 11 *| --> {499993, 7}
|* 12 *|
|* 13 *| --> {499990, 10}
|* 14 *| --> {499991, 9} --> {499992, 8}
|* 15 *|

Typically, there are two different schemes for linking through all the nodes. One is to link the last node of each bucket to the first node of the next non-empty bucket; look at the picture above and imagine such links running between the buckets. The other is to leave the nodes exactly as pictured, i.e. to terminate each bucket's chain with a null pointer. A sketch of the two layouts is given below.
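
To make the two schemes concrete, here is a minimal sketch of the node and iterator layouts, assuming nothing beyond the description above; the type and member names are illustrative, not taken from the linked implementations:

#include <cstddef>
#include <utility>

struct Table;  // stands in for the hash table type

// Scheme 1: a single list threads through ALL the elements, so a node's
// next pointer may jump from the tail of one bucket to the head of the
// next non-empty bucket; the iterator needs only the node pointer.
struct Node1 {
    std::pair<const int, int> kv;
    Node1* next;              // next element anywhere in the table
};

// Scheme 2: each bucket keeps its own null-terminated chain, so the
// iterator must also carry the bucket index and a handle to the table
// in order to move on to the next non-empty bucket.
struct Node2 {
    std::pair<const int, int> kv;
    Node2* next;              // next element in the SAME bucket, or nullptr
};

struct Iterator2 {
    Node2*       ptr;
    std::size_t  bucket_index;
    const Table* table;       // used to scan the bucket array
};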

The second approach requires some extra work to obtain the successor of the current node: it may have to scan many (empty) buckets before it finds one. Sounds awful, huh? Actually, no. By amortized analysis (assuming max_load_factor == 1), ideally each node sits in its own bucket; over a full traversal the bucket-scanning index only ever moves forward, so the total work is O(n + bucket_count), i.e. O(1) amortized per increment. So it can be efficient.

How about the first one? The bad news is that it can suffer from terrible insertion overhead. Imagine a very big hashtable in which tab[i] and tab[j] are two adjacent non-empty buckets with a very large gap |i-j|, and an item whose hash code k lies between i and j has to be inserted. We then need to scan all the way from k to j in order to update the links of three nodes: the last node of bucket i, the newly inserted node, and the first node of bucket j (you can think of this process as insertion into a doubly-linked list; see the sketch below). And if we keep performing this pathological operation, inserting at j-1, i+1, j-2, i+2, ..., things can get really bad.
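
Under the doubly-linked reading above, the costly splice might look like the following sketch (all names are hypothetical, and the global-tail case is glossed over):

#include <cstddef>
#include <vector>

struct Node {
    int   key, value;
    Node* prev;
    Node* next;
};

// Scheme-1 insertion splice: after hanging the new node n into bucket k,
// find its neighbours in the global list. The while loop is the O(|k - j|)
// scan described above.
void link_into_global_list(std::vector<Node*>& buckets, Node* n, std::size_t k) {
    std::size_t j = k + 1;
    while (j < buckets.size() && buckets[j] == nullptr)
        ++j;                                                  // scan for the next non-empty bucket

    Node* succ = (j < buckets.size()) ? buckets[j] : nullptr; // first node of bucket j
    Node* pred = succ ? succ->prev : nullptr;                 // last node of bucket i
    // (if no non-empty bucket follows k, pred is really the global tail;
    //  finding it would need a stored tail pointer, omitted here)

    // ordinary doubly-linked-list insertion between pred and succ
    n->prev = pred;
    n->next = succ;
    if (pred) pred->next = n;
    if (succ) succ->prev = n;
}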

The Problem

After building the hashtable, we iterate through all the elements. Which do you think is faster: the all-linked-up one (1st case) or the only-linked-within-each-bucket one (2nd case)?

Here is a piece of test code; the whole code can be found here.

#include <ctime>
#include <iostream>
#include <unordered_map>
// (the mySymbolTable headers come from the full code linked above)

int main()
{
#if   defined(USE_HASHMAP)
    mySymbolTable::HashMap<int, int> mp{};
#elif defined(USE_HASHMAP2)
    mySymbolTable::alternative::HashMap<int, int> mp{};
#else
    std::unordered_map<int, int> mp{};
#endif

    constexpr int N = 1'000'000;
    int k = 0;

    for (int i = 0; i < N; ++i)
        mp[N - i] = i;

    auto t1 = std::clock();
    for (auto it = mp.begin(); it != mp.end(); ++it)
        ++k;
    auto t2 = std::clock();

    // convert clock ticks to milliseconds; using CLOCKS_PER_SEC keeps this portable
    auto iteration_time_elapsed = 1000.0 * (t2 - t1) / CLOCKS_PER_SEC;
    std::cout << "Iterating through " << k << " items used "
              << iteration_time_elapsed << "ms\n";
    return 0;
}
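
As an aside, clock()'s resolution is platform-dependent. A std::chrono variant of the timing section would look like the sketch below; as the comments at the bottom note, switching clocks did not change the outcome:

#include <chrono>

auto t1 = std::chrono::steady_clock::now();
for (auto it = mp.begin(); it != mp.end(); ++it)
    ++k;
auto t2 = std::chrono::steady_clock::now();

double iteration_time_elapsed =
    std::chrono::duration<double, std::milli>(t2 - t1).count();
std::cout << "Iterating through " << k << " items used "
          << iteration_time_elapsed << "ms\n";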

Essentially, they run the following code, respectively:

_self& operator++() {
    _ptr = _ptr->_next;
    return *this;
}

vs

_self& operator++() {
    auto next = _table->next_entry(_ptr, _bucket_index);
    _ptr = next.first;
    _bucket_index = next.second;
    return *this;
}

std::pair<node_ptr, size_t> next_entry(node_ptr x, size_t bucket_index) const {
    assert(x != nullptr && "cannot increment end() iterator");
    // more nodes left in this bucket's chain
    if (x->_next) return { x->_next, bucket_index };
    // otherwise scan forward for the next non-empty bucket
    size_t i = bucket_index + 1;
    for (; i < _hashtable.size(); ++i) {
        if (_hashtable[i]) return { _hashtable[i], i };
    }
    // no more entries: this becomes the end() iterator
    return { nullptr, i };
}

The bizarre thing is that in debug mode (-g flag), the 1st case beats the 2nd by about one fifth (t1 = 82% * t2), which is expected; but with optimization enabled (-O3 flag), it's a whole different story. The summary tables are given below (or you can view the screenshot here).

Debug (-g):

test_d      test1_d     test2_d     test_int_d  test_int1_d[1]  test_int2_d  test_int_hash_d  test_int_hash1_d  test_int_hash2_d
114.993ms   144.856ms   175.616ms   27.138ms    n/a             58.631ms     100.589ms        138.219ms        168.292ms

Optimized (-O3):

test        test1       test2       test_int    test_int1       test_int2    test_int_hash    test_int_hash1   test_int_hash2
108.126ms   141.825ms   32.269ms    5.564ms     5.777ms         5.032ms      93.445ms         139.476ms        31.251ms

Here, the _d suffix means the debug version, the *1 suffix means myst::HashMap (1st case), the *2 suffix means myst::alternative::HashMap (2nd case), and no suffix means std::unordered_map.

We can see from the results that -O3 barely boosts the performance of the 1st case (std::unordered_map and myst::HashMap), except in the test_int cases, but it boosts the 2nd case (myst::alternative::HashMap) significantly. Given that the 2nd case is a little more complicated, it is expected to leave the compiler more room to optimize, for example through inlining. Still, I don't think it should be able to beat the 1st case's small code. How do we explain this weird phenomenon?

(In my understanding, small code means fast; think about the cache. Technically speaking, the 2nd case should never be faster than the 1st one.)

Any comments are appreciated.

[1] It took a painfully long time, even for the optimized version, to build the table. View the results here. It seemed to have something to do with the hash function; I used another hash function found on SO and it worked well.

Mark Taylor
  • there isn't a direct correlation between code size and speed; making code bigger (e.g. loop unrolling) can make it faster – Alan Birtles Mar 03 '22 at 12:18
  • That's not much different from your [previous, now deleted question](https://stackoverflow.com/questions/71252286). Your idea that _"small code means fast"_ is still wrong. E.g. the code for bubble sort is _much_ smaller than for quicksort, but quicksort is much faster. And yes, optimization can do crazy things. – Lukas-T Mar 08 '22 at 07:07
  • @churill Thanks for your comment. Sorry that I didn't make it clear: by "small code means fast" I of course meant under the same algorithmic complexity, or using the same algorithm. It wouldn't make sense to associate code size with performance across different algorithms. But here the two hash tables were constructed using the same algorithm, and if you debug them (use `mp.print()`) you can see the same structure. We're here to compare the iterating speed of the two node-linking schemes, and the results turned out quite surprising in optimized mode. – Mark Taylor Mar 08 '22 at 15:15
  • @churill What optimization technique(s) the compiler uses to make this happen really amazes me, and I'm seeking explanations. (https://en.wikipedia.org/wiki/Optimizing_compiler) At first I thought it was possibly because of CPU instruction pipelining, instruction reordering/rescheduling, etc. But wait a minute, I only have "one" simple instruction (`_ptr = _ptr->_next;`) in the 1st case and at least a few more in the 2nd. I just couldn't figure it out. – Mark Taylor Mar 08 '22 at 15:18
  • The compiler being bad at optimizing the simple instruction (though there isn't much room for optimization, compared to the 2nd case) is really lame. I believe there must be a rational explanation for it. BTW, the 1st-case iterator only contains a node pointer, whereas the 2nd contains a table pointer, a bucket index, and a node pointer. I'm not sure if this is the problem; the fat iterator shines :) – Mark Taylor Mar 08 '22 at 15:42
  • @AlanBirtles Thanks for pointing this out. But I never intended to associate code size with speed; maybe it was because of the title (a bit confusing, but I haven't figured out a better one. Hashtable-iterating scheme comparison??). Anyway, all I was/am confused about is why the all-nodes-linked-up iteration is slower than the per-bucket-linked iteration. How come? How is that possible? I also made a few other comments which may give some helpful information; please check them out. Thank ya! If you have any further questions, please comment. – Mark Taylor Mar 09 '22 at 06:29
  • @MarkTaylor Yes, of course, that example was a bit exaggerated :) Maybe you could also fine-tune your measurement. `clock()` isn't exactly reliable. Maybe [this question](https://stackoverflow.com/questions/14337278/precise-time-measurement) points out some alternatives. There are also other effects like cache/CPU warmup, so you could try to iterate twice but only measure the second pass. – Lukas-T Mar 09 '22 at 07:23
  • @churill Thanks for the advice. I just tested clocking the 2nd and 3rd iterations and used a different clock, `chrono::high_resolution_clock`; still, it didn't change anything. BTW, IMHO the most accurate way to clock a code snippet is to use a system call. A few days ago, while searching for fast string-sorting algorithms, I found https://github.com/rantala/string-sorting/blob/master/src/util/timing.c which may be more accurate for timing, but I haven't tested it yet. – Mark Taylor Mar 09 '22 at 08:13

0 Answers