
In the constructor I have a piece of code which does the following:

code snippet 1

for (unsigned i=0; i<n; ++i) { 
    auto xIdIt = _originIdSet.find(xIds[i]);
    auto yIdIt = _originIdSet.find(yIds[i]);
    if (xIdIt == _originIdSet.end()) { _originIdSet.insert(xIds[i]); }
    if (yIdIt == _originIdSet.end()) { _originIdSet.insert(yIds[i]); }
}

_originIdSet is of the type std::unordered_set<uint32_t>
xIds and yIds are of the type std::vector<uint32_t>

xIds and yIds contain lots of duplicate entries, e.g.

xIds = {1,2,1,2,3,5,....} yIds = {2,3,4,1,4,1,....}

xIds[i] never equals yIds[i]

I'm compiling with gcc 5.30 as follows: g++ -g -Wall -m64 -O3 -std=c++11

I've been profiling how much time this piece of code (i.e. code snippet 1) takes when n equals 10k^2, and I've found that if I change the code to:

code snippet 2

for (unsigned i=0; i<n; ++i) { 
    // We cannot insert duplicates in a set so why check anyway
    _originIdSet.insert(xIds[i]);
    _originIdSet.insert(yIds[i]);
}

It will be around 5 seconds slower (the total run with code snippet 1 takes around 15 seconds).

I'm wondering what the underlying reason for the performance decrease is. My first guess would be that this is caused by branch prediction (excellent explanation here); however, it doesn't make sense to me that in this situation branch prediction would only be applied if an if/else condition is used. Hopefully somebody can clarify what is happening here.

Gio
  • 5 seconds slower than what? Also try making 2 separate loops, one for each array. – Sopel Aug 03 '17 at 10:17
  • 5 seconds slower than the first code snippet. I don't see any underlying reason for using 2 separate loops, please clarify. – Gio Aug 03 '17 at 10:19
  • n is undefined. – Tatsuyuki Ishi Aug 03 '17 at 10:19
  • n is not undefined, as I'm explaining above, n equals 10k. Obviously I'm not showing the entire original code as this would not be relevant for the question. – Gio Aug 03 '17 at 10:20
  • 3
    _5 seconds slower_: if the first version take 1000 seconds, then 5 seconds slower is not a big deal, but if the first version takes 10 seconds then it _is_ a big deal. Please be more specific. – Jabberwocky Aug 03 '17 at 10:22
  • I cannot reproduce your generated assembly as GCC seems to have gained the ability to optimize empty allocations out. Please add your input code. – Tatsuyuki Ishi Aug 03 '17 at 10:23
  • 3
    @Gio 5s compared to let's say 24h run is a negligible difference due to random factors. That is what the question of Sopel was about. 5s is a meaningless piece of info on it's own. – luk32 Aug 03 '17 at 10:24
  • Out of curiosity, have you tried using _originIdSet.insert( xIds.begin(), xIds.end() ) and _originIdSet.insert( yIds.begin(), yIds.end() ) ? – Alessandro Teruzzi Aug 03 '17 at 10:27
  • @Aless, Yes I've also tried that, it is equally slower. – Gio Aug 03 '17 at 10:30
  • I'm voting to close this as too broad. I tried to reproduce on [godbolt](https://godbolt.org/g/uj7ksf) but basically got shorter assembly for the second one, which means OP's behavior is probably not reproducible. – Tatsuyuki Ishi Aug 03 '17 at 10:33
  • 1
    @TatsuyukiIshi 1+ OP should really create an example, that people can copy-paste and analyze, that includes the data. Otherwise, it's just wild guessing. I can imagine `find` version being faster when there are dupes, because finding might be faster than hashing, especially if there is a few distinct elements. That wouldn't show in the assembly, but it's impossible to verify. – luk32 Aug 03 '17 at 10:36
  • 1
    To the OP: use a profiler, probably based on hardware counter. `perf` can check if your guess about branch prediction is correct or not, by showing the number of branch hit and miss. – Tatsuyuki Ishi Aug 03 '17 at 10:38
  • @luk, unfortunately I cannot include all the data, as this comes from a matrix with 10.000 * 10.000 entries, however I've edited the question in order to provide some more clarifications regarding the input. I understand this question is not 100% reproducible, however I hoped there was a logical explanation and that this should not be necessary (perhaps I'm wrong in that). – Gio Aug 03 '17 at 10:42
  • I created a reproducible example with data and time measurement here http://ideone.com/poQcPV (with a check before insert) and here http://ideone.com/TWjWKu (without a check before insert). I see a ~5% time difference: 3200ms vs 3400ms. – knst Aug 03 '17 at 11:29
  • @Gio 2 loops so you don't jump as much through memory, better for cache – Sopel Aug 03 '17 at 11:33
  • And if you replace the two arrays xIds and yIds with one array, the time is the same for both cases. – knst Aug 03 '17 at 11:47
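
For reference, the range-insert variant suggested in the comments above (a minimal sketch using the names from the question; the OP reports it as equally slow as the unchecked per-element insert):

    // Range-insert variant from the comments: let unordered_set deduplicate
    // whole ranges at once instead of inserting element by element.
    _originIdSet.insert(xIds.begin(), xIds.end());
    _originIdSet.insert(yIds.begin(), yIds.end());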

1 Answer


Here are two samples:

#include <unordered_set>
#include <vector>
#include <chrono>
#include <cstdlib>   // for rand()
#include <iostream>

using namespace std;

int main() {
    unsigned n = 10000000;
    unordered_set<uint32_t> _originIdSet;
    vector<uint32_t> xIds, yIds;
    for (unsigned i = 0; i < n; ++i) {
        xIds.push_back(rand() % n);
        yIds.push_back(rand() % n);
    }

    auto start = chrono::steady_clock::now();
    for (unsigned i = 0; i < n; ++i) {
        auto xIdIt = _originIdSet.find(xIds[i]);
        auto yIdIt = _originIdSet.find(yIds[i]);
        if (xIdIt == _originIdSet.end()) { _originIdSet.insert(xIds[i]); }
        if (yIdIt == _originIdSet.end()) { _originIdSet.insert(yIds[i]); }
    }

    auto end = chrono::steady_clock::now();
    std::cout << "inserted " << _originIdSet.size() << " to " << chrono::duration_cast<chrono::milliseconds>(end-start).count() << " milliseconds" << std::endl;
}

And without any checking:

    for (unsigned i = 0; i < n; ++i) {
        _originIdSet.insert(xIds[i]);
        _originIdSet.insert(yIds[i]);
    }

I see a ~5% time difference: 3200ms vs 3400ms.

But if I replace the two arrays xIds and yIds with one array that contains all the elements from both, I get the same time for both cases, and it works more slowly than with two arrays.
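
A sketch of that one-array variant (the exact layout isn't stated above, so concatenating xIds followed by yIds is an assumption), placed where the timed loop is in the sample:

    // Assumed "one array" variant: concatenate both id vectors and insert
    // every element from the single combined vector.
    vector<uint32_t> allIds;
    allIds.reserve(2 * n);
    allIds.insert(allIds.end(), xIds.begin(), xIds.end());
    allIds.insert(allIds.end(), yIds.begin(), yIds.end());

    for (unsigned i = 0; i < allIds.size(); ++i) {
        _originIdSet.insert(allIds[i]);
    }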

I think the general reason is that the find() method doesn't throw any exceptions and is const. A modern CPU is able to execute more than one instruction per cycle for each thread, so the two find() calls can effectively run in parallel without it being obvious.

Or, as a second possibility, gcc is able to generate SSE code that calculates the two hash values and searches the two positions in the set at the same time for the two calls of find().

knst